---
layout: post
title: Adventures in XML Parsing
---

I think pretty much everyone has realized at this point that XML is not very
easy to parse. With a little documentation and a helpful parsing library it
should be, at the very least, manageable, right?

That's what I thought when I attempted to write a TMX parser for the first
time. I quickly found out how much of a pain it is to parse XML even with the
format documentation right in front of me and a robust library to work with.

Seemingly random blank tags
---------------------------

I think the main issue with the XML standard is just how many quirks a file can
possibly have. Things like this:

```xml
<doc>
        <element>data</element>
        <!-- There's a blank tag here! -->
        <element>more data</element>
</doc>
```

That blank tag stands in for the whitespace that *supposedly* exists there. When
LibXML2 detects it, a blank `<text>` tag is placed between two otherwise valid
XML tags. When the file is parsed from Python, JavaScript, or another language
where pointers are essentially nonexistent, this should never come up. When
parsed with a language like C, however...well

```sh
zsh: segmentation fault (core dumped)   ./test
```

So, like any good developer, I ran it under Valgrind, and soon discovered that
this was no ordinary memory fault.

```sh
==2373696== Invalid read of size 8
==2373696==     by _parse_layer(void*)
==2373696==     at xmlStrEqual(nodePtr*, xmlChar*)
==2373696==  Address 0x0 not stack'd, malloc'd or (recently) free'd
```

Now at this point, I'm thinking "Wait, the bug is in LibXML? That can't be
right." GDB, with liberal use of `bt` and `print`, pointed at the same result:
the bug resided in LibXML. None of this made sense in the slightest. Why on
earth would a professionally written software library that is essentially a
standard fixture on many Unix-like systems have a major memory bug in it? The
answer did not reveal itself until I observed the program with hardware
watchpoints.

```sh
(gdb) watch *node
Hardware watchpoint 2: *node
(gdb)
...

Hardware watchpoint 2: node

Old value = (nodePtr *) 0x5555...
New value = (nodePtr *) 0x0
```

There you are! This was that pesky `<text>` tag. Apparently any, and I do mean
*any*, whitespace detected by LibXML, including the space inserted by my level
editor, produces this strange `<text>` tag that seems to be there for no clear
reason. Three new helper functions and judicious use of
`nodePtr = nodePtr->next` later, that problem was solved.
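
For illustration, here is a minimal sketch of what one of those helpers might look like. It is not the code from my parser: `struct node`, `skip_to_element` and `next_element` are hypothetical stand-ins so the example compiles without LibXML2, and only the type values (1 for elements, 3 for text) mirror LibXML2's `XML_ELEMENT_NODE` and `XML_TEXT_NODE`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few xmlNode fields this sketch needs. */
enum node_type { NODE_ELEMENT = 1, NODE_TEXT = 3 };

struct node {
        enum node_type type;
        const char *name;   /* "text" for text nodes, the tag name otherwise */
        struct node *next;  /* next sibling, NULL at the end of the list */
};

/* Return the first element at or after n, skipping any text nodes. */
struct node *skip_to_element(struct node *n)
{
        while (n != NULL && n->type != NODE_ELEMENT)
                n = n->next;
        return n;
}

/* Return the next element sibling after n, or NULL if there is none. */
struct node *next_element(struct node *n)
{
        assert(n != NULL);
        return skip_to_element(n->next);
}
```

The important part is that the caller checks for `NULL` before touching the result. LibXML2 also ships `xmlNextElementSibling` and `xmlIsBlankNode` for this kind of traversal, and the `XML_PARSE_NOBLANKS` parser option is meant to drop ignorable blank nodes at parse time, so hand-rolled helpers may not be needed at all.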

Comma separated values *inside* XML tags
----------------------------------------

XML can be a beast to parse, but CSV? I can parse that very easily using
standard library functions like `strtok`. The problem came when the values in
this list did not match the values shown inside the level editor.

![Tiled Map Editor](/assets/tiled.png)

```xml
<data encoding="csv">
49,50,50...50,51
97,
.
.
.
145,146...146,147
</data>
```

That first tile in the upper left corner? It has a GID of 48 inside the
*editor*. In the *file*, it has a GID of 49. This difference is not readily
apparent from the editor or the file itself unless you know it's there. This
created the interesting case where my level looked pretty good in the editor but
looked like someone had placed tiles at random when my engine loaded the file
into memory.

![](/assets/wrong_gid.png)

The documentation failed to mention this too, making this all the more difficult
to track down. Eventually, I figured out that the fix is to decrement each tile
GID as it is extracted from the file.

```c
for (int i = 0; i < count; i++)
        ret->tile_gids[i] = vals[i]-1;
```

Thankfully, it didn't require an hour's worth of debugging inside GDB to find
this.
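
Putting the two steps together, the `strtok` parsing and the offset fix could be sketched roughly like this. `parse_gids` and its parameters are hypothetical names, not the code from my engine. The sketch assumes the offset matches the tileset's `firstgid` attribute (1 for the first tileset in a map) and that a stored value of 0 marks an empty cell, which a blind decrement would otherwise turn into -1 by accident.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a comma separated list of GIDs from csv (modified in place by
 * strtok) into out, subtracting first_gid from each value.
 * Assumption: a stored 0 means "no tile" and is reported as -1.
 * Returns the number of values written, at most max.
 */
size_t parse_gids(char *csv, long first_gid, long *out, size_t max)
{
        static const char delims[] = ",\r\n \t";
        size_t count = 0;

        assert(csv != NULL && out != NULL);
        /* Newlines count as delimiters too, since each row ends in ",\n". */
        for (char *tok = strtok(csv, delims); tok != NULL && count < max;
             tok = strtok(NULL, delims)) {
                long gid = strtol(tok, NULL, 10);
                out[count++] = (gid == 0) ? -1 : gid - first_gid;
        }
        return count;
}
```

Note that `strtok` writes into its input, so the text content of the `<data>` tag has to be copied into a writable buffer before it is handed to a function like this.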

Conclusion
----------

If you can get away with it, I think it's better to parse a binary file of your
own creation than to parse an existing standard. Why binary? Because you can
more finely control how the data is formatted, and parsing becomes a non-issue
thanks to the ease of use of a standard C `FILE` pointer. I think the next thing
I'll do is write a script that converts these XML files into something much
more terse and less quirky.
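
To give a rough idea of how little code such a format needs, here is a sketch of a reader and writer built on `fread` and `fwrite`. The layout is hypothetical (a two-field header followed by the tile IDs), as are the names `level_header`, `level_write` and `level_read`. It also assumes the file is read on the same kind of machine that wrote it, since raw structs carry the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-size header for a homemade level format. */
struct level_header {
        uint32_t width;   /* map width in tiles */
        uint32_t height;  /* map height in tiles */
};

/* Write the header followed by width*height tile IDs. Returns 0 on success. */
int level_write(FILE *fp, const struct level_header *h, const int32_t *tiles)
{
        assert(fp != NULL && h != NULL && tiles != NULL);
        size_t n = (size_t)h->width * h->height;
        if (fwrite(h, sizeof *h, 1, fp) != 1)
                return -1;
        return fwrite(tiles, sizeof *tiles, n, fp) == n ? 0 : -1;
}

/* Read the header, then the tile IDs it announces. Returns 0 on success. */
int level_read(FILE *fp, struct level_header *h, int32_t *tiles, size_t max)
{
        assert(fp != NULL && h != NULL && tiles != NULL);
        if (fread(h, sizeof *h, 1, fp) != 1)
                return -1;
        size_t n = (size_t)h->width * h->height;
        if (n > max)  /* refuse to overflow the caller's buffer */
                return -1;
        return fread(tiles, sizeof *tiles, n, fp) == n ? 0 : -1;
}
```

There are no blank nodes and no off-by-one GIDs to trip over here, but the trade-off is that the file can no longer be opened in a text editor or in the level editor itself, which is why a conversion script would sit between the two.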