---
layout: post
title: Adventures in XML Parsing
---

I think pretty much everyone has realized at this point that XML is not very easy to parse. With a little documentation and a helpful parsing library it should be, at the very least, manageable, right? That's what I thought when I attempted to write a TMX parser for the first time. I quickly found out how much of a pain it is to parse XML, even with the format documentation right in front of me and a robust library to work with.

Seemingly random blank tags
---------------------------

I think the main issue with the XML standard is just how many quirks a file can possibly have. Things like this:

```xml
data <- There's a blank tag here! -> more data
```

That blank tag accounts for the whitespace that *supposedly* exists there. When LibXML2 detects it, a blank `text` node is placed in between two otherwise valid XML tags. Now, when parsed in Python, JavaScript, or other languages where pointers are essentially nonexistent, this should never come up. When parsed in a language like C, however... well:

```sh
zsh: segmentation fault (core dumped) ./test
```

So, like any good developer, I ran it under Valgrind, and soon discovered that this was no ordinary memory fault.

```sh
==2373696== Invalid read of size 8
==2373696==    by _parse_layer(void*)
==2373696==    at xmlStrEqual(nodePtr*, xmlChar*)
==2373696==  Address 0x0 not stack'd, malloc'd or (recently) free'd
```

At this point, I'm thinking, "Wait, the bug is in LibXML? That can't be right." GDB, with liberal use of `bt` and `print`, pointed at the same result: the bug resided in LibXML. None of this made sense in the slightest. Why on earth would a professionally written software library that is essentially a standard fixture on many Unix-like systems have a major memory bug in it? The answer did not reveal itself until I observed the program with hardware watchpoints.

```sh
(gdb) watch *node
Hardware watchpoint 2: *node
(gdb) ...
Hardware watchpoint 2: node
Old value = (nodePtr *) 0x5555...
New value = (nodePtr *) 0x0
```

There you are! This was that pesky `text` node. Apparently, any, and I do mean *any*, whitespace detected by LibXML, including the space inserted by my level editor, produces this strange `text` node that seems to be there for no clear reason. Three new helper functions and judicious use of `nodePtr = nodePtr->next` later, and that problem is solved.

Comma separated values *inside* XML tags
----------------------------------------

XML can be a beast to parse, but CSV? I can parse that very easily using standard library functions like `strtok`. The problem came when the values in this list did not match the values inside the level editor.

![Tiled Map Editor](/assets/tiled.png)

```xml
49,50,50...50,51
97, . . .
145,146...146,147
```

That first tile in the upper left corner? It has a GID of 48 inside the *editor*. In the *file*, it has a GID of 49. This difference is not readily apparent from the editor or the file itself unless you know it's there. This created the interesting case where my level looked pretty good in the editor but looked like someone had placed tiles at random when my engine loaded the file into memory.

![](/assets/wrong_gid.png)

The documentation failed to mention this too, making it all the more difficult to track down. Eventually, I figured out that the fix is to just decrement each tile GID as it is extracted from the file.

```c
for (int i = 0; i < count; i++)
    ret->tile_gids[i] = vals[i] - 1;
```

Thankfully, it didn't require an hour's worth of debugging inside GDB to find this.

Conclusion
----------

I think that if you can get away with it, try to parse a binary file of your own creation rather than an existing standard. Why binary? Because you can more finely control how the data is formatted, and parsing becomes a non-issue thanks to the ease of use of a standard C `FILE` pointer.
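
To give a rough idea of how little code that takes, here is a minimal sketch of a made-up level format written and read back through a `FILE` pointer. The `level_t` struct, the `LVL1` magic, and the `level_write`/`level_read` names are all hypothetical, invented for this illustration. It also assumes the file is read on a machine with the same byte order as the one that wrote it.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical on-disk layout (an assumption for this sketch):
 *   4 bytes  magic "LVL1"
 *   uint32   width, in tiles
 *   uint32   height, in tiles
 *   int32    tile_gids[width * height]
 * Integers are stored in host byte order. */
typedef struct {
    uint32_t width;
    uint32_t height;
    int32_t *tile_gids; /* width * height entries */
} level_t;

static const char LEVEL_MAGIC[4] = { 'L', 'V', 'L', '1' };

/* Returns 0 on success, -1 on any short write. */
int level_write(FILE *fp, const level_t *lvl)
{
    size_t count = (size_t)lvl->width * lvl->height;

    if (fwrite(LEVEL_MAGIC, 1, sizeof LEVEL_MAGIC, fp) != sizeof LEVEL_MAGIC)
        return -1;
    if (fwrite(&lvl->width, sizeof lvl->width, 1, fp) != 1)
        return -1;
    if (fwrite(&lvl->height, sizeof lvl->height, 1, fp) != 1)
        return -1;
    if (fwrite(lvl->tile_gids, sizeof *lvl->tile_gids, count, fp) != count)
        return -1;
    return 0;
}

/* Returns 0 on success, -1 on a bad magic, an empty level, a short
 * read, or a failed malloc. On success the caller must
 * free(lvl->tile_gids). The header is trusted as-is; a real loader
 * would also cap the dimensions. */
int level_read(FILE *fp, level_t *lvl)
{
    char magic[4];
    size_t count;

    if (fread(magic, 1, sizeof magic, fp) != sizeof magic ||
        memcmp(magic, LEVEL_MAGIC, sizeof magic) != 0)
        return -1;
    if (fread(&lvl->width, sizeof lvl->width, 1, fp) != 1 ||
        fread(&lvl->height, sizeof lvl->height, 1, fp) != 1)
        return -1;

    count = (size_t)lvl->width * lvl->height;
    if (count == 0)
        return -1;

    lvl->tile_gids = malloc(count * sizeof *lvl->tile_gids);
    if (lvl->tile_gids == NULL)
        return -1;
    if (fread(lvl->tile_gids, sizeof *lvl->tile_gids, count, fp) != count) {
        free(lvl->tile_gids);
        lvl->tile_gids = NULL;
        return -1;
    }
    return 0;
}
```

There are no nodes to walk and no whitespace to skip: the loader is a handful of `fread` calls, and every failure mode is a short read or a wrong magic number.
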
I think the next thing I'll do is write a script that converts these XML files into something much more terse and less quirky.
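
The core of such a converter could be sketched in C along these lines, combining the `strtok` parsing and the GID decrement from earlier. The `parse_gids` and `write_gids` names and the raw `int32_t` output layout are hypothetical, chosen only for this example, and the XML handling around it is left out.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parses a comma separated GID list in place (strtok modifies `csv`),
 * subtracts 1 from each value to match the editor's numbering, and
 * stores up to `max` results in `out`. Returns the number stored.
 * A GID of 0, which means "no tile" in a TMX file, comes out as -1. */
size_t parse_gids(char *csv, int32_t *out, size_t max)
{
    size_t n = 0;

    /* Whitespace counts as a separator too, since the list may be
     * wrapped across several lines inside the file. */
    for (char *tok = strtok(csv, ", \t\r\n");
         tok != NULL && n < max;
         tok = strtok(NULL, ", \t\r\n")) {
        out[n++] = (int32_t)strtol(tok, NULL, 10) - 1;
    }
    return n;
}

/* Dumps the converted GIDs as raw int32 values in host byte order.
 * Returns 0 on success, -1 on a short write. */
int write_gids(FILE *fp, const int32_t *gids, size_t count)
{
    return fwrite(gids, sizeof *gids, count, fp) == count ? 0 : -1;
}
```

Doing the decrement once at conversion time means the engine never has to know about the off-by-one at all.
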