From 17b4dea56d7bae31fd8cb639966abe8c5542845f Mon Sep 17 00:00:00 2001
From: Danny Holman
Date: Sun, 9 Feb 2020 23:16:30 -0600
Subject: Add new post about XML parsing

Add a new post titled "Adventures in XML Parsing".
---
 _posts/2020-02-09-Adventures-in-XML-Parsing.md | 123 +++++++++++++++++++++++++
 assets/tiled.png                               | Bin 0 -> 211256 bytes
 assets/wrong_gid.png                           | Bin 0 -> 39519 bytes
 3 files changed, 123 insertions(+)
 create mode 100644 _posts/2020-02-09-Adventures-in-XML-Parsing.md
 create mode 100644 assets/tiled.png
 create mode 100644 assets/wrong_gid.png

diff --git a/_posts/2020-02-09-Adventures-in-XML-Parsing.md b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
new file mode 100644
index 0000000..bfa844f
--- /dev/null
+++ b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
@@ -0,0 +1,123 @@
+---
+layout: post
+title: Adventures in XML Parsing
+---
+
+I think pretty much everyone has realized at this point that XML is not very
+easy to parse. With a little documentation and a helpful parsing library it
+should be, at the very least, manageable, right?
+
+That's what I thought when I attempted to write a TMX parser for the first
+time. I quickly found out how much of a pain it is to parse XML even with the
+format documentation right in front of me and a robust library to work with.
+
+Seemingly random blank tags
+---------------------------
+
+I think the main issue with the XML standard is just how many quirks a file can
+possibly have. Things like this:
+
+```xml
+
+ data
+ <- There's a blank tag here! ->
+ more data
+
+```
+
+That blank tag accounts for the whitespace that *supposedly* exists there. When
+this is detected by LibXML2, a blank `` tag is placed in between two
+otherwise valid XML tags. Now when parsed by Python or JavaScript or other
+languages where pointers are essentially nonexistent, this should never come
+up.
+When parsed with a language like C, however... well
+
+```sh
+zsh: segmentation fault (core dumped) ./test
+```
+
+So, like any good developer, I ran it under Valgrind, and soon discovered that
+this is no ordinary memory fault.
+
+```sh
+==2373696== Invalid read of size 8
+==2373696== by _parse_layer(void*)
+==2373696== at xmlStrEqual(nodePtr*, xmlChar*)
+==2373696== Address 0x0 not stack'd, malloc'd or (recently) free'd
+```
+
+Now at this point, I'm thinking "Wait, the bug is in LibXML? That can't be
+right." GDB, with liberal use of `bt` and `print`, pointed at the same result:
+that the bug resided with LibXML. None of this made sense in the slightest. Why
+on earth would a professionally written software library that was essentially
+a standard fixture on many Unix-like systems have a major memory bug in it? The
+answer would not reveal itself until I observed the program with hardware
+watchpoints.
+
+```sh
+(gdb) watch *node
+Hardware watchpoint 2: *node
+(gdb)
+...
+
+Hardware watchpoint 2: node
+
+Old value = (nodePtr *) 0x5555...
+New value = (nodePtr *) 0x0
+```
+
+There you are! This was that pesky `` tag. Apparently, any, and I do mean
+*any*, whitespace detected by LibXML, including the space inserted by my level
+editor, produces this strange `` tag that seems to be there for no clear
+reason. Three new helper functions and judicious use of
+`nodePtr = nodePtr->next` later, that problem is solved.
+
+Comma-separated values *inside* XML tags
+----------------------------------------
+
+XML can be a beast to parse, but CSV? I can parse that very easily using
+standard library functions like `strtok`. The problem came when the values in
+this list did not match the values shown inside the level editor.
+
+![Tiled Map Editor](/assets/tiled.png)
+
+```xml
+
+49,50,50...50,51
+97,
+.
+.
+.
+145,146...146,147
+
+```
+
+That first tile in the upper left corner? It has a GID of 48 inside the
+*editor*. In the *file*, it has a GID of 49.
+This difference is not readily
+apparent from the editor or the file itself unless you know it's there. This
+created the interesting case where my level looked pretty good in the editor but
+looked like someone had placed tiles seemingly at random when my engine loaded
+the file into memory.
+
+![](/assets/wrong_gid.png)
+
+The documentation failed to mention this too, making this all the more difficult
+to track down. Eventually, I did manage to figure out that the fix is to just
+decrement each tile GID as it is extracted from the file.
+
+```c
+for (int i = 0; i < count; i++)
+    ret->tile_gids[i] = vals[i] - 1;
+```
+
+Thankfully, it didn't require an hour's worth of debugging inside GDB to find
+this.
+
+Conclusion
+----------
+
+I think that, if you can get away with it, you should parse a binary file of
+your own creation rather than an existing standard. Why binary? Because you
+can more finely control how the data is formatted and parsing becomes a
+non-issue thanks to the ease of use of a standard C `FILE` pointer. I think the
+next thing I'll do is write a script that converts these XML files into
+something much more terse and less quirky.
diff --git a/assets/tiled.png b/assets/tiled.png
new file mode 100644
index 0000000..d818aeb
Binary files /dev/null and b/assets/tiled.png differ
diff --git a/assets/wrong_gid.png b/assets/wrong_gid.png
new file mode 100644
index 0000000..3852502
Binary files /dev/null and b/assets/wrong_gid.png differ
--
cgit v1.2.3