field | value | date |
---|---|---|
author | Danny Holman <dholman@gymli.org> | 2020-02-09 23:16:30 -0600 |
committer | Danny Holman <dholman@gymli.org> | 2020-02-09 23:16:30 -0600 |
commit | 17b4dea56d7bae31fd8cb639966abe8c5542845f | |
tree | 8c2345d438447a1a6ac41c278920b5c1fbf0627c | |
parent | 939a2a4bb39035c92aeee0af52ff3c456a202d2f | |
Add new post about XML parsing
Add a new post titled "Adventures in XML Parsing".
-rw-r--r-- | _posts/2020-02-09-Adventures-in-XML-Parsing.md | 123 |
-rw-r--r-- | assets/tiled.png | bin 0 -> 211256 bytes |
-rw-r--r-- | assets/wrong_gid.png | bin 0 -> 39519 bytes |
3 files changed, 123 insertions, 0 deletions
diff --git a/_posts/2020-02-09-Adventures-in-XML-Parsing.md b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
new file mode 100644
index 0000000..bfa844f
--- /dev/null
+++ b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
@@ -0,0 +1,123 @@
+---
+layout: post
+title: Adventures in XML Parsing
+---
+
+I think pretty much everyone has realized at this point that XML is not very
+easy to parse. With a little documentation and a helpful parsing library it
+should be, at the very least, manageable, right?
+
+That's what I thought when I attempted to write a TMX parser for the first
+time. I quickly found out how much of a pain it is to parse XML even with the
+format documentation right in front of me and a robust library to work with.
+
+Seemingly random blank tags
+---------------------------
+
+I think the main issue with the XML standard is just how many quirks a file can
+possibly have. Things like this:
+
+```xml
+<doc>
+    <element>data</element>
+    <!-- There's a blank tag here! -->
+    <element>more data</element>
+</doc>
+```
+
+That blank tag accounts for the whitespace that *supposedly* exists there. When
+this is detected by LibXML2, a blank `<text>` tag is placed in between two
+otherwise valid XML tags. Now when parsed by Python or JavaScript or other
+languages where pointers are essentially non-existent, this should never come
+up. When parsed with a language like C however...well
+
+```sh
+zsh: segmentation fault (core dumped) ./test
+```
+
+So, like any good developer, I ran it under Valgrind, and soon discovered that
+this is no ordinary memory fault.
+
+```sh
+==2373696== Invalid read of size 8
+==2373696== by _parse_layer(void*)
+==2373696== at xmlStrEqual(nodePtr*, xmlChar*)
+==2373696== Address 0x0 not stack'd, malloc'd or (recently) free'd
+```
+
+Now at this point, I'm thinking "Wait, the bug is in LibXML? That can't be
+right." GDB, with liberal use of `bt` and `print`, pointed at the same result:
+that the bug resided with LibXML. None of this made sense in the slightest. Why
+on earth would a professionally written software library that was essentially
+a standard fixture on many Unix-like systems have a major memory bug in it? The
+answer would not reveal itself until I observed the program with hardware
+watchpoints.
+
+```sh
+(gdb) watch *node
+Hardware watchpoint 2: *node
+(gdb)
+...
+
+Hardware watchpoint 2: node
+
+Old value = (nodePtr *) 0x5555...
+New value = (nodePtr *) 0x0
+```
+
+There you are! This was that pesky `<text>` tag. Apparently, any, and I do mean
+*any*, whitespace detected by LibXML, including the space inserted by my level
+editor, produces this strange `<text>` tag that seems to be there for no clear
+reason. Three new helper functions and judicious use of
+`nodePtr = nodePtr->next` later and that problem is solved.
+
+Comma separated values *inside* XML tags
+----------------------------------------
+
+XML can be a beast to parse, but CSV? I can parse that very easily using
+standard library functions like `strtok`. The problem came when the values in
+this list did not match the values inside the level editor.
+
+![Tiled Map Editor](/assets/tiled.png)
+
+```xml
+<data encoding="csv">
+49,50,50...50,51
+97,
+.
+.
+.
+145,146...146,147
+</data>
+```
+
+That first tile in the upper left corner? It has a GID of 48 inside the
+*editor*. In the *file*, it has a GID of 49. This difference is not readily
+apparent from the editor or the file itself unless you know it's there. This
+created the interesting case where my level looked pretty good in the editor but
+looked like someone placed tiles seemingly at random when my engine loaded the
+file into memory.
+
+![](/assets/wrong_gid.png)
+
+The documentation failed to mention this too, making this all the more difficult
+to track down. Eventually, I figured out the fix: when the tile GID is
+extracted from the file, just decrement the value.
+
+```c
+for (int i = 0; i < count; i++)
+    ret->tile_gids[i] = vals[i]-1;
+```
+
+Thankfully, it didn't require an hour's worth of debugging inside GDB to find
+this.
+
+Conclusion
+----------
+
+I think that if you can get away with it, try to parse a binary file of your own
+creation rather than try to parse an existing standard. Why binary? Because you
+can more finely control how the data is formatted and parsing becomes a
+non-issue thanks to the ease of use of a standard C `FILE` pointer. I think the
+next thing I'll do is write a script that converts these XML files into
+something much more terse and less quirky.
diff --git a/assets/tiled.png b/assets/tiled.png
Binary files differ
new file mode 100644
index 0000000..d818aeb
--- /dev/null
+++ b/assets/tiled.png
diff --git a/assets/wrong_gid.png b/assets/wrong_gid.png
Binary files differ
new file mode 100644
index 0000000..3852502
--- /dev/null
+++ b/assets/wrong_gid.png