author Danny Holman <dholman@gymli.org> 2020-02-09 23:16:30 -0600
committer Danny Holman <dholman@gymli.org> 2020-02-09 23:16:30 -0600
commit 17b4dea56d7bae31fd8cb639966abe8c5542845f
tree 8c2345d438447a1a6ac41c278920b5c1fbf0627c /_posts
parent 939a2a4bb39035c92aeee0af52ff3c456a202d2f
Add new post about XML parsing
Add a new post titled "Adventures in XML Parsing".
Diffstat (limited to '_posts')
-rw-r--r-- _posts/2020-02-09-Adventures-in-XML-Parsing.md | 123
1 files changed, 123 insertions, 0 deletions
diff --git a/_posts/2020-02-09-Adventures-in-XML-Parsing.md b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
new file mode 100644
index 0000000..bfa844f
--- /dev/null
+++ b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
@@ -0,0 +1,123 @@
+---
+layout: post
+title: Adventures in XML Parsing
+---
+
+I think pretty much everyone has realized at this point that XML is not very
+easy to parse. Still, with a little documentation and a helpful parsing library,
+it should be, at the very least, manageable, right?
+
+That's what I thought when I attempted to write a TMX parser for the first
+time. I quickly found out how much of a pain it is to parse XML even with the
+format documentation right in front of me and a robust library to work with.
+
+Seemingly random blank tags
+---------------------------
+
+I think the main issue with the XML standard is just how many quirks a file can
+possibly have. Things like this:
+
+```xml
+<doc>
+ <element>data</element>
+ <!-- There's a blank tag here! -->
+ <element>more data</element>
+</doc>
+```
+
+That blank tag accounts for the whitespace that *supposedly* exists there. When
+LibXML2 detects it, a blank `<text>` tag is placed in between two otherwise
+valid XML tags. Now when parsed by Python or JavaScript or other languages
+where pointers are essentially nonexistent, this should never come up. When
+parsed with a language like C, however... well:
+
+```sh
+zsh: segmentation fault (core dumped) ./test
+```
+
+So, like any good developer, I ran it under Valgrind, and soon discovered that
+this was no ordinary memory fault.
+
+```sh
+==2373696== Invalid read of size 8
+==2373696== at xmlStrEqual(nodePtr*, xmlChar*)
+==2373696== by _parse_layer(void*)
+==2373696== Address 0x0 not stack'd, malloc'd or (recently) free'd
+```
+
+Now at this point, I'm thinking, "Wait, the bug is in LibXML? That can't be
+right." GDB, with liberal use of `bt` and `print`, pointed to the same result:
+the bug resided in LibXML. None of this made sense in the slightest. Why on
+earth would a professionally written software library that is essentially a
+standard fixture on many Unix-like systems have a major memory bug in it? The
+answer would not reveal itself until I observed the program with hardware
+watchpoints.
+
+```sh
+(gdb) watch *node
+Hardware watchpoint 2: *node
+(gdb)
+...
+
+Hardware watchpoint 2: *node
+
+Old value = (nodePtr *) 0x5555...
+New value = (nodePtr *) 0x0
+```
+
+There you are! This was that pesky `<text>` tag. Apparently, any, and I do mean
+*any*, whitespace detected by LibXML, including the space inserted by my level
+editor, produces this strange `<text>` tag, which is how LibXML represents a
+whitespace-only text node. Three new helper functions and judicious use of
+`nodePtr = nodePtr->next` later, and that problem was solved.
+
+Comma-separated values *inside* XML tags
+----------------------------------------
+
+XML can be a beast to parse, but CSV? I can parse that very easily using
+standard library functions like `strtok`. The problem came when the values in
+this list did not match the values shown inside the level editor.
+
+![Tiled Map Editor](/assets/tiled.png)
+
+```xml
+<data encoding="csv">
+49,50,50...50,51
+97,
+.
+.
+.
+145,146...146,147
+</data>
+```
+
+That first tile in the upper left corner? It has a GID of 48 inside the
+*editor*. In the *file*, it has a GID of 49. This difference is not readily
+apparent from the editor or the file itself unless you know it's there. This
+created the interesting case where my level looked pretty good in the editor but
+looked like someone had placed tiles seemingly at random when my engine loaded
+the file into memory.
+
+![](/assets/wrong_gid.png)
+
+The documentation failed to mention this too, making this all the more difficult
+to track down. Eventually, I did manage to figure out the fix: when the tile GID
+is extracted from the file, just decrement the value.
+
+```c
+/* GIDs in the file are one higher than in the editor, so shift them down. */
+for (int i = 0; i < count; i++)
+    ret->tile_gids[i] = vals[i] - 1;
+```
+
+Thankfully, it didn't require an hour's worth of debugging inside GDB to find
+this.
+
+Conclusion
+----------
+
+I think that if you can get away with it, try to parse a binary file of your own
+creation rather than try to parse an existing standard. Why binary? Because you
+can more finely control how the data is formatted and parsing becomes a
+non-issue thanks to the ease of use of a standard C `FILE` pointer. I think the
+next thing I'll do is write a script that converts these XML files into
+something much more terse and less quirky.
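
A minimal sketch of the binary approach suggested in the post's conclusion, in C. The file layout (a `uint32_t` count followed by that many `uint32_t` GIDs in native byte order), the `MAX_GIDS` cap, and the `write_gids`/`read_gids` helpers are all assumptions made for illustration; none of them come from the post or from the TMX format.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk layout (an assumption, not from the post):
 *   uint32_t count, then `count` uint32_t tile GIDs,
 *   all in the host's native byte order. */

/* Arbitrary sanity cap so a corrupt count cannot trigger a huge allocation. */
#define MAX_GIDS (1u << 20)

/* Write `count` GIDs to fp. Returns 0 on success, -1 on error. */
int write_gids(FILE *fp, const uint32_t *gids, uint32_t count)
{
    if (count > MAX_GIDS)
        return -1;
    if (fwrite(&count, sizeof count, 1, fp) != 1)
        return -1;
    if (count > 0 && fwrite(gids, sizeof *gids, count, fp) != count)
        return -1;
    return 0;
}

/* Read an array written by write_gids(). On success, stores the length in
 * *count and returns a malloc'd array that the caller must free().
 * Returns NULL on a short read, an oversized count, or allocation failure. */
uint32_t *read_gids(FILE *fp, uint32_t *count)
{
    uint32_t n;

    if (fread(&n, sizeof n, 1, fp) != 1 || n > MAX_GIDS)
        return NULL;

    /* Allocate at least one element so an empty layer still yields non-NULL. */
    uint32_t *gids = malloc((n ? n : 1) * sizeof *gids);
    if (gids == NULL)
        return NULL;

    if (n > 0 && fread(gids, sizeof *gids, n, fp) != n) {
        free(gids);
        return NULL;
    }

    *count = n;
    return gids;
}
```

A converter script could call `write_gids` once per layer, and the engine could then load that layer with a single `read_gids` call. Native byte order keeps the sketch short, but it makes the files non-portable between machines with different endianness.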