Add new post about XML parsing

Add a new post titled "Adventures in XML Parsing".
author: Danny Holman <dholman@gymli.org> 2020-02-09 23:16:30 -0600
committer: Danny Holman <dholman@gymli.org> 2020-02-09 23:16:30 -0600
commit: 17b4dea56d7bae31fd8cb639966abe8c5542845f (patch)
tree: 8c2345d438447a1a6ac41c278920b5c1fbf0627c
parent: Fix missing date format in post header (diff)
download: blog-17b4dea56d7bae31fd8cb639966abe8c5542845f.tar.gz
blog-17b4dea56d7bae31fd8cb639966abe8c5542845f.tar.zst
blog-17b4dea56d7bae31fd8cb639966abe8c5542845f.zip
3 files changed, 123 insertions, 0 deletions
diff --git a/_posts/2020-02-09-Adventures-in-XML-Parsing.md b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
new file mode 100644
index 0000000..bfa844f
--- /dev/null
+++ b/_posts/2020-02-09-Adventures-in-XML-Parsing.md
@@ -0,0 +1,123 @@
+---
+layout: post
+title: Adventures in XML Parsing
+---
+
+I think pretty much everyone has realized at this point that XML is not very
+easy to parse. With a little documentation and a helpful parsing library it
+should be, at the very least, manageable, right?
+
+That's what I thought when I attempted to write a TMX parser for the first
+time. I quickly found out how much of a pain it is to parse XML even with the
+format documentation right in front of me and a robust library to work with.
+
+Seemingly random blank tags
+---------------------------
+
+I think the main issue with the XML standard is just how many quirks a file can
+possibly have. Things like this:
+
+```xml
+<doc>
+        <element>data</element>
+        <!-- There's a blank tag here! -->
+        <element>more data</element>
+</doc>
+```
+
+That blank tag accounts for the whitespace that *supposedly* exists there. When
+libxml2 sees it, it places a blank `<text>` tag (really a whitespace-only text
+node) between two otherwise valid XML tags. In Python, JavaScript, or other
+languages where pointers are essentially nonexistent, this should never come
+up. When parsed with a language like C, however...well
+
+```sh
+zsh: segmentation fault (core dumped)   ./test
+```
+
+So, like any good developer, I ran it under Valgrind, and soon discovered that
+this is no ordinary memory fault.
+
+```sh
+==2373696== Invalid read of size 8
+==2373696==     at xmlStrEqual(nodePtr*, xmlChar*)
+==2373696==     by _parse_layer(void*)
+==2373696==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
+```
+
+Now at this point, I'm thinking "Wait, the bug is in libxml2? That can't be
+right." GDB, with liberal use of `bt` and `print`, pointed at the same result:
+the bug resided in libxml2. None of this made sense in the slightest. Why on
+earth would a professionally written library that is essentially a standard
+fixture on many Unix-like systems have a major memory bug in it? The answer
+did not reveal itself until I observed the program with hardware
+watchpoints.
+
+```sh
+(gdb) watch *node
+Hardware watchpoint 2: *node
+(gdb)
+...
+
+Hardware watchpoint 2: *node
+
+Old value = (nodePtr *) 0x5555...
+New value = (nodePtr *) 0x0
+```
+
+There you are! This was that pesky `<text>` tag. Apparently any, and I do mean
+*any*, whitespace detected by libxml2, including the space inserted by my level
+editor, produces this strange `<text>` tag that seems to be there for no clear
+reason. Three new helper functions and judicious use of
+`nodePtr = nodePtr->next` later, that problem is solved.
+
+Comma separated values *inside* XML tags
+----------------------------------------
+
+XML can be a beast to parse, but CSV? I can parse that very easily using
+standard library functions like `strtok`. The problem came when the tile GIDs
+in this list did not match the GIDs shown inside the level editor.
+
+![Tiled Map Editor](/assets/tiled.png)
+
+```xml
+<data encoding="csv">
+49,50,50...50,51
+97,
+.
+.
+.
+145,146...146,147
+</data>
+```
+
+That first tile in the upper left corner? It has a GID of 48 inside the
+*editor*. In the *file*, it has a GID of 49. This difference is not readily
+apparent from the editor or the file itself unless you know it's there. This
+created the interesting case where my level looked pretty good in the editor but
+looked like someone had placed tiles at random when my engine loaded the file
+into memory.
+
+![](/assets/wrong_gid.png)
+
+The documentation failed to mention this too, making it all the more difficult
+to track down. Eventually, I figured out that the fix is to just decrement each
+tile GID as it is extracted from the file.
+
+```c
+for (int i = 0; i < count; i++)
+        ret->tile_gids[i] = vals[i] - 1;  /* file GID = editor GID + 1 */
+```
+
+Thankfully, it didn't require an hour's worth of debugging inside GDB to find
+this.
+
+Conclusion
+----------
+
+If you can get away with it, I think it's better to parse a binary file of your
+own creation than to parse an existing standard. Why binary? Because you can
+more finely control how the data is formatted, and parsing becomes a non-issue
+thanks to the ease of use of a standard C `FILE` pointer. I think the next
+thing I'll do is write a script that converts these XML files into something
+much more terse and less quirky.
diff --git a/assets/tiled.png b/assets/tiled.png
new file mode 100644
index 0000000..d818aeb
--- /dev/null
+++ b/assets/tiled.png
diff --git a/assets/wrong_gid.png b/assets/wrong_gid.png
new file mode 100644
index 0000000..3852502
--- /dev/null
+++ b/assets/wrong_gid.png
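
The GID adjustment from the post's CSV section can be sketched with `strtok` as well. This is a sketch under stated assumptions, not the engine's actual code: `parse_gids` is a hypothetical name, and it assumes the off-by-one comes from the TMX tileset's `firstgid` attribute (1 for the first tileset in a map), with a stored GID of 0 meaning an empty cell. It ignores the flip flags Tiled can store in the high bits of a GID.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse comma/newline separated GIDs from `csv`
 * into `out` (at most `max` entries). Each GID is converted to a
 * 0-based tile index by subtracting `firstgid`; a GID of 0 (assumed to
 * mean "no tile") becomes -1. Returns the number of values stored.
 * Note: strtok modifies `csv` in place. */
static size_t parse_gids(char *csv, unsigned firstgid, int *out, size_t max)
{
        const char *delims = ", \t\r\n";
        size_t n = 0;

        assert(csv != NULL && out != NULL);
        for (char *tok = strtok(csv, delims); tok != NULL && n < max;
             tok = strtok(NULL, delims)) {
                unsigned long gid = strtoul(tok, NULL, 10);

                out[n++] = (gid == 0) ? -1 : (int)(gid - firstgid);
        }
        return n;
}
```

With `firstgid` set to 1, a stored value of 49 maps back to 48, matching the number the post reports seeing in the editor. Treating 0 separately avoids turning an empty cell into a bogus index, which a blanket decrement would do.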