At work I’m building a simple tool to populate a FogBugz wiki page with build information. One of the things this tool needs to do is pull the XHTML contents of a wiki page, parse it (as XML), and take action on the resulting document tree. Initially I expected this to be stupid-easy, as XHTML is just XML, right?
Au contrare!
The first problem is XHTML documents likely contain entity references like and whatnot. These entity references aren’t XML entities, they’re XHTML entities, so you must load the XHTML DTD in order to resolve them. Trouble is, this means there must be a proper XHTML DOCTYPE directive in your XHTML (which there isn’t in my case since I’m using fragments).
Once a valid DOCTYPE directive is added to the XHTML, now .NET will download the full DTD from W3 just to parse a little XHTML fragment. Not acceptable.