Annoyances in .NET XML libraries

At work I’m building a simple tool to populate a FogBugz wiki page with build information. One of the things this tool needs to do is pull the XHTML contents of a wiki page, parse it (as XML), and take action on the resulting document tree. Initially I expected this to be stupid-easy, as XHTML is just XML, right?

Au contraire!

Problem 1: XHTML is NOT just XML

The first problem is that XHTML documents are likely to contain entity references like &nbsp; and whatnot. These aren't among the five entities XML predefines; they're declared by the XHTML DTD, so you must load that DTD in order to resolve them. Trouble is, this means there must be a proper XHTML DOCTYPE declaration in your XHTML (which there isn't in my case, since I'm using fragments).
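To make the failure concrete, here is a minimal sketch of the same effect without the full XHTML DTD. It declares only nbsp in an internal DTD subset, which is a simplification for illustration and not what I ended up doing; the class name EntityDemo and the sample fragment are made up.

```csharp
using System;
using System.Xml;

class EntityDemo
{
    static void Main()
    {
        // Hypothetical fragment containing an XHTML entity reference.
        string fragment = "<p>a&nbsp;b</p>";

        // Without any DTD, nbsp is undeclared and the parser rejects it.
        try
        {
            new XmlDocument().LoadXml(fragment);
        }
        catch (XmlException e)
        {
            Console.WriteLine("Parse failed: " + e.Message);
        }

        // Declaring the entity in an internal DTD subset lets the parser resolve it.
        // The DOCTYPE name must match the root element of the fragment.
        string wrapped = "<!DOCTYPE p [<!ENTITY nbsp \"&#160;\">]>" + fragment;
        var doc = new XmlDocument();
        doc.LoadXml(wrapped);
        Console.WriteLine("Parsed: " + doc.DocumentElement.OuterXml);
    }
}
```

This only scales to a handful of entities, which is why loading the real XHTML entity definitions is the more robust route.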

Once a valid DOCTYPE declaration is added to the XHTML, .NET will download the full DTD from the W3C just to parse a little XHTML fragment. Not acceptable. So, I had to download the XHTML strict DTD, and the three dependent DTDs containing entity definitions, and paste them together to create one big DTD with all the XHTML entities inline. I then added that file to my project as an embedded resource, and ripped off this code to write an HtmlResolver subclass of XmlUrlResolver to intercept requests for the XHTML DTDs, and instead pull my jumbo DTD out of a resource stream and return that.
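For the curious, a resolver along those lines might look roughly like the sketch below. This is not the exact code I used. The resource name MyTool.xhtml1-combined.dtd is a placeholder that depends on your project's default namespace and file name, and the URL check assumes the DOCTYPE points at the usual w3.org location.

```csharp
using System;
using System.Reflection;
using System.Xml;

// Sketch of a resolver that serves one combined XHTML DTD from an embedded
// resource instead of letting the parser fetch it over the network.
class HtmlResolver : XmlUrlResolver
{
    // Placeholder name: adjust to match the embedded resource in your project.
    const string DtdResource = "MyTool.xhtml1-combined.dtd";

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Intercept requests for the W3C-hosted XHTML DTDs.
        if (absoluteUri.IsAbsoluteUri &&
            absoluteUri.Host == "www.w3.org" &&
            absoluteUri.AbsolutePath.StartsWith("/TR/xhtml1/DTD/"))
        {
            // Returns null if no resource with this name exists in the assembly.
            return Assembly.GetExecutingAssembly()
                           .GetManifestResourceStream(DtdResource);
        }

        // Anything else falls through to the default behavior.
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}
```

To use it, assign an instance to the XmlResolver property of the XmlDocument before calling LoadXml on the DOCTYPE-prefixed markup.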

Problem 2: Automatic XHTML namespace

The second problem cropped up when I tried to issue an XPath query for all the h1 elements in my markup. For some reason, the call to SelectNodes was always returning zero matches, even though I knew the XHTML contained a multitude of h1 elements. The cause became clear when I looked at the InnerXml property of my XmlDocument object. Something was adding an xmlns namespace attribute to the html element I was generating for the root of my document tree, even though the string from which the XmlDocument was generated never explicitly specified a namespace. (My best guess at the culprit is the XHTML DTD itself, which declares a fixed default xmlns attribute on html.) Since I was issuing an XPath query for h1 elements in no namespace, and the parser put all the elements in the XHTML namespace, it was silently skipping all my h1 elements!

The fix was to replace html as my root node with something not in the XHTML DTD, like wikipage. Still, that should not happen!
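If renaming the root isn't an option, another route is to make the XPath query namespace-aware with an XmlNamespaceManager. The sketch below shows the idea under a couple of assumptions: the xmlns attribute is written out by hand to mimic what ended up on my html element, and the prefix x is an arbitrary choice.

```csharp
using System;
using System.Xml;

class NamespaceDemo
{
    static void Main()
    {
        // The xmlns here stands in for the one that appeared automatically.
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<h1>One</h1><h1>Two</h1></body></html>";

        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Unprefixed names in XPath match only elements in no namespace.
        Console.WriteLine(doc.SelectNodes("//h1").Count);         // prints 0

        // Bind a prefix to the XHTML namespace and use it in the query.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");
        Console.WriteLine(doc.SelectNodes("//x:h1", ns).Count);   // prints 2
    }
}
```

The cost is that every element name in every query needs the prefix, which is why swapping the root element was the quicker fix for a small tool.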

Idea for .NET 4.0: XHTML support, built in!