Friday, January 22, 2010

Validating DTDs in Python with lxml

After replacing the string-pasting in Buildbot's web backend with proper HTML templates (using Jinja), I decided to write a unit test that follows all links on all pages and makes sure the pages validate and don't contain stale links.

lxml seemed like a nice tool for that, but because our XHTML starts with a reference to w3.org's DTD, I got this straight away:
failed to load HTTP resource
Googling found me the "Caching DTDs using lxml and etree" article by Jimmy Stratton. He had run into the same problem last fall, dug around for a while (against the forces of nature, in this case represented by the authors and documentation of lxml and Python), and was nice enough to post his solution on the web.

The resolver works flawlessly. Thanks Jimmy! I owe you one.
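
For the impatient, here is roughly what such a resolver can look like. This is my own minimal sketch, not Jimmy's code: the names CachedDTDResolver, make_parser and the dtd_cache directory are made up for illustration, and it assumes you have already saved the DTD (plus the .ent files it pulls in) into that directory.

```python
import os

from lxml import etree

# Hypothetical directory holding local copies of the DTD and entity files.
DTD_CACHE_DIR = "dtd_cache"


class CachedDTDResolver(etree.Resolver):
    """Serve DTDs from a local directory instead of fetching them over HTTP."""

    def __init__(self, cache_dir):
        super().__init__()
        self.cache_dir = cache_dir

    def resolve(self, system_url, public_id, context):
        if system_url is None:
            return None
        # e.g. http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
        #   -> dtd_cache/xhtml1-strict.dtd
        local_path = os.path.join(self.cache_dir, os.path.basename(system_url))
        if os.path.exists(local_path):
            return self.resolve_filename(local_path, context)
        # Not cached: let lxml fall back to its default behaviour.
        return None


def make_parser(cache_dir=DTD_CACHE_DIR):
    """Build a validating parser that never touches the network."""
    parser = etree.XMLParser(load_dtd=True, dtd_validation=True,
                             no_network=True)
    parser.resolvers.add(CachedDTDResolver(cache_dir))
    return parser
```

Parsing a page with etree.fromstring(page_source, make_parser()) should then validate it against the cached DTD, and raise etree.XMLSyntaxError if the page is invalid. Matching on the file name alone is the simplest thing that could work; if you cache DTDs from several sources you would probably want to key on the full URL instead.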

Update: Since this is a reasonably popular post, here's the unit test for Buildbot, which applies this using the Twisted framework to validate all linked-to pages on the website. (It was lost from the source tree in one of the many refactorings and cleanups of Buildbot's test suite. Also, it added some troublesome dependencies for the many build slaves and was too broad in scope. It should be recast as unit tests for each page's possible states.)