HTML parsing in Python, the easy way

Parsing XML is straightforward in Python, but parsing malformed HTML can be troublesome. This little snippet gets around the problem by using µTidylib as an HTML-to-XHTML converter. You can then run whatever XML parser you like on the result (here I use the built-in minidom implementation).

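import urllib
import tidy  # µTidylib
from xml.dom.minidom import parseString as parseDOM

def getDOMFromHTML(url):
    # Ask Tidy for indented XML output with an XML declaration and no generator <meta> tag.
    options = dict(output_xml=1, quote_nbsp=1, add_xml_decl=1, indent=1, tidy_mark=0)
    # Fetch the raw, possibly malformed HTML (Python 2 urllib API).
    bogus = urllib.urlopen(url).read()
    # &nbsp; is not a predefined XML entity, so swap it for its numeric reference.
    proper_xml = unicode(tidy.parseString(bogus, **options)).replace('&nbsp;', '&#160;')  # fixes a bug in HTMLTidy
    return parseDOM(proper_xml)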
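
Once the page has been converted to XML, the result can be walked with the usual minidom calls. Here is a minimal sketch of that second half, assuming a hand-written string stands in for Tidy's output so it runs without µTidylib or a network connection. SAMPLE_XHTML, parses_cleanly and extract_links are names made up for this illustration. It also shows why a named entity such as &nbsp; has to become a numeric reference before minidom will accept the document.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for Tidy's XML output (no DOCTYPE, one named entity).
SAMPLE_XHTML = (
    '<html><body>'
    '<p>Hello&nbsp;world</p>'
    '<a href="http://example.com/a">A</a>'
    '<a href="http://example.com/b">B</a>'
    '</body></html>'
)

def parses_cleanly(xml_text):
    """Return True if minidom can parse the text, False on an expat error."""
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        return False

def extract_links(xml_text):
    """Replace &nbsp; with its numeric reference, parse, and collect href values."""
    dom = parseString(xml_text.replace('&nbsp;', '&#160;'))
    return [a.getAttribute('href') for a in dom.getElementsByTagName('a')]
```

Calling extract_links(SAMPLE_XHTML) returns the two href values in document order, while parses_cleanly(SAMPLE_XHTML) is False until the entity has been replaced.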