HTML parsing in Python, the easy way
Parsing XML is straightforward in Python, but parsing malformed HTML can be troublesome. This short snippet gets around the problem by using µTidylib as an HTML-to-XHTML converter. You can then run whatever XML parser you like on the result (here I use the built-in minidom implementation).
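import urllib  # Python 2: provides urllib.urlopen
import tidy    # µTidylib
from xml.dom.minidom import parseString as parseDOM

def getDOMFromHTML(url):
    # Ask tidy for XML output with an XML declaration and no generator meta tag
    options = dict(output_xml=1, quote_nbsp=1, add_xml_decl=1, indent=1, tidy_mark=0)
    bogus = urllib.urlopen(url).read()
    # &nbsp; is not a predefined XML entity, so swap in the numeric reference
    proper_xml = unicode(tidy.parseString(bogus, **options)).replace('&nbsp;', '&#160;')  # fixes a bug in HTMLTidy
    return parseDOM(proper_xml)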
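Once you have the DOM, the usual minidom calls apply. Here is a minimal sketch of what that might look like, using only the standard library. The `SAMPLE_XHTML` string is a made-up stand-in for tidy's output, and `extract_links` and `parses_cleanly` are hypothetical helper names of my own, not part of µTidylib or minidom. The second helper also shows why a named `&nbsp;` entity is a problem for an XML parser while the numeric `&#160;` reference is not.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical sample standing in for the XHTML that tidy would produce.
SAMPLE_XHTML = (
    '<?xml version="1.0"?>'
    '<html><head><title>Demo</title></head>'
    '<body><p>See <a href="http://example.com/">this&#160;site</a> '
    'and <a href="/about">about</a>.</p></body></html>'
)

def extract_links(dom):
    """Return the href of every <a> element that has one."""
    return [a.getAttribute("href")
            for a in dom.getElementsByTagName("a")
            if a.hasAttribute("href")]

def parses_cleanly(xml_text):
    """Return True if minidom accepts the text, False on a parse error."""
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        return False

dom = parseString(SAMPLE_XHTML)
links = extract_links(dom)
title = dom.getElementsByTagName("title")[0].firstChild.data

print(links)
print(title)
# XML predefines only a handful of entities; nbsp is not one of them.
print(parses_cleanly('<p>a&nbsp;b</p>'))
print(parses_cleanly('<p>a&#160;b</p>'))
```

The same `getElementsByTagName` and `getAttribute` calls should work on the document returned by `getDOMFromHTML`, since it is an ordinary minidom `Document`.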