HTML parsing in Python, the easy way
Parsing XML is straightforward in Python, but parsing malformed HTML can be troublesome. This little snippet gets around that by using µTidylib as an HTML-to-XHTML converter. You can then run whatever XML parser you like on the result (here I use the built-in minidom implementation).
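    import urllib
    import tidy
    from xml.dom.minidom import parseString as parseDOM

    def getDOMFromHTML(url):
        # Ask Tidy for well-formed XML output with an XML declaration
        options = dict(output_xml=1, quote_nbsp=1, add_xml_decl=1,
                       indent=1, tidy_mark=0)
        # Fetch the raw, possibly malformed, HTML
        bogus = urllib.urlopen(url).read()
        # &nbsp; is not a predefined XML entity, so swap it out before parsing
        proper_xml = unicode(tidy.parseString(bogus, **options)).replace('&nbsp;', ' ')  # fixes a bug in HTMLTidy
        return parseDOM(proper_xml)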
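Once you have a DOM, the usual minidom calls apply. Here is a rough sketch of what that might look like, using only the standard library. The SAMPLE_XHTML string is a made-up stand-in for Tidy's output, and extract_links and parses_as_xml are hypothetical helper names of my own, not part of µTidylib or minidom. The second helper also illustrates why a leftover &nbsp; entity matters: the expat parser behind minidom rejects it as undefined, while the numeric reference &#160; goes through.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical sample standing in for cleaned-up XHTML output.
SAMPLE_XHTML = (
    '<?xml version="1.0"?>'
    '<html><body>'
    '<p>See <a href="http://example.com/a">first</a>&#160;and '
    '<a href="http://example.com/b">second</a>.</p>'
    '</body></html>'
)

def extract_links(xhtml):
    """Return (href, text) pairs for every <a> element that has an href."""
    dom = parseString(xhtml)
    links = []
    for anchor in dom.getElementsByTagName('a'):
        if anchor.hasAttribute('href'):
            # Collect only the direct text children of the anchor.
            text = ''.join(node.data for node in anchor.childNodes
                           if node.nodeType == node.TEXT_NODE)
            links.append((anchor.getAttribute('href'), text))
    return links

def parses_as_xml(text):
    """Return True if minidom (expat) accepts the text, False otherwise."""
    try:
        parseString(text)
        return True
    except ExpatError:
        return False

links = extract_links(SAMPLE_XHTML)
```

With the DOM returned by the function above, the same getElementsByTagName and getAttribute calls should work unchanged; only the source of the XHTML string differs.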