HTML parsing in Python, the easy way

Parsing XML is straightforward in Python, but parsing malformed HTML can be troublesome. This short snippet handles it by using µTidylib as an HTML-to-XHTML converter. You can then run whatever XML parser you want on the result (here I use the built-in minidom implementation).

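import urllib  # Python 2 API; in Python 3, urlopen lives in urllib.request
import tidy    # µTidylib
from xml.dom.minidom import parseString as parseDOM

def getDOMFromHTML(url):
    # Ask tidy for XML output with an XML declaration and no generator meta tag
    options = dict(output_xml=1, quote_nbsp=1, add_xml_decl=1, indent=1, tidy_mark=0)
    bogus = urllib.urlopen(url).read()
    # Replace &nbsp; with a numeric character reference (fixes a bug in HTMLTidy)
    proper_xml = unicode(tidy.parseString(bogus, **options)).replace('&nbsp;', '&#x00a0;')
    return parseDOM(proper_xml)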
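
To see why the &nbsp; replacement matters, and what you can do with the DOM once you have it, here is a minimal sketch that uses only the standard library. It does not call tidy or fetch anything. The XHTML string is a hypothetical stand-in for tidy's output, and the helper names parses, fix_nbsp and extract_links are made up for illustration. The sketch assumes the document carries no DTD that declares the nbsp entity, which is the case where the XML parser rejects it.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for tidy's XHTML output; not fetched from a real page.
XHTML = '<html><body><p>Hello&nbsp;<a href="/docs">docs</a></p></body></html>'


def parses(xml_text):
    """Return True if minidom accepts xml_text, False on a parse error."""
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        return False


def fix_nbsp(xml_text):
    """Swap the HTML-only entity for its numeric character reference."""
    return xml_text.replace('&nbsp;', '&#x00a0;')


def extract_links(xml_text):
    """Collect the href attribute of every <a> element in the document."""
    dom = parseString(fix_nbsp(xml_text))
    return [a.getAttribute('href') for a in dom.getElementsByTagName('a')]
```

Calling parses(XHTML) fails on the raw string because &nbsp; is not one of XML's predefined entities, while the same text passed through fix_nbsp parses cleanly and extract_links can walk it with the usual DOM calls.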

Source

HTML parsing in Python, the easy way