Parsing HTML with lxml
Another powerful, fast, and flexible parser is the HTML parser that comes with lxml. Because lxml is an extensive library for parsing both XML and HTML documents, it can also cope with malformed markup, such as unclosed or mismatched tags.
Let's start with an example.
Here, we will use the requests module to retrieve the web page and parse it with lxml:
# Importing modules
from lxml import html
import requests

response = requests.get('http://packtpub.com/')
tree = html.fromstring(response.content)
Now the whole HTML document is stored in tree as a tree structure that we can query in two different ways: XPath or CSS selectors. XPath is used to navigate through elements and attributes to find information in structured documents such as HTML or XML.
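To see both ideas without touching the network, here is a minimal sketch that parses a small inline document and queries it with XPath. The markup, the books id, and the book class are made up for illustration; the li tags are deliberately left unclosed to show that the parser still builds a usable tree.

```python
from lxml import html

# Hypothetical, deliberately malformed markup: the <li> tags are never closed.
page = """
<html><body>
<ul id="books">
  <li class="book">Book A
  <li class="book">Book B
</ul>
</body></html>
"""

tree = html.fromstring(page)

# Select elements: every <li> directly under the list with id "books".
items = tree.xpath('//ul[@id="books"]/li')
names = [li.text.strip() for li in items]

# Select attributes: the class attribute of every <li>.
classes = tree.xpath('//li/@class')

print(names)
print(classes)
```

The same elements could probably be selected with tree.cssselect('ul#books li') as well, but that method relies on the separate cssselect package being installed, so it is left out of the sketch.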
We can use any page inspection tool, such as Firebug or the Chrome developer tools, to get the XPath of an element.
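If a browser tool is not at hand, lxml can also report an XPath for an element it has already parsed. The following is a small sketch on made-up markup; it asks the tree for the path of the second paragraph and then checks that the path leads back to the same element.

```python
from lxml import html

# Hypothetical document used only for illustration.
doc = html.fromstring(
    "<html><body><div><p>first</p><p>second</p></div></body></html>"
)

# Pick the second <p> element, then ask the tree for its absolute XPath.
target = doc.xpath("//p")[1]
path = doc.getroottree().getpath(target)
print(path)

# Evaluating the generated path should return the element we started from.
match = doc.xpath(path)[0]
print(match.text)
```

Paths produced this way, like the ones copied from a browser, are positional and tend to break when the page layout changes, so expressions based on ids or classes are usually a more robust choice for scraping.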
If we want to get the book names and prices from the list, find the following section in the source.
<div...