Reading HTML documents
A great deal of content on the web is presented using HTML markup. A browser renders the data very nicely. How can we parse this data to extract the meaningful content from the displayed web page?
We can use the standard library html.parser module, but it's not as helpful as we'd like. It only provides low-level lexical scanning information; it doesn't provide a high-level data structure that describes the original web page.
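To see what "low-level lexical scanning" means in practice, here is a minimal sketch using html.parser. The EventLogger class name and the sample HTML fragment are made up for illustration. The parser reports a flat stream of start-tag, end-tag, and text events, and any tree structure is left for us to assemble.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each lexical event the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.events.append(("data", data.strip()))


# Illustrative fragment, not taken from a real page
parser = EventLogger()
parser.feed("<p>Hello <a href='x.html'>world</a></p>")
parser.close()

for event in parser.events:
    print(event)
```

Nothing in this stream says that the a tag is nested inside the p tag. We would have to track open tags ourselves to recover that relationship.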
Instead, we'll use the Beautiful Soup module to parse HTML pages into more useful data structures. This is available from the Python Package Index (PyPI). See https://pypi.python.org/pypi/beautifulsoup4.
This must be downloaded and installed. Often, this is as simple as doing the following:
python -m pip install beautifulsoup4
Using the python -m pip command ensures that we will use the pip command that goes with the currently active virtual environment.
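If it's unclear which interpreter, and therefore which pip, is active, a quick check along the following lines can help. This is only a sketch: the in_virtual_environment() function name is our own invention, and it relies on the fact that sys.prefix differs from sys.base_prefix inside an environment created with the venv module.

```python
import sys


def in_virtual_environment() -> bool:
    """Hypothetical helper: True when running inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


# The interpreter that "python -m pip" would run under
print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```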
Getting ready
We've gathered some...