Reading HTML documents
A great deal of content on the Web is presented using HTML markup. A browser renders the data very nicely. How can we parse this data to extract the meaningful content from the displayed web page?
We could use the standard library html.parser module, but it is of limited help here. It only reports low-level lexical scanning events (start tags, end tags, and text); it doesn't build a high-level data structure that describes the original web page.
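To make "low-level lexical scanning" concrete, here is a minimal sketch that uses only the standard library. The EventLogger class name and the sample markup are our own illustrative assumptions; the handle_starttag(), handle_endtag(), and handle_data() methods are the hooks that html.parser.HTMLParser calls as it scans.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record each scanning event in the order it occurs."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, such as <p> or <b>.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag, such as </p>.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        if data.strip():
            self.events.append(("data", data.strip()))


# Made-up sample document, for illustration only.
sample = "<html><body><p>Hello, <b>world</b></p></body></html>"

logger = EventLogger()
logger.feed(sample)
logger.close()

for event in logger.events:
    print(event)
```

The result is a flat stream of events. Nothing in it records that the b tag is nested inside the p tag; we would have to track that structure ourselves, which is the gap a higher-level parser fills.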
We'll use the Beautiful Soup module to parse HTML pages. This is available from the Python Package Index (PyPI). See https://pypi.python.org/pypi/beautifulsoup4.
The package must be downloaded and installed before we can use it. Generally, the pip command handles this job very nicely.
Often, this is as simple as the following:
pip install beautifulsoup4
For Mac OS X and Linux users, the sudo command may be required to escalate the user's privileges when installing into the system-wide Python:
sudo pip install beautifulsoup4
This will prompt for the user's password. The user must be able to elevate themselves to have root privileges...
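After installing, it can be worth confirming that Python can actually find the package. The following is a minimal sketch, not a required step: is_installed() is a hypothetical helper of our own, built on the standard library's importlib.util.find_spec(). Note that the distribution is named beautifulsoup4, but the importable module is named bs4. The sample markup is made up for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Hypothetical helper: True if a top-level module can be located."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    # Import only after the check, so the script still runs without the package.
    from bs4 import BeautifulSoup

    # "html.parser" selects the parser that ships with Python.
    soup = BeautifulSoup(
        "<html><body><p>Hello, <b>world</b></p></body></html>",
        "html.parser",
    )
    # Unlike a raw event stream, the soup object is a navigable tree.
    print(soup.p.get_text())
else:
    print("Beautiful Soup was not found; try: pip install beautifulsoup4")
```

If the check fails right after a seemingly successful install, one likely cause is that pip installed the package for a different Python interpreter than the one running the script.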