Parsing HTML
Downloading raw text or a binary file is a good starting point, but the main language of the web is HTML.
HTML is a structured language that defines the different parts of a document, such as headings and paragraphs. It is also hierarchical: elements contain sub-elements. Parsing raw text into a structured document is what makes it possible to extract information automatically from a web page. For example, a piece of text may be relevant only if it is enclosed in a particular HTML element, such as a div with a given class, or if it follows an h3 heading tag.
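To make the idea concrete, here is a minimal sketch that pulls out the text enclosed in a div with a given class. It deliberately uses only html.parser from the Python standard library, so it runs before anything is installed; it is an illustration of the principle, not the approach this recipe takes. The sample page, the summary class name, and the DivClassExtractor helper are all invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html><body>
  <h3>Latest news</h3>
  <p>First paragraph after the heading.</p>
  <div class="summary">Text inside the <b>summary</b> div.</div>
  <div class="footer">Unrelated footer text.</div>
</body></html>
"""


class DivClassExtractor(HTMLParser):
    """Collect the text found inside <div> elements with a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0    # > 0 while inside a matching div (tracks nested divs)
        self.chunks = []  # pieces of text seen inside a matching div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        # Enter a matching div, or go one level deeper inside one.
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks).strip()


extractor = DivClassExtractor('summary')
extractor.feed(SAMPLE_HTML)
extractor.close()
summary_text = extractor.text()
print(summary_text)
```

Even this small case needs explicit bookkeeping for nesting, which is the kind of work a dedicated parsing library takes care of.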
Getting ready
We'll use the excellent Beautiful Soup module to parse HTML text into an in-memory object that can be analyzed. We need a recent version of the beautifulsoup4 package to be compatible with Python 3. Add the package to your requirements.txt file and install the dependencies in the virtual environment:
$ echo "beautifulsoup4==4.8.2" >> requirements.txt
$ pip install -r requirements.txt