Reading raw text from the Web
Most of the time, free-form text is stored in text files. This recipe does not cover how to read them, as we have already presented many ways of doing so (refer to the set of recipes in Chapter 1, Preparing the Data).
Note
One way of reading a file that we have not explored yet will be discussed in the next recipe.
Often, however, we need to read data straight from the web: we might want to analyze a blog post, scrape an article, or analyze Facebook or Twitter posts. Facebook and Twitter offer Application Programming Interfaces (APIs) that normally return responses in XML or JSON format, but processing HTML files is not as straightforward.
In this recipe, you will learn how to access a web page, read its content, and process it.
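The three steps named above (access, read, process) can be sketched as follows. This is only a minimal illustration, not the recipe's own code: to stay free of extra installs it uses the standard library's html.parser as a stand-in for Beautiful Soup, and the names TextExtractor, fetch_html, and extract_text, as well as the sample markup, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []       # pieces of visible text, in document order
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def fetch_html(url, encoding="utf-8"):
    """Access a page and read its content as a string (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def extract_text(html):
    """Process the HTML: strip the markup and return the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample markup, so the sketch runs without network access.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello, <b>web</b>!</p></body></html>"
)
print(extract_text(sample))
```

For a live page, the two helpers would be chained as extract_text(fetch_html(url)), where url is whatever address you want to read. Text inside inline tags comes back as separate chunks, so some spacing around punctuation may need cleaning up afterwards.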
Getting ready
To execute this recipe, you will need urllib, html5lib, and Beautiful Soup.
Urllib comes with Python 3 (https://docs.python.org/3/library/urllib.html). If, however, your configuration does not have...