Tokenizing and normalizing text
Extracting the contents of the page is just the first step. Before we get to the fun part of analyzing what the article contains (or, if you looked at blog posts, what they are about), we need to split the whole article into sentences and further into words.
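The two-level split described here (article into sentences, sentences into words) can be sketched with NLTK's tokenizers. The snippet below uses a regular-expression word tokenizer, which needs no extra corpus downloads, and a deliberately rough punctuation-based sentence splitter; NLTK's own `sent_tokenize` and `word_tokenize` (which require the `punkt` model to be downloaded first) handle abbreviations and edge cases far better. The article text is made up for illustration:

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative article text (made up for this sketch).
article = "NLTK makes tokenizing easy. It splits text into sentences and words."

# A rough sentence splitter on terminal punctuation; in practice,
# nltk.tokenize.sent_tokenize (backed by the punkt model) is preferable.
sentences = [s.strip()
             for s in article.replace("!", ".").replace("?", ".").split(".")
             if s.strip()]

# Word-level tokenization: \w+ keeps alphanumeric runs and drops punctuation.
tokenizer = RegexpTokenizer(r"\w+")
words = [tokenizer.tokenize(sentence) for sentence in sentences]

print(sentences)
print(words)
```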
Having done so, we would still face another issue: in any text, we would see sentences in different tenses, people using the passive voice, or some rarely seen grammatical constructs. For the purpose of extracting the topic or analyzing the sentiment, we do not really need to see the words "said" and "says" separately; the word "say" would be enough. Thus, we will also be looking at normalizing the text, that is, bringing all the different versions of the same word to some common form.
Getting ready
To execute this recipe, all you need is the Natural Language Toolkit (NLTK). Before we start, however, you need to make sure that the NLTK module is present on your machine. If you are using Anaconda, this is simple...
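Whichever way you install it (for example, `conda install nltk` with Anaconda, or `pip install nltk` otherwise), a quick sanity check confirms that the module is importable:

```python
# Verify that NLTK is installed and importable.
import nltk

print(nltk.__version__)
```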