Normalization is the process of preparing text for subsequent analysis. This is frequently performed once the text has been tokenized. Normalization activities include such tasks as converting the text to lowercase, validating data, inserting missing elements, stemming, lemmatization, and removing stop words.
We have already examined the stemming and lemmatization process in earlier recipes. In this recipe, we will show how stop words can be removed. Stop words are those words that are not always useful. For example, some downstream NLP tasks do not need to have words such as a, the, or and. These types of words are the common words found in a language. Analysis can often be enhanced by removing them from a text.