Chapter 4: Text Preprocessing, Stemming, and Lemmatization
Textual data can be gathered from many different sources and takes many forms. It may be tidy and readable or raw and messy, and it can arrive in a wide range of styles and formats. In this chapter, we look at how to preprocess this data so that it is converted into a standard format before it reaches our NLP models.
Stemming and lemmatization, like tokenization, are forms of NLP preprocessing. However, whereas tokenization splits a document into individual words, stemming and lemmatization attempt to reduce those words further to their lexical roots. For example, almost any verb in English takes several different forms, depending on tense:
He jumped
He is jumping
He jumps
Although the verb takes a different form in each sentence, all three forms relate to the same root word: jump. Stemming and lemmatization are both techniques we can use to reduce word variations...
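As a rough illustration of the idea, the following sketch tokenizes the three example sentences and then strips a few common suffixes from each token. The `tokenize` and `naive_stem` helpers and the `SUFFIXES` list are hypothetical names invented for this example. This is not the Porter algorithm or any library's stemmer, only a toy rule set that happens to work for these words.

```python
import re

# Hypothetical suffix list, for illustration only.
# Real stemmers use far richer rule sets than this.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


sentences = ["He jumped", "He is jumping", "He jumps"]
for sentence in sentences:
    tokens = tokenize(sentence)
    stems = [naive_stem(token) for token in tokens]
    print(tokens, "->", stems)
```

In each sentence the verb is reduced to jump, while short words such as "he" and "is" are left alone by the minimum-length check. A rule set this small will mishandle many other words (irregular forms such as "ran" pass through unchanged), which is why dedicated stemming and lemmatization tools are normally used in practice.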