Performing stemming and lemmatization
When analyzing text data, we often need to reduce words to their root or base form. This process is called stemming. Stemming is needed because the same word can appear in several variations depending on the context (for example, "run", "runs", and "running"). Reducing these variations to a common form improves the accuracy of our analysis, because otherwise each variation is treated as a separate word and adds noise to the dataset.
Lemmatization also reduces a word to its base or root form; however, unlike stemming, it considers the context and part of speech to achieve this. While stemming simply strips the last characters or suffixes of a word to get the root form, lemmatization considers the word's structure (its root, prefixes, and suffixes) as well as how part of speech or context changes its meaning.
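To make the contrast concrete, here is a minimal toy sketch in plain Python. It is not a real stemming or lemmatization algorithm: the suffix list, the lookup table, and the function names naive_stem and toy_lemmatize are all hypothetical and exist only for illustration. In practice, we would rely on an NLP library rather than hand-written rules.

```python
# Toy illustration only: the rules and the lookup table below are hypothetical
# and deliberately tiny. They are not a real stemmer or lemmatizer.

SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring context and part of speech."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("studying", "verb"): "study",
    ("better", "adj"): "good",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the base form using both the word and its part of speech."""
    word = word.lower()
    # Fall back to the word itself when the table has no entry.
    return LEMMA_TABLE.get((word, pos), word)


examples = [
    ("studies", "verb"),
    ("meeting", "noun"),
    ("meeting", "verb"),
    ("better", "adj"),
]
for word, pos in examples:
    print(f"{word} ({pos}): stem={naive_stem(word)}, lemma={toy_lemmatize(word, pos)}")
```

In this sketch, the stemmer gives the same result for "meeting" whether it is used as a noun or as a verb, and it can return a string that is not a dictionary word. The lemmatizer uses the part of speech to return a different base form in each case.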
Lemmatization generally produces more accurate results than stemming. The following example illustrates this:
- Original text...