Word-level analysis
This section discusses two approaches to analyzing words. The first, lemmatization, involves breaking words down into their components in order to reduce the variation in texts. The second makes use of hierarchically organized semantic information about the meanings of words, in the form of ontologies.
Lemmatization
In our earlier discussion of preprocessing text in Chapter 5, we went over the task of lemmatization (and the related task of stemming) as a tool for regularizing text documents so that there is less variation in the documents we are analyzing. As we discussed, the process of lemmatization converts each word in the text to its root word, discarding information such as plural endings like -s in English. Lemmatization also requires a dictionary, because the dictionary supplies the root words for the words being lemmatized. We used Princeton University’s WordNet (https://wordnet.princeton.edu/) as a dictionary...
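To make the role of the dictionary concrete, here is a minimal sketch of dictionary-based lemmatization. It does not use WordNet. The names TOY_IRREGULAR_FORMS, TOY_KNOWN_LEMMAS, SUFFIX_RULES, and lemmatize are hypothetical, and the tiny word lists stand in for a real dictionary purely for illustration. The point it shows is that an ending is only stripped when the result is a word the dictionary knows.

```python
# A toy stand-in for a real dictionary such as WordNet (assumption: these
# word lists are invented for illustration and are far from complete).
TOY_IRREGULAR_FORMS = {"mice": "mouse", "children": "child", "ran": "run"}
TOY_KNOWN_LEMMAS = {"cat", "bus", "pony", "walk", "mouse", "child", "run"}

# Simplified English suffix rules: (ending to remove, text to put in its place).
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]


def lemmatize(word):
    """Return the root word for `word`, or the lowercased word if none is found."""
    word = word.lower()
    # Irregular forms cannot be handled by suffix rules, so look them up first.
    if word in TOY_IRREGULAR_FORMS:
        return TOY_IRREGULAR_FORMS[word]
    # A word that is already a root word is returned unchanged.
    if word in TOY_KNOWN_LEMMAS:
        return word
    # Try each suffix rule, accepting a candidate only if the dictionary has it.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in TOY_KNOWN_LEMMAS:
                return candidate
    # Unknown words are left as they are rather than guessed at.
    return word


if __name__ == "__main__":
    for token in ["Cats", "buses", "ponies", "mice", "walked", "bus"]:
        print(token, "->", lemmatize(token))
```

Because every candidate is checked against the dictionary, a word such as "bus" keeps its final -s, whereas a rule that blindly removed -s would damage it. A real lemmatizer backed by WordNet works with a far larger vocabulary and typically also takes part of speech into account, which this sketch ignores.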