Chapter 2. Replacing and Correcting Words
In this chapter, we will cover the following recipes:
- Stemming words
- Lemmatizing words with WordNet
- Replacing words matching regular expressions
- Removing repeating characters
- Spelling correction with Enchant
- Replacing synonyms
- Replacing negations with antonyms
Introduction
In this chapter, we will go over various word replacement and correction techniques. The recipes cover the gamut of linguistic compression, spelling correction, and text normalization. All of these methods can be very useful for preprocessing text before search indexing, document classification, and text analysis.
Stemming words
Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly...
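To make the idea of suffix removal and stem-based indexing concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not the algorithm used by any real stemmer: the names naive_stem, SUFFIXES, min_stem_len, and build_index are hypothetical, and the three-suffix list is an assumption chosen to handle the cooking example. Production stemmers, such as the Porter stemmer, apply far more elaborate rules.

```python
# Toy suffix-stripping stemmer (illustrative assumption, not a real algorithm).
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny suffix list


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids whose text contains it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(naive_stem(token), set()).add(doc_id)
    return index


docs = {1: "Cooking pasta", 2: "She cooks daily", 3: "He cooked rice"}
index = build_index(docs)
print(naive_stem("cooking"))   # cook
print(sorted(index["cook"]))   # [1, 2, 3]
```

All three word forms collapse into the single index entry cook, which is the space saving described above. The min_stem_len guard keeps short words such as sing from being reduced to a single letter, though a rule this crude will still mis-stem many real words.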