Replacing words matching regular expressions
We now turn to replacing words. If stemming and lemmatization are a kind of linguistic compression, then word replacement can be thought of as error correction or text normalization.
In this recipe, we will replace words based on regular expressions, with a focus on expanding contractions. Remember when we were tokenizing words in Chapter 1, Tokenizing Text and WordNet Basics, and it was clear that most tokenizers had trouble with contractions? This recipe aims to fix this by replacing contractions with their expanded forms, for example, by replacing "can't" with "cannot" or "would've" with "would have".
Getting ready
Understanding how this recipe works will require a basic knowledge of regular expressions and the re module. The key things to know are matching patterns and the re.sub() function.
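As a quick refresher on re.sub(), here is a minimal sketch of substituting matched text. The function name expand_contractions and the two patterns are illustrative assumptions, not the replacement patterns this recipe goes on to define.

```python
import re

def expand_contractions(text):
    # Hypothetical helper for illustration only.
    # A literal pattern: replace "can't" with "cannot".
    text = re.sub(r"\bcan't\b", "cannot", text)
    # A capturing group: keep the word before "'ve" and append " have".
    # \g<1> refers back to whatever the first group matched.
    text = re.sub(r"(\w+)'ve\b", r"\g<1> have", text)
    return text

print(expand_contractions("I can't believe they would've left"))
# I cannot believe they would have left
```

The first call shows plain pattern matching, and the second shows how a group reference in the replacement string carries part of the match into the output.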
How to do it...
First, we need to define a number of replacement patterns...