Combining similar words – lemmatization
We can find the canonical form of a word using lemmatization. For example, the lemma of the word cats is cat, and the lemma of the word ran is run. This is useful when we want to match a word without listing all of its possible forms. Instead, we can match on its lemma.
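To make the matching idea concrete, here is a minimal sketch that uses a tiny hand-written lemma table in place of a real lemmatizer. The names TOY_LEMMAS, toy_lemma, and contains_lemma are hypothetical and exist only for this illustration; the recipe itself uses spaCy, which derives lemmas from a trained model and rules rather than a fixed dictionary.

```python
# Toy lemma table, hand-written for illustration only (an assumption, not
# how spaCy works internally).
TOY_LEMMAS = {
    "cats": "cat",
    "ran": "run",
    "runs": "run",
    "running": "run",
}


def toy_lemma(word):
    """Return the lemma of a word, falling back to the lowercased word."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def contains_lemma(text, target_lemma):
    """Check whether any word in the text has the given lemma."""
    words = (w.strip(".,!?;:") for w in text.split())
    return any(toy_lemma(w) == target_lemma for w in words)


# One lemma stands in for every inflected form we would otherwise list.
print(contains_lemma("She ran to the store.", "run"))
print(contains_lemma("He runs every morning.", "run"))
print(contains_lemma("The dog barked.", "run"))
```

Matching on the lemma run covers ran, runs, and running with a single comparison, which is the convenience the paragraph above describes.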
Getting ready
We will be using the spaCy package for this recipe.
How to do it…
When the spaCy model processes a piece of text, the resulting Document object contains an iterator over the Token objects within it, as we saw in the Part of speech tagging recipe. These Token objects contain the lemma information for each word in the text.
Here are the steps for getting the lemmas:
- Import the file and language utils files. This will import spaCy and initialize the small_model object:
  %run -i "../util/file_utils.ipynb"
  %run -i "../util/lang_utils.ipynb"
- Create a list of words we want to lemmatize...