Understanding lemmatization
A lemma is the base form of a token. You can think of a lemma as the form in which the token appears in a dictionary. For instance, the lemma of eating is eat; the lemma of eats is eat; ate similarly maps to eat. Lemmatization is the process of reducing word forms to their lemmas. The following code is a quick example of how to do lemmatization with spaCy:
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("I went there for working and worked for 3 years.")
for token in doc:
    print(token.text, token.lemma_)
I -PRON-
went go
there there
for for
working work
and and
worked work
for for
3 3
years year
. .
By now, you should be familiar with what the first three lines of the code do. Recall that we import the spacy library, load an English model using spacy.load to create a pipeline, and apply the pipeline to the preceding sentence to get a Doc object. Here, we iterated over the tokens and printed their text and lemmas.
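To tie this back to the eating/eats/ate example from the beginning of this section, the following short sketch (assuming the same en_core_web_md model is installed) puts those forms into a sentence and prints each token's lemma. It also prints token.lemma, the integer hash counterpart of the token.lemma_ string. The exact output can vary slightly between spaCy and model versions, but eating, eats, and ate should all reduce to eat:

import spacy

nlp = spacy.load("en_core_web_md")

# The inflected forms from the start of this section, placed in a
# sentence so the tagger sees them in a verbal context
doc = nlp("She was eating, she eats, and she ate.")

for token in doc:
    # token.lemma_ holds the lemma string; token.lemma holds its
    # integer hash, which spaCy uses internally for efficiency
    print(token.text, token.lemma_, token.lemma)

This string/hash pairing follows the same pattern spaCy uses for most string attributes, such as pos/pos_ and orth/orth_.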