Introducing tokenization
We saw in Figure 2.1 that the first step in a text processing pipeline is tokenization. Tokenization always comes first because every subsequent operation works on tokens.
Tokenization simply means splitting a sentence into its tokens. A token is a unit of meaning; you can think of it as the smallest meaningful part of a piece of text. Tokens can be words, numbers, punctuation marks, currency symbols, and any other meaningful symbols that serve as the building blocks of a sentence. The following are examples of tokens (a short sketch after the list shows how spaCy separates such units):
USA
N.Y.
city
33
3rd
!
…
?
's
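To see how units like these come apart in practice, here is a minimal sketch. The sentence and its expected output are illustrative assumptions, and the exact splits can vary slightly across spaCy versions:

import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("She visited N.Y. and paid $33 for the cat's 3rd toy!")
# Abbreviations, currency symbols, numbers, the possessive 's, and
# punctuation each end up as their own token:
print([token.text for token in doc])
['She', 'visited', 'N.Y.', 'and', 'paid', '$', '33', 'for', 'the', 'cat', "'s", '3rd', 'toy', '!']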
The input to the spaCy tokenizer is a Unicode text, and the result is a Doc object. The following code shows the tokenization process:
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("I own a ginger cat.")
print([token.text for token in doc])
['I', 'own', 'a', 'ginger', 'cat', '.']
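It can help to check what nlp() hands back: the result is a Doc object, and iterating over it yields Token objects that remember their position in the text. The following minimal sketch reuses the same model and sentence; the token.i attribute gives each token's index within the Doc:

import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("I own a ginger cat.")
# nlp() returns a Doc; iterating over it yields Token objects
print(type(doc))
<class 'spacy.tokens.doc.Doc'>
print([(token.i, token.text) for token in doc])
[(0, 'I'), (1, 'own'), (2, 'a'), (3, 'ginger'), (4, 'cat'), (5, '.')]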