Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends | Romans | Countrymen | lend | me | your | ears
It is, in fact, sometimes useful to distinguish between tokens and words, but here, for ease of understanding, we will use the terms interchangeably.
We will convert the raw text into a list of words. This should preserve the original ordering of the text.
There are several ways to do this, so let's try a few of them out. We will program two methods from scratch to build our intuition, and then check how spaCy handles tokenization.
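As a starting point, here is a minimal sketch of two from-scratch approaches using only the Python standard library: splitting on whitespace, and extracting word characters with a regular expression. (The spaCy comparison requires installing the library, so it is omitted here; note that spaCy, unlike the regex below, keeps punctuation as separate tokens rather than discarding it.)

```python
import re

text = "Friends, Romans, Countrymen, lend me your ears;"

# Method 1: split on whitespace. Simple and fast, but punctuation
# stays attached to the neighboring word ("Friends," not "Friends").
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['Friends,', 'Romans,', 'Countrymen,', 'lend', 'me', 'your', 'ears;']

# Method 2: a regular expression that keeps only runs of word
# characters, discarding punctuation entirely. Ordering is preserved.
regex_tokens = re.findall(r"\w+", text)
print(regex_tokens)
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```

Both methods return the tokens in their original order; the difference is purely in how punctuation is handled.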