You can see that the first step in this pipeline is tokenizing – what exactly is this?
Tokenization is the task of splitting a text into meaningful segments, called tokens. These segments could be words, punctuation, numbers, or other special characters that are the building blocks of a sentence. In spaCy, the input to the tokenizer is a Unicode text, and the output is a Doc object [19].
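To make this concrete, here is a minimal sketch of tokenization in spaCy. It uses spacy.blank("en"), which creates a pipeline containing only the English tokenizer (so no trained model needs to be downloaded); the original example may equally use a full pipeline loaded with spacy.load:

import spacy

# A blank English pipeline still includes the language's tokenizer
nlp = spacy.blank("en")

# The tokenizer's input is a Unicode string; its output is a Doc object
doc = nlp("Let us go to the park.")

print(type(doc))                      # <class 'spacy.tokens.doc.Doc'>
print([token.text for token in doc])  # ['Let', 'us', 'go', 'to', 'the', 'park', '.']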
Different languages will have different tokenization rules. Let's look at an example of how tokenization might work in English. For the sentence "Let us go to the park.", it's quite straightforward: the text would be broken up as follows, with the appropriate numerical indices:
0   | 1  | 2  | 3  | 4   | 5    | 6
Let | us | go | to | the | park | .
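As a quick check, the table above can be reproduced directly from the Doc object, since each token carries its index within the document (its .i attribute). This is a small illustrative sketch, again assuming a blank English pipeline:

import spacy

nlp = spacy.blank("en")
doc = nlp("Let us go to the park.")

# token.i is the token's index within the Doc, matching the table above
for token in doc:
    print(token.i, token.text)
# 0 Let
# 1 us
# 2 go
# 3 to
# 4 the
# 5 park
# 6 .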
This looks awfully like the result when we just run text.split(' ') – when does tokenizing...