Tokenization is the process of splitting text into meaningful pieces called tokens. For example, we can divide a chunk of text into words, or into sentences. Depending on the task at hand, we can define our own conditions for splitting the input text into meaningful tokens. Let's take a look at how to do this.
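As a minimal sketch of the two kinds of splitting just mentioned, the following uses Python's standard `re` module (the text string and the regular expressions are illustrative choices, not part of the recipe itself; dedicated tokenizers handle many more edge cases):

```python
import re

text = "NLP is fun. Tokenization splits text into pieces!"

# Word tokenization: pull out runs of word characters.
words = re.findall(r"\w+", text)
print(words)
# ['NLP', 'is', 'fun', 'Tokenization', 'splits', 'text', 'into', 'pieces']

# Sentence tokenization: split after sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['NLP is fun.', 'Tokenization splits text into pieces!']
```

Note that the two granularities come from the same input; only the splitting rule changes.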
Preprocessing data using tokenization
Getting ready
Tokenization is the first step in the computational analysis of text: it divides a sequence of characters into minimal units of analysis called tokens. Tokens cover various categories of text (words, punctuation, numbers, and so on) and can also be complex units, such as dates. In this recipe, we...
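To illustrate how a tokenizer can keep complex units such as dates intact while still separating words, numbers, and punctuation, here is a small sketch using an ordered regular-expression alternation (the pattern and sample sentence are assumptions for illustration only):

```python
import re

text = "The meeting on 2024-05-01 cost $300, believe it or not!"

# Alternatives are tried left to right, so dates are matched first
# and survive as single tokens; then numbers, words, and finally
# individual punctuation marks.
pattern = r"\d{4}-\d{2}-\d{2}|\d+|\w+|[^\w\s]"
tokens = re.findall(pattern, text)
print(tokens)
# ['The', 'meeting', 'on', '2024-05-01', 'cost', '$', '300', ',',
#  'believe', 'it', 'or', 'not', '!']
```

The key design choice is the order of the alternatives: if `\w+` came first, the date would be broken into `2024`, `-`, `05`, `-`, `01`.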