To build a vocabulary, the first step is to break documents or sentences into chunks called tokens, each of which carries some semantic meaning. Tokenization is a fundamental step in any text-processing activity. It can be thought of as a segmentation technique that breaks larger chunks of text into smaller, meaningful ones. Tokens generally comprise words and numbers, but they can be extended to include punctuation marks, symbols, and, at times, emoticons.
Let’s go through a few examples to understand this better:
sentence = "The capital of China is Beijing"
sentence.split()
Here's the output.
['The', 'capital', 'of', 'China', 'is', 'Beijing']
A simple call to the sentence.split() method provides us with all the tokens in the sentence The capital of China is Beijing. Each token in this sentence carries a semantic meaning of its own.
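Note, however, that splitting on whitespace alone will not separate punctuation from the words it touches. Here is a small sketch, assuming the re module from Python's standard library, that keeps punctuation marks as tokens of their own; the regular-expression pattern used here is only illustrative, not a standard tokenizer.
import re
sentence = "The capital of China is Beijing."
sentence.split()
Here's the output.
['The', 'capital', 'of', 'China', 'is', 'Beijing.']
Notice how the period stays attached to Beijing. A pattern that captures runs of word characters and individual punctuation marks handles this:
re.findall(r"\w+|[^\w\s]", sentence)
Here's the output.
['The', 'capital', 'of', 'China', 'is', 'Beijing', '.']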