Subword tokenization is widely used in many state-of-the-art natural language models, including BERT and GPT-3. It is very effective at handling out-of-vocabulary (OOV) words. In this section, we will understand how subword tokenization works in detail. Before looking directly at subword tokenization, let's first take a look at word-level tokenization.
Let's suppose we have a training dataset. From this training set, we build a vocabulary: we split the text in the dataset on whitespace and add all the unique words to the vocabulary. Generally, the vocabulary consists of many words (tokens), but for the sake of an example, let's suppose our vocabulary consists of just the following words:
vocabulary = [game, the, I, played, walked, enjoy]
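To make this step concrete, the following is a minimal Python sketch of building such a word-level vocabulary. The toy training sentences and the `build_vocabulary` helper are hypothetical and used only for illustration:

```python
def build_vocabulary(corpus):
    """Split each sentence on whitespace and collect the unique words."""
    vocabulary = set()
    for sentence in corpus:
        vocabulary.update(sentence.split())
    return sorted(vocabulary)

# Hypothetical toy training corpus, chosen so that it yields the
# vocabulary shown above.
training_corpus = [
    "I played the game",
    "I enjoy the game",
    "I walked",
]

vocabulary = build_vocabulary(training_corpus)
print(vocabulary)
# ['I', 'enjoy', 'game', 'played', 'the', 'walked']
```

In practice, the vocabulary would be built from a much larger corpus and would contain many thousands of tokens; the idea, however, is the same.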
Now that we have created the vocabulary, we use it to tokenize the input. Let's consider the input sentence "I played the game". In order to create...