WordPiece tokenizer
A better way to tokenize and extract features from text documents is to use a WordPiece tokenizer. This tokenizer works by finding the pieces of text that occur most frequently and that best discriminate between the documents it sees. This kind of tokenizer needs to be trained – that is, we need to provide a set of representative texts so that it learns the right vocabulary (tokens).
Let’s look at an example where we train such a tokenizer on a simple program, a module from an open source project, and then apply it to the famous “Hello World” program in C. Let’s start by creating the tokenizer:
from tokenizers import BertWordPieceTokenizer

# initialize the actual tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=True
)
In this example,...
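Once the tokenizer object exists, it still has to be trained and then applied to the C program. The following is a minimal sketch of those two steps, continuing from the tokenizer created above; the file name module.c and the training parameters (vocabulary size, minimum frequency) are illustrative assumptions, not values from the original example:

# train the tokenizer on a representative source file;
# "module.c" is a placeholder path for the open source module
tokenizer.train(
    files=["module.c"],
    vocab_size=5000,      # illustrative vocabulary size
    min_frequency=2,      # keep only pieces that appear at least twice
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

# apply the trained tokenizer to the "Hello World" program in C
hello_world = '#include <stdio.h>\nint main() { printf("Hello World"); return 0; }'
encoding = tokenizer.encode(hello_world)
print(encoding.tokens)

Running this prints the subword pieces that the trained vocabulary assigns to the program, with continuation pieces prefixed by ## in the usual WordPiece convention.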