BPE
A more advanced method for tokenizing text is the BPE (Byte Pair Encoding) algorithm. This algorithm is based on the same premise as the compression algorithm created by Gage in the 1990s. That algorithm compresses a sequence of bytes by repeatedly replacing the most frequent pair of adjacent bytes with a single byte that does not occur in the data. The BPE tokenizer does a similar thing, except that it merges the most frequent pairs of tokens into new tokens that do not yet occur in the text. In this way, the algorithm can cover a much larger vocabulary than CountVectorizer and the WordPiece tokenizer. BPE is very popular, both for its ability to handle large vocabularies and for its efficient implementation through the fastBPE library.
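To make the merge procedure concrete, the following toy sketch implements the core BPE merge step in plain Python. It is illustrative only, not the Hugging Face or fastBPE implementation; the function name byte_pair_merge_step and the four sample words are invented for the example:

from collections import Counter

def byte_pair_merge_step(words):
    # count every adjacent pair of symbols across all words
    pairs = Counter()
    for word in words:
        pairs.update(zip(word, word[1:]))
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            # fuse the winning pair into a single new symbol
            if word[i:i + 2] == list(best):
                out.append(best[0] + best[1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged, best

# each word starts out as a sequence of single characters
words = [list("lower"), list("lowest"), list("newer"), list("wider")]
for _ in range(3):
    words, pair = byte_pair_merge_step(words)
    print("merged pair:", pair)

Running this prints the pairs chosen at each step (here "w"+"e" first, since it is the most frequent pair in the toy corpus); repeating the step until a target vocabulary size is reached is exactly how a BPE vocabulary is grown.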
Let’s explore how to apply this tokenizer to the same data and compare the results with those of the previous two tokenizers. The following code fragment shows how to instantiate a BPE tokenizer from the Hugging Face tokenizers library:
# in this example we use the tokenizers
# from the HuggingFace library
from tokenizers import Tokenizer
from tokenizers.models import BPE

# instantiate a tokenizer backed by the BPE model;
# [UNK] stands in for symbols outside the learned vocabulary
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
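Once instantiated, the model’s merge rules still have to be learned from data. As a minimal sketch of the training and encoding workflow (the in-line corpus, the vocab_size of 1000, and the sample sentence are placeholders, not the data referred to in the text):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# split on whitespace before the BPE model sees the text
tokenizer.pre_tokenizer = Whitespace()

# placeholder corpus and vocabulary size, for illustration only
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
corpus = ["the quick brown fox", "the slow brown dog"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("the quick dog")
print(encoding.tokens)

The vocab_size parameter caps how many merges are learned, which is how BPE lets us trade off vocabulary size against the granularity of the resulting subword tokens.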