The SentencePiece tokenizer
SentencePiece is a more general option than BPE for another reason: it treats whitespace as a regular token. This allows it to capture dependencies that span word boundaries and therefore to train models that understand more than just pieces of words. Hence the name: SentencePiece. The tokenizer was originally introduced to enable the tokenization of languages such as Japanese, which do not use whitespace in the same way as, for example, English. It can be installed by running the pip install -q sentencepiece command.
In the following code example, we’re instantiating and training the SentencePiece tokenizer:
import sentencepiece as spm

# this statement trains the tokenizer
spm.SentencePieceTrainer.train('--input="/content/drive/MyDrive/ds/cs_dos/nx_icmp_checksum_compute.c" --model_prefix=m --vocab_size=200')

# makes segmenter instance and
# loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
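Once m.model has been produced, we can check the result by segmenting a short string with the trained tokenizer. The following snippet is a minimal sketch that assumes the training call above has already created m.model in the working directory; the sample text is arbitrary and chosen only for illustration:

# segment a sample string into subword pieces and into token IDs
sample = 'static u16 checksum(const u8 *data)'
print(sp.encode_as_pieces(sample))
print(sp.encode_as_ids(sample))

# whitespace is kept as the ▁ metasymbol, so the original text
# can be reconstructed exactly from the pieces
print(sp.decode_pieces(sp.encode_as_pieces(sample)))

Because whitespace is encoded as a regular token (the ▁ symbol), decoding the pieces reproduces the input string exactly, which is not guaranteed by tokenizers that discard whitespace during pre-tokenization.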