Training a tokenizer and pretraining a transformer
In this chapter, we will train a transformer model named KantaiBERT
using the building blocks provided by Hugging Face for BERT-like models. The theory behind these building blocks was covered in Chapter 2, Fine-Tuning BERT Models.
We will describe KantaiBERT, building on the knowledge we acquired in the previous chapters.
KantaiBERT is a Robustly Optimized BERT Pretraining Approach (RoBERTa)-like model based on the architecture of BERT.
The initial BERT models were undertrained. RoBERTa improves the mechanics of the pretraining process, which in turn increases the performance of pretrained transformers on downstream tasks. For example, it does not use WordPiece
tokenization but goes down to byte-level Byte Pair Encoding (BPE).
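To make the byte-level BPE step concrete, here is a minimal sketch of training such a tokenizer with the Hugging Face tokenizers library. The corpus file corpus.txt, the vocabulary size, and the output directory are illustrative assumptions, not values fixed by this chapter.

```python
# A minimal sketch of byte-level BPE training with the Hugging Face
# `tokenizers` library. The corpus path, vocabulary size, and output
# directory below are illustrative assumptions.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train on a plain-text corpus; any list of .txt files works here.
tokenizer.train(
    files=["corpus.txt"],          # hypothetical corpus file
    vocab_size=52_000,             # assumed vocabulary size
    min_frequency=2,               # drop very rare merges
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save vocab.json and merges.txt for later use with a RoBERTa-like model.
os.makedirs("kantaibert_tokenizer", exist_ok=True)
tokenizer.save_model("kantaibert_tokenizer")
```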
In this chapter, KantaiBERT, like BERT, will be trained using masked language modeling.
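As a rough illustration of the masked language modeling objective, the sketch below uses DataCollatorForLanguageModeling from transformers to mask roughly 15% of the input tokens; the tokenizer directory is the hypothetical output of the previous sketch.

```python
# A sketch of the masked language modeling objective with `transformers`.
# The tokenizer directory is an assumption carried over from the
# tokenizer-training sketch above.
from transformers import RobertaTokenizerFast, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizerFast.from_pretrained("kantaibert_tokenizer")

# The collator randomly masks ~15% of input tokens and fills the labels,
# which is what the model learns to predict during pretraining.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

batch = data_collator([tokenizer("The critique of pure reason.")])
print(batch["input_ids"])
print(batch["labels"])   # -100 everywhere except at masked positions
```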
KantaiBERT will be trained as a small model with 6 layers, 12 heads, and 84...
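These sizes translate directly into a RobertaConfig. The sketch below is one plausible way to set them up; the vocabulary size, hidden size, and maximum position embeddings are assumptions for illustration rather than values given in the text.

```python
# A sketch of a small RoBERTa-like configuration matching the sizes
# mentioned above (6 layers, 12 heads). The vocabulary size, hidden size,
# and maximum position embeddings are assumptions for illustration.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,          # assumed, must match the trained tokenizer
    max_position_embeddings=514,
    num_hidden_layers=6,        # 6 transformer layers
    num_attention_heads=12,     # 12 attention heads
    hidden_size=768,
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config)
print(model.num_parameters())   # check the resulting parameter count
```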