Summary
In this chapter, we built KantaiBERT, a RoBERTa-like transformer model, from scratch using the building blocks provided by Hugging Face.
We started by loading a customized dataset on a specific topic related to the works of Immanuel Kant. Depending on your goals, you can load an existing dataset or create your own. We saw that using a customized dataset provides insights into how a transformer model thinks. However, this experimental approach has its limits: training a model beyond educational purposes would require a much larger dataset.
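As a rough illustration of this first step, a plain-text corpus can be loaded with the Hugging Face datasets library. This is a minimal sketch, not necessarily the chapter's exact loading method; it assumes kant.txt is present in the working directory.

```python
from datasets import load_dataset

# kant.txt is the chapter's corpus, assumed to be in the working
# directory. With the "text" loader, each line becomes one record.
dataset = load_dataset("text", data_files={"train": "kant.txt"})
print(dataset["train"][0]["text"])  # inspect the first line of the corpus
```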
The KantaiBERT project was used to train a tokenizer on the kant.txt dataset, and the trained merges.txt and vocab.json files were saved, as sketched below.
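Here is a minimal sketch of this step using the Hugging Face tokenizers library. The vocabulary size, special tokens, and output directory are illustrative assumptions rather than required values.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the raw text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["kant.txt"],           # the customized dataset
    vocab_size=52_000,            # assumed vocabulary size
    min_frequency=2,              # ignore very rare merges
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model() writes the vocab.json and merges.txt files.
os.makedirs("KantaiBERT", exist_ok=True)
tokenizer.save_model("KantaiBERT")
```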
A tokenizer was then recreated with our pretrained files. KantaiBERT built the customized dataset and defined a data collator to process the training batches for backpropagation. The trainer was initialized, and we explored the parameters of the RoBERTa model in detail. The model was trained and saved...
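The remaining steps can be condensed into a sketch like the one below: recreating the tokenizer from the saved files, defining the dataset and data collator, configuring a small RoBERTa model, initializing the trainer, and training and saving the model. The configuration sizes, masking probability, and training arguments are illustrative assumptions, not definitive settings.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Recreate the tokenizer from the saved vocab.json/merges.txt files.
tokenizer = RobertaTokenizerFast.from_pretrained("KantaiBERT", model_max_length=512)

# Define a small RoBERTa-like architecture; these sizes are assumptions.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(model.num_parameters())  # explore the model's parameter count

# Build the customized dataset and the data collator that masks tokens
# to produce the training batches.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="kant.txt", block_size=128,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
)

# Initialize the trainer, train the model, and save it.
training_args = TrainingArguments(
    output_dir="KantaiBERT",
    num_train_epochs=1,
    per_device_train_batch_size=64,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("KantaiBERT")
```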