Summary
In this chapter, we built KantaiBERT, a RoBERTa-like transformer model, from scratch using the building blocks provided by Hugging Face.
We started by loading a customized dataset on a specific topic: the works of Immanuel Kant. You can load an existing dataset or create your own, depending on your goals. We saw that using a customized dataset provides insight into the way a transformer model thinks. However, this experimental approach has its limits: it would take a much larger dataset to train a model beyond educational purposes.
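As a minimal sketch of that first step, the corpus can be read in as plain text once it has been saved locally as kant.txt (any plain-text corpus would work the same way):

```python
from pathlib import Path

# Load the Kant compilation; assumes kant.txt is in the working directory.
text = Path("kant.txt").read_text(encoding="utf-8")
print(f"Loaded {len(text):,} characters")
print(text[:200])  # peek at the first few sentences
```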
The KantaiBERT project was then used to train a tokenizer on the kant.txt dataset, and the trained merges.txt and vocab.json files were saved. A tokenizer was recreated from our pretrained files. KantaiBERT built the customized dataset and defined a data collator to process the training batches for backpropagation. The trainer was initialized, and we explored the parameters of the RoBERTa model in detail. The model was trained...
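For reference, the steps recapped above map onto the Hugging Face API roughly as follows. This is a minimal sketch rather than the chapter's exact notebook: the hyperparameter values (vocabulary size, layer count, batch size, number of epochs) and the KantaiBERT output directory are illustrative assumptions.

```python
import os
from tokenizers import ByteLevelBPETokenizer
from transformers import (
    RobertaConfig, RobertaTokenizer, RobertaForMaskedLM,
    LineByLineTextDataset, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

# Step 1: train a byte-level BPE tokenizer on the Kant corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["kant.txt"], vocab_size=52_000, min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("KantaiBERT", exist_ok=True)
tokenizer.save_model("KantaiBERT")  # writes vocab.json and merges.txt

# Step 2: recreate the tokenizer from the pretrained files.
tokenizer = RobertaTokenizer.from_pretrained("KantaiBERT", max_length=512)

# Step 3: configure and instantiate a small RoBERTa-like model.
config = RobertaConfig(
    vocab_size=52_000, max_position_embeddings=514,
    num_attention_heads=12, num_hidden_layers=6, type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Step 4: build the dataset and a data collator that masks 15% of
# tokens for the masked language modeling objective. (Newer versions
# of transformers deprecate LineByLineTextDataset in favor of the
# datasets library, but it illustrates the step.)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="kant.txt", block_size=128,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
)

# Step 5: initialize the trainer, train, and save the model.
training_args = TrainingArguments(
    output_dir="KantaiBERT", overwrite_output_dir=True,
    num_train_epochs=1, per_device_train_batch_size=64,
    save_steps=10_000, save_total_limit=2,
)
trainer = Trainer(
    model=model, args=training_args,
    data_collator=data_collator, train_dataset=dataset,
)
trainer.train()
trainer.save_model("KantaiBERT")
```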