Summary
In this chapter, we built KantaiBERT, a RoBERTa-like transformer model, from scratch using the building blocks provided by Hugging Face.

We first started by loading a customized dataset on a specific topic related to the works of Immanuel Kant. Depending on your goals, you can load an existing dataset or create your own. We saw that using a customized dataset provides insights into how a transformer model thinks. However, this experimental approach has its limits: training a model beyond educational purposes would require a much larger dataset.

The KantaiBERT project was used to train a tokenizer on the kant.txt dataset.
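In code, that step looked broadly like the following sketch, which assumes the Hugging Face tokenizers library and a local kant.txt file; the vocabulary size, special tokens, and output directory are illustrative values rather than the chapter's exact settings:

```python
# Minimal sketch: train a byte-level BPE tokenizer on the customized dataset.
# vocab_size, min_frequency, the special tokens, and the output directory
# are illustrative assumptions.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["kant.txt"],
    vocab_size=52_000,      # assumed vocabulary size
    min_frequency=2,        # ignore pairs seen fewer than twice
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Persist the learned merge rules and vocabulary to disk
Path("KantaiBERT").mkdir(exist_ok=True)
tokenizer.save_model("KantaiBERT")  # writes vocab.json and merges.txt
```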
The trained merges.txt and vocab.json files were saved. A tokenizer was recreated with our pretrained files.
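Recreating the tokenizer from those saved files can be done with a fast RoBERTa tokenizer, as in this sketch; the directory name and maximum length shown here are assumptions:

```python
# Sketch: rebuild the tokenizer from the saved vocab.json and merges.txt.
# The "./KantaiBERT" path and model_max_length are illustrative.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(
    "./KantaiBERT", model_max_length=512
)

# Quick check that the pretrained vocabulary is in use
print(tokenizer("The critique of pure reason.")["input_ids"])
```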
KantaiBERT built the customized dataset and defined a data collator to process the training batches for backpropagation. The trainer was initialized, and we explored the parameters of the RoBERTa model in detail. The model was trained and saved. We saved the...
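The collator-and-trainer pipeline described above can be condensed into the following sketch, assuming the transformers library; the configuration sizes, paths, and training hyperparameters are illustrative assumptions, not necessarily the chapter's exact settings:

```python
# Condensed sketch of the training pipeline: build a small RoBERTa-style
# model, a masked-language-modeling data collator, and a trainer.
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained(
    "./KantaiBERT", model_max_length=512
)

# A small RoBERTa-style configuration (assumed sizes)
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(f"{model.num_parameters():,} parameters")  # inspect the model's size

# Build the dataset and a data collator that dynamically masks
# 15% of the tokens in each training batch
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="kant.txt", block_size=128
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Initialize the trainer, train, and save the model
training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./KantaiBERT")
```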