AR language model training
In this section, you will learn how to train your own autoregressive (AR) language models. We will start with GPT-2 and take a closer look at the functions it uses for training, using the transformers library.
You can use any corpus to train your own GPT-2, but for this example, we used Emma, a romantic novel by Jane Austen. Training on a much larger corpus is highly recommended if you want more general language generation.
Before we start, note that we used TensorFlow's native training functionality to show that Hugging Face models can be trained directly in TensorFlow or PyTorch, whichever you prefer. Follow these steps:
- You can download the Emma novel raw text by using the following command:
wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/gutenberg/austen-emma.txt
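If wget is not available, the same download can be sketched in Python with only the standard library. The snippet below is a minimal illustration, not part of the original workflow: the helper names `fetch_corpus` and `corpus_stats` and the local filename are assumptions introduced here, and only the URL comes from the command above.

```python
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
EMMA_URL = (
    "https://raw.githubusercontent.com/teropa/nlp/master/"
    "resources/corpora/gutenberg/austen-emma.txt"
)


def fetch_corpus(url: str, dest: str) -> Path:
    """Download `url` to `dest`, skipping the download if the file exists.

    Hypothetical helper for illustration only.
    """
    path = Path(dest)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


def corpus_stats(path) -> dict:
    """Return simple size statistics as a sanity check on the raw text.

    Hypothetical helper for illustration only.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return {
        "chars": len(text),               # total characters
        "lines": len(text.splitlines()),  # number of lines
        "words": len(text.split()),       # whitespace-separated tokens
    }
```

Calling `fetch_corpus(EMMA_URL, "austen-emma.txt")` and then passing the returned path to `corpus_stats` gives a quick check that the file was saved and is not empty before any tokenizer training begins.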
- The first step is to train the BytePairEncoding tokenizer for GPT-2 on a corpus that you intend to train your...