GLM training
In this section, you will learn how to train your own language model. We will start with GPT-2 and take a closer look at how to train it using the transformers library.
You can use any corpus to train your own GPT-2, but for this example, we used Emma, a romantic novel by Jane Austen. Training on a much larger corpus is highly recommended if you want to generate more general language.
Before we start, it’s good to note that we used TensorFlow’s native training functionality to show that all Hugging Face models can be trained directly with TensorFlow or PyTorch if you wish (see the sketch after these steps). Follow these steps:
- You can download the raw text of Emma by using the following command:
wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/gutenberg/austen-emma.txt
- The next step is to train a BytePairEncoding (BPE) tokenizer for GPT-2 on the corpus that you intend to train your GPT-2 on, as shown in the sketch below.
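A minimal sketch of this step, assuming the Hugging Face tokenizers library and its ByteLevelBPETokenizer (GPT-2 uses a byte-level BPE vocabulary); the vocabulary size and the tokenizer_gpt output directory are illustrative choices, not values from the original text:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer, the tokenizer family GPT-2 uses,
# on the downloaded corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["austen-emma.txt"],
    vocab_size=50000,                     # illustrative; GPT-2's own vocabulary is 50,257
    special_tokens=["<|endoftext|>"],     # GPT-2's end-of-text token
)

# Save vocab.json and merges.txt; "tokenizer_gpt" is a hypothetical directory name.
os.makedirs("tokenizer_gpt", exist_ok=True)
tokenizer.save_model("tokenizer_gpt")
```

The saved vocab.json and merges.txt files can later be reloaded, for example with GPT2TokenizerFast.from_pretrained("tokenizer_gpt"), when encoding the corpus for training.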
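As noted before the steps, Hugging Face models can be trained with TensorFlow natively. As an illustration only, not the section's full training code, a custom training step for a freshly initialized GPT-2 might look like the following; the configuration values and the train_step helper are assumptions made for this sketch:

```python
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel

# Hypothetical configuration; vocab_size should match the tokenizer trained above.
config = GPT2Config(vocab_size=50000)
model = TFGPT2LMHeadModel(config)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)

@tf.function
def train_step(input_ids):
    with tf.GradientTape() as tape:
        # Passing labels=input_ids makes the model compute the
        # next-token (shifted) language-modeling loss internally.
        outputs = model(input_ids, labels=input_ids)
        loss = tf.reduce_mean(outputs.loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Each input_ids batch here would come from encoding the corpus with the tokenizer trained in the previous step.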