Standard NLP tasks with specific vocabulary
This section focuses on Case 4: Rare words and Case 5: Replacing rare words from the Word2Vec tokenization section of this chapter.
We will use Training_OpenAI_GPT_2_CH09.ipynb, a renamed version of the notebook we used to train a dataset in Chapter 7, The Rise of Suprahuman Transformers with GPT-3 Engines.
Two changes were made to the notebook:
- dset, the dataset, was renamed mdset and contains medical content
- A Python function was added to control the text that was tokenized using byte-level BPE
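To make the second change concrete, here is a minimal sketch of what a tokenization control function could look like. It is not the function from the notebook: the names bpe_segment and control_tokenization, the toy merge table, the max_pieces threshold, and the sample word are all hypothetical and chosen only for illustration. The sketch applies ranked BPE merges to the UTF-8 bytes of each word and reports the words that end up fragmented into many pieces, which is typically what happens to rare vocabulary.

```python
from typing import Dict, List, Tuple

MergeRanks = Dict[Tuple[str, str], int]


def bpe_segment(word: str, merge_ranks: MergeRanks) -> List[str]:
    """Apply ranked BPE merges to the UTF-8 bytes of a word (toy version)."""
    # Start from single bytes; latin-1 maps each byte to exactly one character.
    symbols = [bytes([b]).decode("latin-1") for b in word.encode("utf-8")]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank (the earliest learned merge).
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge applies any more
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def control_tokenization(
    text: str, merge_ranks: MergeRanks, max_pieces: int = 3
) -> Dict[str, List[str]]:
    """Return the words that were split into more than max_pieces pieces."""
    report: Dict[str, List[str]] = {}
    for word in text.split():
        pieces = bpe_segment(word, merge_ranks)
        if len(pieces) > max_pieces:
            report[word] = pieces
    return report


# Hypothetical merge table: only "the" has been learned as a full token.
toy_merges: MergeRanks = {("t", "h"): 0, ("th", "e"): 1}

if __name__ == "__main__":
    for word, pieces in control_tokenization("the amoeboid", toy_merges).items():
        print(word, pieces)
```

With a real tokenizer, the merge table would come from the trained vocabulary rather than a hand-written dictionary. The principle stays the same: a word that breaks into many small pieces is a candidate for closer inspection or replacement.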
We will not describe Training_OpenAI_GPT_2_CH09.ipynb, which we covered in Chapter 7, The Rise of Suprahuman Transformers with GPT-3 Engines, and Appendices III and IV. Make sure you upload the necessary files before beginning, as explained in Chapter 7.
You can train the model for as long as you wish. Interrupt the training when you want to save the model.
The files are on GitHub in the gpt-2-train_files...