Standard NLP tasks with specific vocabulary
This section focuses on Case 3: Rare words and Case 4: Replacing rare words from the Word2Vec
tokenization section of this chapter.
We will use Training_OpenAI_GPT_2_CH08.ipynb, a renamed version of the notebook we used to train a model on a dataset in Chapter 6, Text Generation with OpenAI GPT-2 and GPT-3 Models.
Two changes were made to the notebook:
- dset, the dataset, was renamed mdset and contains medical content
- A Python function was added to control the text that was tokenized using byte-level BPE
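As a rough illustration of what a control function of this kind might look like, here is a minimal sketch. It is not the function from the notebook: the name inspect_tokenization, the TOY_VOCAB list, and the toy_encode and toy_decode helpers are all hypothetical, and the toy tokenizer is a simple greedy stand-in rather than byte-level BPE. The idea is only to show how decoding one token ID at a time reveals where a word was split.

```python
from typing import Callable, Dict, List


def inspect_tokenization(
    words: List[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> Dict[str, List[str]]:
    """Map each word to the text pieces its token IDs decode to."""
    report = {}
    for word in words:
        token_ids = encode(word)
        # Decode one ID at a time to see where the word was split.
        report[word] = [decode([token_id]) for token_id in token_ids]
    return report


# Hypothetical stand-in tokenizer, for illustration only (NOT byte-level BPE):
# known pieces get one ID each, anything else falls back to single characters.
TOY_VOCAB = ["amoeboid", "cyto", "plasm", "cell"]


def toy_encode(text: str) -> List[int]:
    ids = []
    i = 0
    pieces = sorted(TOY_VOCAB, key=len, reverse=True)
    while i < len(text):
        # Greedy longest match against the toy vocabulary.
        match = next((p for p in pieces if text.startswith(p, i)), None)
        if match is not None:
            ids.append(TOY_VOCAB.index(match))
            i += len(match)
        else:
            # Character fallback: offset the code point past the vocabulary IDs.
            ids.append(len(TOY_VOCAB) + ord(text[i]))
            i += 1
    return ids


def toy_decode(ids: List[int]) -> str:
    return "".join(
        TOY_VOCAB[t] if t < len(TOY_VOCAB) else chr(t - len(TOY_VOCAB))
        for t in ids
    )


if __name__ == "__main__":
    result = inspect_tokenization(["cytoplasm", "cell", "cat"], toy_encode, toy_decode)
    for word, pieces in result.items():
        print(word, pieces)
```

To try the same check against a real tokenizer, pass that tokenizer's own encode and decode callables in place of the toy ones. With a byte-level tokenizer, a single token can hold only part of a multi-byte character, so the per-token pieces are most readable for ASCII text.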
We will not describe Training_OpenAI_GPT_2_CH08.ipynb in detail. If necessary, take some time to go back through Chapter 6, Text Generation with OpenAI GPT-2 and GPT-3 Models. Make sure you upload the necessary files before beginning, as explained in Chapter 6. The files are on GitHub in the gpt-2-train_files directory of Chapter08. Although we are using the same notebook as in Chapter 6, note that the dataset, dset...