Pre-training and fine-tuning
Training an NLP model typically involves two stages: pre-training and fine-tuning. In this section, we will discuss the main differences between these two concepts.
Pre-training is where we train a giant NLP model from scratch. It requires a huge training dataset (for example, all of the Wikipedia pages). It works as follows:
- We initialize the model weights.
- We partition the giant model across hundreds or thousands of GPUs via model parallelism.
- We feed the huge training dataset into the model-parallel training pipeline and train for several weeks or months.
- Once the model has converged to a good local minimum, we stop the training and call the result a pre-trained model.
By following the preceding steps, we can get a pre-trained NLP model.
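The following is a minimal sketch of these steps in PyTorch. It uses only two GPUs instead of hundreds, a toy transformer instead of a giant model, and illustrative names (`TwoStageModel`, `pretrain`, `data_loader`) that are assumptions for this example rather than any particular library's API:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy 'giant' model split across two GPUs (naive model parallelism)."""
    def __init__(self, vocab_size=30000, hidden=1024):
        super().__init__()
        # Step 1: weights are initialized here; the first half of the network
        # is placed on GPU 0 and the second half on GPU 1 (Step 2).
        self.embed = nn.Embedding(vocab_size, hidden).to("cuda:0")
        self.encoder = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True).to("cuda:0")
        self.decoder = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True).to("cuda:1")
        self.lm_head = nn.Linear(hidden, vocab_size).to("cuda:1")

    def forward(self, token_ids):
        x = self.encoder(self.embed(token_ids))   # runs on GPU 0
        x = x.to("cuda:1")                        # activations cross devices
        return self.lm_head(self.decoder(x))      # runs on GPU 1

def pretrain(model, data_loader, epochs=1):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                       # in practice: weeks or months
        for token_ids, labels in data_loader:     # Step 3: feed the huge dataset
            logits = model(token_ids.to("cuda:0"))
            loss = loss_fn(logits.view(-1, logits.size(-1)),
                           labels.to("cuda:1").view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Step 4: once converged, save the weights as the pre-trained checkpoint.
    torch.save(model.state_dict(), "pretrained_model.pt")
```

In a real system, the partitioning and the training loop would be handled by a distributed training framework rather than written by hand, but the overall flow is the same: initialize, partition, feed data, and save the converged checkpoint.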
Note that the pre-training process often takes huge amounts of computational resources and time. As of now, only big companies such as...