In this section, we will learn how to pre-train the BERT model. But what does pre-training mean? Say we have a model. First, we train the model on a huge dataset for a particular task and save the trained model. Now, for a new task, instead of initializing a new model with random weights, we initialize it with the weights of our already trained (pre-trained) model. That is, since the model has already been trained on a huge dataset, instead of training a new model from scratch for the new task, we take the pre-trained model and adjust (fine-tune) its weights for the new task. This is a type of transfer learning.
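To make the pre-train, save, and fine-tune workflow concrete, here is a minimal sketch in plain Python. It is a toy illustration, not BERT: the model is a one-variable linear function, and the tasks, dataset sizes, learning rates, and helper names (`make_data`, `train`, `mse`) are all assumptions chosen for the example. It trains on a large dataset, saves the weights, and then compares fine-tuning from the saved weights against training from random weights on a small dataset for a related task.

```python
import json
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_data(slope, intercept, n):
    """Hypothetical task: learn y = slope * x + intercept from n samples."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x + intercept) for x in xs]


def train(weights, data, lr, epochs):
    """Plain SGD on squared error for the toy model y = w * x + b."""
    w, b = weights["w"], weights["b"]
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return {"w": w, "b": b}


def mse(weights, data):
    """Mean squared error of the toy model on a dataset."""
    w, b = weights["w"], weights["b"]
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def random_weights():
    return {"w": rng.uniform(-1.0, 1.0), "b": rng.uniform(-1.0, 1.0)}


# 1. Pre-training: start from random weights and train on a large dataset.
big_data = make_data(slope=3.0, intercept=1.0, n=1000)
pretrained = train(random_weights(), big_data, lr=0.1, epochs=5)

# 2. Save the trained weights (a JSON string stands in for a checkpoint file).
checkpoint = json.dumps(pretrained)

# 3. A new, related task for which only a small dataset is available.
small_data = make_data(slope=3.2, intercept=0.8, n=10)

# 4a. Fine-tuning: initialize from the saved weights, then adjust them.
fine_tuned = train(json.loads(checkpoint), small_data, lr=0.05, epochs=1)

# 4b. Baseline: initialize randomly and train with the same small budget.
from_scratch = train(random_weights(), small_data, lr=0.05, epochs=1)

print("fine-tuned MSE:  ", mse(fine_tuned, small_data))
print("from-scratch MSE:", mse(from_scratch, small_data))
```

Because the new task is close to the original one, the saved weights are already near a good solution, so a single pass over ten examples leaves the fine-tuned model closer to the new target than the randomly initialized one. The same idea, at a much larger scale, is what makes a pre-trained model a useful starting point.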
The BERT model is pre-trained on a huge corpus using two tasks: masked language modeling and next sentence prediction. After pre-training, we save the pre-trained BERT model. For a new task, say question answering, instead of training BERT from scratch, we use the pre-trained BERT model. That...