- We can use the pre-trained model in the following two ways:
- As a feature extractor, by extracting the contextual embeddings (see the sketch after this list)
- By fine-tuning the pre-trained BERT model on downstream tasks such as text classification, question-answering, and more
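A minimal sketch of the feature-extractor approach, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the example sentence is purely illustrative:

```python
# Extracting contextual embeddings from pre-trained BERT (feature-extractor use).
# Assumes the Hugging Face transformers library and bert-base-uncased.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("I love Paris", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token; the [CLS] embedding (position 0) is often
# used as a representation of the whole sentence.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```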
- The [PAD] token is appended to shorter sequences so that every sequence in a batch reaches the same token length.
- To let the model know that the [PAD] tokens are added only to match the sequence length and are not part of the actual input, we use an attention mask. We set the attention mask value to 1 for all real token positions and 0 for the positions holding [PAD] tokens.
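A small sketch of how padding and the attention mask look in practice, assuming the Hugging Face transformers tokenizer; the sentence and max_length are illustrative:

```python
# Padding a short sentence and inspecting the resulting attention mask.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer("I love Paris", padding='max_length', max_length=8)

print(encoded['input_ids'])       # [PAD] tokens (id 0) fill the sequence up to length 8
print(encoded['attention_mask'])  # 1 for real tokens, 0 for the [PAD] positions
```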
- Fine-tuning implies that we are not training BERT from scratch; instead, we are using the already-trained BERT and updating its weights according to our task.
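A hedged sketch of what updating the pre-trained weights looks like for text classification, assuming the Hugging Face transformers library; the classification head, label, and learning rate are illustrative, not prescribed by the note above:

```python
# One fine-tuning step: the already-trained BERT weights are updated, not learned from scratch.
# Assumes the Hugging Face transformers library; labels and hyperparameters are illustrative.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

inputs = tokenizer("I love Paris", return_tensors="pt")
labels = torch.tensor([1])  # toy label for this single example

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**inputs, labels=labels)
outputs.loss.backward()   # gradients flow through the pre-trained weights
optimizer.step()          # weights are updated for the downstream task
optimizer.zero_grad()
```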
- For each token $i$, we compute the dot product between the representation of the token $R_i$ and the start vector $S$. Next, we apply the softmax function to the dot products and obtain the probability $P_i = \dfrac{e^{S \cdot R_i}}{\sum_j e^{S \cdot R_j}}$. Next, we compute the starting index by selecting the index of the token with the highest probability of being the start token.
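A minimal sketch of the start-index computation just described; the tensor names R (token representations) and S (start vector) and the sizes are illustrative assumptions:

```python
# Computing the start index from token representations and the start vector.
import torch
import torch.nn.functional as F

seq_len, hidden_size = 10, 768
R = torch.randn(seq_len, hidden_size)     # representation R_i of each token i (assumed shape)
S = torch.randn(hidden_size)              # start vector S (assumed shape)

scores = R @ S                            # dot product S . R_i for every token i
probs = F.softmax(scores, dim=0)          # P_i = exp(S.R_i) / sum_j exp(S.R_j)
start_index = torch.argmax(probs).item()  # token with the highest start probability
```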