RoBERTa (Robustly Optimized BERT pre-training Approach) is another interesting and popular variant of BERT. Its researchers observed that BERT was significantly undertrained and proposed several modifications to its pre-training procedure. RoBERTa is essentially BERT with the following changes in pre-training:
- Use dynamic masking instead of static masking in the MLM task.
- Remove the NSP task and train using only the MLM task.
- Train with a large batch size.
- Use byte-level BPE (BBPE) as the tokenizer (a quick comparison with BERT's tokenizer is sketched right after this list).
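To see the tokenizer difference in practice, here is a minimal sketch, assuming the Hugging Face transformers library is installed and the pre-trained bert-base-uncased and roberta-base checkpoints are available:

```python
from transformers import BertTokenizer, RobertaTokenizer

sentence = "We arrived at the airport in time"

# BERT uses a WordPiece vocabulary (the uncased variant also lowercases text).
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(bert_tokenizer.tokenize(sentence))

# RoBERTa uses byte-level BPE (BBPE); it keeps the original casing and marks
# word-initial spaces with the 'Ġ' symbol, e.g. ['We', 'Ġarrived', 'Ġat', ...].
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
print(roberta_tokenizer.tokenize(sentence))
```

Because BBPE operates on bytes, any input string can be encoded without falling back to an unknown token.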
Now, let's look into the details and discuss each of the preceding points.
Using dynamic masking instead of static masking
We learned that we pre-train BERT using the MLM and NSP tasks. In the MLM task, we randomly mask 15% of the tokens and let the network predict the masked tokens.
For instance, say we have the sentence We arrived at the airport in time. Now, after tokenizing and adding [CLS] and [SEP] tokens, we have the following:
tokens = [ [CLS], we, arrived, at, the, airport, in, time, [SEP] ]
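To make the masking step concrete, here is a minimal sketch of masking roughly 15% of the tokens on the fly. The helper dynamic_mask and its parameters are illustrative only, not part of any library, and for simplicity it always substitutes [MASK], whereas the full MLM objective also leaves some selected tokens unchanged or replaces them with random tokens:

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, special_tokens=('[CLS]', '[SEP]')):
    """Randomly replace ~15% of the non-special tokens with [MASK]."""
    masked = list(tokens)
    # Never mask the special [CLS] and [SEP] tokens.
    candidates = [i for i, t in enumerate(masked) if t not in special_tokens]
    num_to_mask = max(1, int(len(candidates) * mask_prob))
    for i in random.sample(candidates, num_to_mask):
        masked[i] = '[MASK]'
    return masked

tokens = ['[CLS]', 'we', 'arrived', 'at', 'the', 'airport', 'in', 'time', '[SEP]']

# Each call re-samples the masked positions, so the same sentence
# gets a different masking pattern every time it is fed to the model:
print(dynamic_mask(tokens))
print(dynamic_mask(tokens))
```

Re-sampling the masked positions every time the sentence is seen, rather than fixing them once during preprocessing, is the essence of the dynamic masking discussed next.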