We started the chapter by looking at how ALBERT works. We learned that ALBERT is a lite version of BERT that uses two parameter reduction techniques: cross-layer parameter sharing and factorized embedding parameterization. We also learned about the sentence order prediction (SOP) task used in ALBERT. SOP is a binary classification task in which the model classifies whether the order of a given sentence pair has been swapped.
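To make the effect of factorized embedding parameterization concrete, the following minimal sketch compares the number of embedding parameters with and without factorization. The function names are hypothetical, and the sizes used (a vocabulary of 30,000, an embedding size of 128, and a hidden size of 768) are illustrative assumptions rather than values taken from this summary:

```python
def full_embedding_params(vocab_size, hidden_size):
    """Parameters of a single V x H embedding matrix (no factorization)."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    """Parameters of a V x E matrix followed by an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes (assumptions made for this example only)
V, E, H = 30000, 128, 768

full = full_embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)

print(f"Without factorization: {full:,} parameters")
print(f"With factorization:    {factorized:,} parameters")
```

With these assumed sizes, the single V x H matrix needs 23,040,000 parameters, whereas the two smaller matrices need 3,938,304 in total. The saving grows as the hidden size gets larger relative to the embedding size.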
After understanding the ALBERT model, we looked into the RoBERTa model. We learned that RoBERTa is a variant of BERT that uses only the MLM task for training. Unlike BERT, it uses dynamic masking instead of static masking, and it is trained with a large batch size. It uses byte-level byte pair encoding (BBPE) as its tokenizer and has a vocabulary size of 50,000.
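The difference between static and dynamic masking can be illustrated with a small sketch. With static masking, the masked positions are chosen once during preprocessing; with dynamic masking, they are chosen again every time the sentence is fed to the model. The `dynamic_mask` helper below is hypothetical and deliberately simplified: it always substitutes `[MASK]` and works on whitespace-split words rather than BBPE tokens, so it is not RoBERTa's actual implementation:

```python
import random

MASK = "[MASK]"


def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a freshly masked copy of tokens and the masked positions.

    Simplified assumption: every selected token is replaced with [MASK].
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    # Mask roughly mask_prob of the tokens, but at least one
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)


tokens = "we arrived at the airport in time".split()
rng = random.Random(0)

# Calling the function once per epoch redraws the masked positions each time
for epoch in range(3):
    masked, positions = dynamic_mask(tokens, rng=rng)
    print(f"epoch {epoch}: {' '.join(masked)}")
```

Because the mask is redrawn on every call, the model can see the same sentence with different masked tokens across epochs, which is the idea behind dynamic masking.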
Following RoBERTa, we learned about the ELECTRA model. Instead of using the MLM task as a pre-training objective, ELECTRA uses a new pre-training strategy called replaced token detection. In the replaced...