We began the chapter by understanding how the M-BERT model works. We learned that M-BERT is trained without any cross-lingual objective, just as we trained the BERT model, and that it produces representations that generalize across multiple languages for downstream tasks.
Moving on, we investigated how multilingual M-BERT really is. We learned that M-BERT's generalizability does not depend on vocabulary overlap, relying instead on typological and language similarity. We also saw that M-BERT can handle code-switched text, but not transliterated text.
Later, we learned about the XLM model, where BERT is trained with a cross-lingual objective. We train XLM using the MLM and TLM tasks. TLM works just like MLM, except that in TLM, we train the model on cross-lingual data, that is, parallel data consisting of the same text in two different languages.
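To make the difference between the two objectives concrete, the following is a minimal sketch (not the code used to train XLM) of how an MLM input and a TLM input can be constructed. The sentence pair, the masking rate, and the special tokens are illustrative assumptions:

```python
import random

# Illustrative English-French parallel sentence pair (assumed example data).
en = "how are you".split()
fr = "comment allez vous".split()

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with a mask token, as in BERT's MLM objective."""
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

# MLM: the model sees a single monolingual sentence with some tokens masked.
mlm_input = ["[CLS]"] + mask_tokens(en) + ["[SEP]"]

# TLM: the two parallel sentences are concatenated and tokens are masked in
# both languages, so the model can attend to the translation in the other
# language when predicting a masked word.
tlm_input = ["[CLS]"] + mask_tokens(en) + ["[SEP]"] + mask_tokens(fr) + ["[SEP]"]

print(mlm_input)
print(tlm_input)
```

As the sketch suggests, TLM gives the model an extra source of context: if a word is masked in the English sentence, it can still be recovered from the French translation, which encourages the model to align representations across the two languages.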
Next, we explored the XLM-R model, which uses the RoBERTa architecture. We train the XLM-R model only on the MLM task and...