In the previous sections, we learned how M-BERT works and investigated how multilingual it is. We saw that M-BERT is pre-trained just like the regular BERT model, without any specific cross-lingual objective. In this section, let's learn how to pre-train BERT with a cross-lingual objective. We refer to BERT trained with a cross-lingual objective as a cross-lingual language model (XLM). The XLM model learns cross-lingual representations and performs better than M-BERT.
The XLM model is pre-trained using both monolingual and parallel datasets. A parallel dataset consists of text in a language pair, that is, the same text in two different languages. For example, for each English sentence, we have the corresponding sentence in another language, such as French. We can also call this parallel dataset a cross-lingual dataset.
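To make the difference between the two kinds of data concrete, here is a minimal sketch of how they might be represented in Python. The `ParallelPair` type, the `languages` helper, and the example sentences are all hypothetical and are made up for illustration; they do not come from the actual XLM training data or code.

```python
from typing import List, NamedTuple, Set


class ParallelPair(NamedTuple):
    """One entry of a parallel dataset: the same text in two languages (hypothetical type)."""
    source_lang: str
    source_text: str
    target_lang: str
    target_text: str


# Monolingual data: plain sentences, each in a single language (illustrative examples).
monolingual_dataset: List[str] = [
    "I am a student",
    "Paris is a beautiful city",
]

# Parallel (cross-lingual) data: each English sentence is aligned
# with its French translation (illustrative examples).
parallel_dataset: List[ParallelPair] = [
    ParallelPair("en", "I am a student", "fr", "Je suis étudiant"),
    ParallelPair("en", "Thank you", "fr", "Merci"),
]


def languages(pairs: List[ParallelPair]) -> Set[str]:
    """Collect every language code that appears in the parallel dataset."""
    return {lang for pair in pairs for lang in (pair.source_lang, pair.target_lang)}


for pair in parallel_dataset:
    print(f"{pair.source_lang}: {pair.source_text} | {pair.target_lang}: {pair.target_text}")
```

The key point is that every entry of the parallel dataset carries two aligned sentences, whereas the monolingual dataset holds single-language text only.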
The monolingual dataset is obtained from Wikipedia, and the parallel dataset...