The XLM-R model, short for XLM-RoBERTa, is an extension of XLM with a few modifications that improve performance, and it represents the state of the art in learning cross-lingual representations. In the previous section, we saw that XLM is trained with the MLM and TLM tasks: MLM uses a monolingual dataset, while TLM requires a parallel dataset. Obtaining such parallel data is difficult for low-resource languages, so XLM-R drops the TLM objective and is trained with MLM alone. Thus, the XLM-R model requires only a monolingual dataset.
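Because XLM-R is trained only with MLM, we can probe the pre-trained model directly by masking a token and letting the model fill in the blank. The following is a minimal sketch using the Hugging Face transformers library (it assumes the library is installed and can download the xlm-roberta-base checkpoint):

```python
from transformers import pipeline

# Load the pre-trained XLM-R checkpoint as a fill-mask pipeline;
# fill-mask is simply the MLM objective applied at inference time
predictor = pipeline("fill-mask", model="xlm-roberta-base")

# The same checkpoint handles any of the 100 languages it was
# trained on; XLM-R's mask token is <mask>
print(predictor("Paris is the <mask> of France."))
print(predictor("Paris est la <mask> de la France."))
```

Note that no language ID is passed anywhere: a single XLM-R checkpoint serves all of its training languages.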
XLM-R is trained on a huge dataset, 2.5 TB in size, obtained by filtering unlabeled text in 100 languages from the CommonCrawl corpus. We also increase the proportion of low-resource languages in the dataset through sampling (illustrated below). The following diagram provides a comparison of the corpus size of the CommonCrawl...
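To make the sampling idea concrete: if q_i is the fraction of the training text belonging to language i, languages can be resampled with probability proportional to q_i raised to the power alpha, where an exponent alpha < 1 flattens the distribution and boosts low-resource languages (the XLM-R paper uses alpha = 0.3). The following is a small illustrative sketch; the corpus shares here are made-up numbers, not the real CommonCrawl proportions:

```python
# Toy illustration of exponent-based language sampling: alpha < 1
# flattens the distribution, boosting low-resource languages.
corpus_share = {"en": 0.60, "ru": 0.25, "hi": 0.10, "sw": 0.05}
alpha = 0.3  # exponent used in the XLM-R paper

# Raise each share to the power alpha, then renormalize to get
# the probability of sampling each language during training
weights = {lang: share ** alpha for lang, share in corpus_share.items()}
total = sum(weights.values())
sample_prob = {lang: w / total for lang, w in weights.items()}

for lang in corpus_share:
    print(f"{lang}: corpus share {corpus_share[lang]:.2f} -> "
          f"sampling probability {sample_prob[lang]:.2f}")
```

Running this sketch, Swahili's effective share rises from 0.05 to roughly 0.17, while English's falls from 0.60 to roughly 0.35, which is exactly the rebalancing effect described above.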