BERT provides representations only for English text. Suppose we have input text in a different language, say French. How can we use BERT to obtain a representation of this French text? This is where we use M-BERT.
Multilingual BERT, referred to hereinafter as M-BERT, is used to obtain representations of text in many different languages, not just English. We learned that the BERT model is trained on the masked language modeling (MLM) and next sentence prediction (NSP) tasks using English Wikipedia text and the Toronto BookCorpus. Similar to BERT, M-BERT is also trained with the MLM and NSP tasks, but instead of using Wikipedia text from English alone, M-BERT is trained using Wikipedia text from 104 different languages.
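To make this concrete, the following is a minimal sketch of obtaining a representation of French text with M-BERT using the Hugging Face transformers library; the bert-base-multilingual-cased checkpoint and the example sentence are illustrative choices, not prescribed by the text:

# Sketch: obtaining a representation of a French sentence with M-BERT
# (assumes the bert-base-multilingual-cased checkpoint for illustration)
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

# A French input sentence
sentence = "Paris est une belle ville"
inputs = tokenizer(sentence, return_tensors='pt')

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Representation of every token in the sentence
hidden_rep = outputs.last_hidden_state
print(hidden_rep.shape)  # [1, sequence_length, 768]

Note that we load M-BERT exactly as we load the English BERT model; only the checkpoint name changes, and the same tokenizer handles the French input.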
But the question is, the size of the Wikipedia text for some languages will be larger than for others, right? Yes! The Wikipedia text for high-resource languages, such as English, will be much larger compared to low-resource languages.