In the previous section, we learned that M-BERT is trained on the Wikipedia text of 104 languages, and we evaluated it by fine-tuning it on the XNLI dataset. But how multilingual is M-BERT? How is a single model able to transfer knowledge across languages? To understand this, in this section, let's investigate the multilingual ability of M-BERT in more detail.
Effect of vocabulary overlap
Recall that M-BERT is trained on the Wikipedia text of 104 languages and uses a shared vocabulary of 110k tokens. Let's now investigate whether M-BERT's multilingual knowledge transfer depends on the vocabulary overlap between languages.
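To make the idea of vocabulary overlap concrete, here is a minimal sketch of one way it could be measured: the fraction of tokens that two texts share, out of all the tokens that appear in either (the Jaccard overlap). This choice of measure is an assumption made for illustration, and the helper names token_overlap and toy_tokenize are hypothetical. The whitespace tokenizer is only a stand-in for M-BERT's actual subword tokenizer, so the number it produces is not a statement about M-BERT itself.

```python
def token_overlap(tokens_a, tokens_b):
    """Return the Jaccard overlap between two token collections.

    This is an illustrative measure (an assumption), computed as
    |A intersect B| / |A union B| over the sets of distinct tokens.
    """
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty inputs share nothing; avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace.

    M-BERT uses a subword tokenizer over its shared vocabulary instead.
    """
    return text.lower().split()


english = toy_tokenize("Paris is the capital of France")
french = toy_tokenize("Paris est la capitale de la France")

# Only "paris" and "france" appear in both sentences.
overlap = token_overlap(english, french)
print(f"Token overlap: {overlap:.2f}")
```

With a real subword tokenizer, two languages that share a script would typically share more tokens than this toy example suggests, while languages written in different scripts would share very few. That contrast is what makes vocabulary overlap a natural first thing to test.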
We learned that M-BERT is good at zero-shot transfer, that is, we can fine-tune M-BERT on one language and then use the fine-tuned model on other languages. Let's say we are performing a named entity recognition (NER) task. Suppose we fine-tune M-BERT for the NER task...