Enhancing your dataset – multilingual, multimodal, and augmentations
Finally, now that you’ve learned how to pick a dataset, compare it with research datasets, determine the right approximate size, and evaluate bias, let’s dive into enhancing the dataset. In particular, we’ll look at three dimensions – multilingual, multimodal, and augmentations. All three typically come a bit later in your ML projects, usually after the first few versions of your models have been trained and you’re looking for the next idea to give you a boost.
Personally, I think there are few applications in the world where multilingual support isn’t a strong added value. Multilingual just means multiple languages. While many of the state-of-the-art language models were originally trained on English-only text, researchers in the last few years have made strong efforts to increase the linguistic diversity of these corpora. That means they’re adding support...