Data imbalance in deep learning
Many classical machine learning problems on tabular data involve only two classes, with the minority class being the one we want to predict. This is not the norm in the domains where deep learning is most often applied, particularly computer vision and NLP.
Even benchmark datasets such as MNIST (grayscale images of the handwritten digits 0 to 9) and CIFAR10 (color images from 10 different classes) have 10 classes to predict. So, we can say that multi-class classification is typical of the problems that deep learning models are applied to.
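Before looking at how imbalance shows up in multi-class datasets like these, it can help to see how per-class counts might be inspected. The following is only a minimal sketch: the `class_distribution` helper and the label lists are hypothetical and built by hand for illustration, not loaded from MNIST or CIFAR10. It counts the examples in each class and reports the ratio between the largest and smallest class.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts (sorted by class) and the largest-to-smallest class ratio.

    Hypothetical helper for illustration; `labels` is any iterable of class ids.
    """
    counts = dict(sorted(Counter(labels).items()))
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical 10-class label lists (assumed data, not a real benchmark):
# every class has 100 examples.
balanced_labels = [c for c in range(10) for _ in range(100)]
# Classes 0-4 keep 100 examples each, classes 5-9 have only 10 each.
skewed_labels = [c for c in range(10) for _ in range(100 if c < 5 else 10)]

balanced_counts, balanced_ratio = class_distribution(balanced_labels)
skewed_counts, skewed_ratio = class_distribution(skewed_labels)

print("balanced:", balanced_counts, "ratio:", balanced_ratio)
print("skewed:  ", skewed_counts, "ratio:", skewed_ratio)
```

A ratio close to 1 suggests the classes are roughly balanced, while a larger ratio indicates that some classes have far fewer examples than others.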
Data skew, or imbalance, across these classes can severely impact model performance. It is worth reviewing what we discussed about the typical kinds of imbalance in datasets in Chapter 1, Introduction to Data Imbalance in Machine Learning. To simulate real-world data imbalance scenarios, two types of imbalance are usually investigated in the literature:
- Step imbalance: All the minority classes have the...