Challenges of Imbalanced Datasets
As the classifier example showed, one of the biggest challenges with imbalanced datasets is the bias toward the majority class, which made up 88% of the data in the previous example. This bias leads to suboptimal results. What makes such cases even more challenging is that the results can be deceptive if the wrong metric is used.
Take, for example, a dataset where the negative class makes up around 99% of the examples and the positive class around 1% (as in a use case where a rare disease has to be detected).
Have a look at the following dataset summary:
Dataset size: 10,000 examples
Negative class: 9,910
Positive class: 90
Suppose we had a poor classifier that was capable only of predicting the negative class; we would get the following confusion matrix:
From the confusion matrix, let's calculate the accuracy measures. Have a look at the following...
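As a rough sketch of that calculation, the snippet below assumes the confusion-matrix counts implied by the dataset summary and an always-negative classifier (9,910 true negatives, 90 false negatives, no predicted positives). The variable names are illustrative. Recall is computed alongside accuracy only to show how different the two metrics look on the same predictions.

```python
# Confusion-matrix counts implied by the dataset summary and an
# always-negative classifier (assumed for illustration).
tn, fp, fn, tp = 9910, 0, 90, 0

total = tn + fp + fn + tp

# Accuracy: share of all examples that were classified correctly.
accuracy = (tp + tn) / total

# Recall: share of actual positives that were detected.
# Guard against division by zero when there are no actual positives.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"Accuracy: {accuracy:.3f}")  # Accuracy: 0.991
print(f"Recall:   {recall:.3f}")    # Recall:   0.000
```

An accuracy of 99.1% looks excellent, yet the classifier does not detect a single positive case, which is why accuracy alone is deceptive on imbalanced data.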