Imbalanced Datasets
Imbalanced datasets are a special case in classification problems in which the class distribution is heavily skewed: one class is overwhelmingly dominant. In other words, the null accuracy of an imbalanced dataset (the accuracy obtained by always predicting the most frequent class) is very high.
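To make the idea of null accuracy concrete, the following is a minimal sketch that computes it as the share of the most frequent class. The function name `null_accuracy` and the label counts are illustrative assumptions and do not come from any particular dataset.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy of a baseline that always predicts the most frequent class."""
    if len(labels) == 0:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # most_common(1) returns a list holding one (label, count) pair
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical labels: 990 examples of class 0 and 10 examples of class 1
labels = [0] * 990 + [1] * 10
print(null_accuracy(labels))  # 0.99
```

A model trained on such labels needs to beat this baseline to be useful, which is why plain accuracy can be misleading on imbalanced data.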
Consider credit card fraud as an example. In a dataset of credit card transactions, only a minuscule fraction of the transactions are fraudulent; the vast majority are normal. If 1 represents a fraudulent transaction and 0 represents a normal transaction, then there will be many 0s and hardly any 1s. The null accuracy of such a dataset may be more than 99%. This means that the majority class (in this case, 0) vastly outnumbers the minority class (in this case, 1). Such datasets are called imbalanced datasets. The following figure shows a scatter plot of a typical imbalanced dataset: