Imbalanced Datasets
Imbalanced datasets are a distinct case in classification problems where the class distribution is heavily skewed: one class is overwhelmingly dominant. In other words, the null accuracy of an imbalanced dataset (the accuracy obtained by always predicting the majority class) is very high. Consider the example of credit card fraud. In a dataset of credit card transactions, only a minuscule fraction of the transactions are fraudulent, and the vast majority are normal. If 1 represents a fraudulent transaction and 0 represents a normal transaction, then there will be many 0s and hardly any 1s, and the null accuracy of the dataset may be more than 99%. This means the majority class (in this case, 0) vastly outnumbers the minority class (in this case, 1). Such sets are imbalanced datasets. The following figure shows a generalized scatter plot of an imbalanced dataset, where the stars represent the minority class and the circles...
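To make the idea of null accuracy concrete, here is a minimal sketch that computes it for a list of class labels. The function name `null_accuracy` and the toy label counts (995 normal and 5 fraudulent transactions) are assumptions chosen for illustration and do not come from a real dataset.

```python
from collections import Counter


def null_accuracy(labels):
    """Return the accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = max(counts.values())
    return majority_count / len(labels)


# Hypothetical toy labels: 0 = normal transaction, 1 = fraudulent transaction.
labels = [0] * 995 + [1] * 5

print(f"Null accuracy: {null_accuracy(labels):.3f}")  # prints "Null accuracy: 0.995"
```

A classifier that labels every transaction as normal would score 99.5% on this toy data while detecting no fraud at all, which is why plain accuracy is a misleading measure on imbalanced datasets.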