Summary
In this chapter, we discussed undersampling, an approach to addressing class imbalance in a dataset by reducing the number of samples in the majority class. We reviewed the advantages of undersampling, such as keeping the data size in check and reducing the chances of overfitting. Undersampling methods can be categorized into fixed methods, which reduce the majority class to a predetermined size, and cleaning methods, which remove majority class samples based on certain criteria, so the resulting size depends on the data.
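As a quick refresher, here is a minimal sketch contrasting the two categories with imbalanced-learn. The synthetic dataset, the 90/10 class split, and the parameter choices are illustrative assumptions, not the chapter's exact setup:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours

# Illustrative imbalanced dataset: roughly 90% majority, 10% minority
X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# Fixed method: shrink the majority class to a target size
# (here, a 1:1 ratio with the minority class)
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print("random undersampling:", Counter(y_rus))

# Cleaning method: drop majority samples whose neighborhood disagrees
# with their label; the final size falls out of the data, not a fixed ratio
enn = EditedNearestNeighbours(n_neighbors=3)
X_enn, y_enn = enn.fit_resample(X, y)
print("ENN cleaning:", Counter(y_enn))
```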
We went over various undersampling techniques, including random undersampling, instance hardness-based undersampling, ClusterCentroids, ENN, Tomek links, NCR, condensed nearest neighbors (CNN), and one-sided selection, as well as combinations of undersampling and oversampling, such as SMOTEENN and SMOTETomek.
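The combined techniques are available as single classes in imbalanced-learn's combine module. A minimal sketch, assuming the same illustrative dataset as above:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=42)

# SMOTE oversamples the minority class, then ENN cleans noisy samples
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)
print("SMOTEENN:", Counter(y_se))

# SMOTE oversamples the minority class, then Tomek-link pairs are removed
X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)
print("SMOTETomek:", Counter(y_st))
```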
We concluded with a performance comparison of various undersampling techniques from the imbalanced-learn library on logistic regression and random forest models, using a few...
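For reference, one common way to run such a comparison is to pair each sampler with each model in an imblearn Pipeline, so that resampling is applied only to the training folds during cross-validation. The sketch below uses illustrative samplers, models, and the F1 metric; it is not the chapter's exact experimental setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=42)

samplers = {"random": RandomUnderSampler(random_state=42), "tomek": TomekLinks()}
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=42),
}

# Score every sampler/model pair; resampling happens inside each CV fold
for s_name, sampler in samplers.items():
    for m_name, model in models.items():
        pipe = Pipeline([("sampler", sampler), ("model", model)])
        score = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
        print(f"{s_name} + {m_name}: F1 = {score:.3f}")
```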