Undersampling an imbalanced dataset
A typical case in machine learning is what we call an imbalanced dataset. An imbalanced dataset simply means that for a given class, some occurrences are much more likely than others, hence the lack of balance. There are plenty of cases of imbalanced datasets: rare diseases in medicine, customer behavior, and more.
In this recipe, we will propose one possible way to handle imbalanced datasets: undersampling. After explaining this process, we will apply it to a credit card fraud detection dataset.
Getting ready
The problem with imbalanced data is that it may bias the results of a machine learning model. Let’s assume we’re undertaking a classification task of detecting rare diseases present in only 1% of a dataset. A common pitfall with such data is to have a model predicting as always healthy as it would still have 99% accuracy. So, it would be very likely for a machine learning model to minimize its losses.
Note
In such...