Oversampling an imbalanced dataset
Another way to deal with an imbalanced dataset is random oversampling, the counterpart of random undersampling. In this recipe, we’ll learn how to apply it to the credit card fraud detection dataset.
Getting ready
Random oversampling can be seen as the opposite of random undersampling: the idea is to duplicate samples of the underrepresented class to rebalance the dataset.
As for the previous recipe, let’s assume a 1%-99% imbalanced dataset that contains the following:
- 100 samples with disease
- 9,900 samples with no disease
To apply oversampling to this dataset using a 1/1 strategy (that is, a perfectly balanced dataset), each sample of the disease class would need to appear 99 times on average: the original plus 98 duplicates. So, the oversampled dataset would need to contain the following:
- 9,900 samples with disease (each of the 100 original samples appearing 99 times on average)
- 9,900 samples with no disease
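To make the arithmetic concrete, here is a minimal NumPy sketch of random oversampling on a toy dataset with the same class counts. The feature matrix X is synthetic and purely hypothetical, and the variable names are illustrative assumptions, not the code used later in this recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels matching the example: 100 with disease (1), 9,900 without (0)
y = np.concatenate([np.ones(100, dtype=int), np.zeros(9900, dtype=int)])
# Hypothetical features: two random columns, only here to have something to resample
X = rng.normal(size=(y.size, 2))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Number of extra minority samples needed to reach a 1/1 ratio
n_extra = majority_idx.size - minority_idx.size

# Draw the extra samples at random, with replacement, from the minority class
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Keep every original sample and append the randomly drawn duplicates
resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_res, y_res = X[resampled_idx], y[resampled_idx]

counts = np.bincount(y_res)
print(counts)  # class 0 count, then class 1 count
```

Because the duplicates are drawn at random, a given disease sample may appear more or fewer than 99 times; only the average is fixed.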
We can easily...