Resampling imbalanced data with SMOTE
Finally, a more complex solution for dealing with imbalanced datasets is a method called SMOTE. After explaining the SMOTE algorithm, we will apply this method to the credit card fraud detection dataset.
Getting ready
SMOTE stands for Synthetic Minority Oversampling TEchnique. As its name suggests, it creates synthetic samples for an underrepresented class. But how exactly does it create synthetic data?
This method uses the k-NN algorithm on the underrepresented class. The SMOTE algorithm can be summarized with the following steps:
- Randomly pick a sample, , in the minority class.
- Using k-NN, randomly pick one of the k-nearest neighbors of in the minority class. Let’s call this sample .
- Compute the new synthetic sample, , with 𝜆 being randomly drawn in the [0, 1] range:
Figure 5.4 – Visual representation of SMOTE
Compared to random oversampling, this method is more...