SMOTE
The main problem with random oversampling is that it duplicates observations from the minority class, which often causes overfitting because the model sees the exact same points repeatedly. The Synthetic Minority Oversampling Technique (SMOTE) [2] avoids this duplication by generating new points through interpolation.
Interpolation involves creating new data points within the range of known data points. Think of interpolation as being similar to the process of reproduction in biology. In reproduction, two individuals come together to produce a new individual with traits of both of them. Similarly, in interpolation, we pick two observations from the dataset and create a new observation by choosing a random point on the line segment joining the two selected points.
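The interpolation step can be sketched in a few lines of NumPy. This is a minimal illustration, not a library routine: the function name `interpolate`, the example points `a` and `b`, and the seed are assumptions chosen for the example.

```python
import numpy as np

def interpolate(a, b, rng):
    """Return a random point on the line segment joining a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # lam is drawn uniformly from [0, 1): lam = 0 returns a itself,
    # and values close to 1 land close to b.
    lam = rng.uniform(0.0, 1.0)
    return a + lam * (b - a)

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
a = [1.0, 2.0]                   # hypothetical first observation
b = [3.0, 6.0]                   # hypothetical second observation
new_point = interpolate(a, b, rng)
print(new_point)
```

Every feature of the new point lies between the corresponding features of the two parents, which is what the reproduction analogy describes.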
SMOTE oversamples the minority class by interpolating synthetic examples between a minority observation and one of its nearest minority-class neighbors. This avoids duplicating minority samples while generating new observations that are similar to the known points. Figure 2.5 depicts how SMOTE works:
Figure 2.5 –...
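To make the procedure concrete, here is a minimal sketch of the idea in plain NumPy. It assumes Euclidean distance and a brute-force neighbor search. The name `smote_sketch`, its parameters, and the toy minority points are hypothetical, and this is a simplified illustration rather than the reference implementation from [2].

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min.

    Simplified sketch: each synthetic point lies on the segment between
    a randomly chosen minority sample and one of its k nearest
    minority-class neighbors.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot have more than n - 1 neighbors

    # Brute-force pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate: a random point on the segment from base to neighbor.
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Hypothetical minority-class observations with two features.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [2.5, 2.5], [3.0, 1.5], [3.5, 3.0]])
synthetic = smote_sketch(X_min, n_new=10, k=3)
print(synthetic.shape)
```

In practice, a maintained implementation such as the `SMOTE` class in `imblearn.over_sampling` (used through `fit_resample`) is usually preferable to a hand-written version like this one.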