Generating Synthetic Samples
In the previous section, we looked at undersampling, where we downsized the majority class to balance the dataset. Undersampling, however, reduces the overall size of the dataset, and in many circumstances this can hurt the predictive power of the classifier. An effective way to counter this is to oversample the minority class instead. Here, oversampling is done by generating new synthetic data points similar to those of the minority class, thereby balancing the dataset without discarding any data.
Two very popular methods for generating such synthetic points are:
- Synthetic Minority Oversampling Technique (SMOTE)
- Modified SMOTE (MSMOTE)
The SMOTE algorithm generates synthetic data by looking at the neighborhood of each minority class sample and generating new data points within that neighborhood.
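To make the neighborhood idea concrete, here is a minimal sketch of SMOTE-style interpolation using only NumPy. It is an illustration under stated assumptions, not the reference implementation or the API of any library: the function name `smote_sketch` and its parameters (`n_new`, `k`, `seed`) are hypothetical, and the toy minority samples are made up for the example. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its `k` nearest minority neighbors.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # a point cannot have more neighbors than other points

    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest

    # Pick a random base sample and one of its neighbors for each new point
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate: base + gap * (neighbor - base), with gap drawn from [0, 1)
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class with four two-dimensional samples (made-up values)
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region already occupied by the minority class. In practice, a maintained library implementation would normally be preferable to a hand-rolled version like this one.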
Let's explain the concept of generating...