When faced with class imbalance in our data, we may want to balance the training data before building a model on it. To do so, we can use one of the following resampling techniques:
- Over-sample the minority class
- Under-sample the majority class
In the case of over-sampling, we draw a larger proportion from the class with fewer observations in order to bring it closer to the size of the majority class; this may involve bootstrapping (sampling with replacement), or generating synthetic data similar to the existing values (using machine learning algorithms such as nearest neighbors, as SMOTE does). Under-sampling, on the other hand, reduces the overall amount of data by taking fewer observations from the majority class. The decision to use over-sampling or under-sampling will depend on the amount of data we started with, and in some cases, computational...
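As a minimal sketch of the two random strategies, the helpers below balance a list of labeled rows using only the standard library; the function names and the `label_key` parameter are illustrative, not from any particular library (in practice, a package such as imbalanced-learn provides ready-made samplers):

```python
import random


def _group_by_class(rows, label_key):
    """Bucket rows by their class label."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    return by_class


def random_over_sample(rows, label_key="label", seed=0):
    """Bootstrap each smaller class up to the majority-class count."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_key)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # sample with replacement (bootstrapping) to make up the shortfall
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def random_under_sample(rows, label_key="label", seed=0):
    """Down-sample each larger class to the minority-class count."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_key)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        # sample without replacement, discarding the excess rows
        balanced.extend(rng.sample(members, k=target))
    return balanced


# 10 minority rows vs. 90 majority rows
data = [{"label": "fraud"} for _ in range(10)] + [
    {"label": "ok"} for _ in range(90)
]
over = random_over_sample(data)    # 90 of each class (180 rows)
under = random_under_sample(data)  # 10 of each class (20 rows)
```

Note that over-sampling repeats minority rows verbatim, while under-sampling throws away majority rows; neither creates new information, which is why synthetic approaches like SMOTE exist.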