Addressing class imbalance
When our data has a class imbalance, we may want to balance the training data before building a model on it. To do this, we can use one of the following sampling techniques:
- Over-sample the minority class.
- Under-sample the majority class.
In the case of over-sampling, we draw more observations from the minority class so that its count gets closer to that of the majority class; this may involve a technique such as bootstrapping (sampling with replacement) or generating new data similar to the existing minority-class values (using machine learning algorithms such as nearest neighbors). Under-sampling, on the other hand, uses less data overall by reducing the amount taken from the majority class. The decision to use over-sampling or under-sampling will depend on the amount of data we started with and, in some cases, computational costs. In practice, we wouldn't try either of these without first trying to build the model...
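To make the two techniques described above concrete, here is a minimal sketch using only the Python standard library. The helper names (`group_by_class`, `over_sample`, `under_sample`) and the toy 90/10 dataset are assumptions made for illustration; they are not from any particular library. Over-sampling is done here by bootstrapping only, not by generating synthetic points with nearest neighbors.

```python
import random
from collections import Counter


def group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def over_sample(rows, labels, seed=0):
    """Bootstrap every smaller class up to the size of the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    new_rows, new_labels = [], []
    for label, members in groups.items():
        # Keep all original rows, then draw the shortfall WITH replacement.
        extras = rng.choices(members, k=target - len(members))
        new_rows.extend(members + extras)
        new_labels.extend([label] * target)
    return new_rows, new_labels


def under_sample(rows, labels, seed=0):
    """Randomly cut every larger class down to the size of the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    new_rows, new_labels = [], []
    for label, members in groups.items():
        # Draw WITHOUT replacement, so no row is repeated.
        kept = rng.sample(members, k=target)
        new_rows.extend(kept)
        new_labels.extend([label] * target)
    return new_rows, new_labels


# Hypothetical toy training data: 90 majority rows and 10 minority rows.
rows = list(range(100))
labels = ["majority"] * 90 + ["minority"] * 10

over_rows, over_labels = over_sample(rows, labels)
under_rows, under_labels = under_sample(rows, labels)

print(Counter(over_labels))   # 90 rows of each class (180 total)
print(Counter(under_labels))  # 10 rows of each class (20 total)
```

Note the trade-off this exposes: over-sampling keeps every original row but repeats minority rows, while under-sampling discards 80 of the 90 majority rows. Either way, only the training data would be resampled in this fashion.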