What is oversampling?
Sampling involves selecting a subset of observations from a larger set of observations. In this chapter, we’ll initially focus on binary classification problems with two classes: the positive class and the negative class. In an imbalanced dataset, one of these, the minority class, has significantly fewer instances than the other, the majority class. Toward the end of this chapter, we will look into oversampling for multi-class classification problems.
Oversampling is a data balancing technique that generates more samples of the minority class. Although we describe it here for two classes, it extends naturally to any number of classes that are imbalanced. Figure 2.1 shows how samples of minority and majority classes are imbalanced (a) initially and balanced (b) after applying an oversampling technique:
Figure 2.1 – An increase in the number of minority class samples after oversampling
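To make the idea concrete, here is a minimal sketch of one of the simplest ways to oversample: duplicating randomly chosen minority class rows until every class is as large as the biggest one. The function name `random_oversample` and the toy data are hypothetical and exist only for illustration; this is not a specific library's API, and other oversampling techniques generate new samples differently.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Hypothetical helper: duplicate random rows of each smaller class
    (sampling with replacement) until it matches the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    # Start with every original row, then append the extra duplicates.
    keep = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))

    all_idx = np.concatenate(keep)
    return X[all_idx], y[all_idx]

# Toy imbalanced dataset (assumed for illustration): 8 negatives, 2 positives.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_res, y_res = random_oversample(X, y)
print(np.bincount(y), np.bincount(y_res))  # [8 2] [8 8]
```

Because the loop runs over every class smaller than the largest one, the same sketch also works when more than two classes are imbalanced.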
...