Bagging techniques for imbalanced data
Imagine a business executive with thousands of confidential files regarding an important merger or acquisition. The analysts assigned to the case don't have enough time to review every file, so each one randomly selects a subset of the files and reviews it. Later, the analysts meet to combine their insights and draw conclusions.
This scenario is a metaphor for a process in machine learning called bagging [1], short for bootstrap aggregating. In bagging, much like the analysts in the scenario, we draw several random subsets of the original dataset by sampling with replacement (the bootstrap step), train a weak learner on each subset, and then aggregate their predictions (the aggregating step).
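As a rough sketch of the idea, the example below uses scikit-learn's BaggingClassifier on a synthetic imbalanced dataset (the dataset, the `max_depth=2` trees, and all parameter values are illustrative assumptions, not prescriptions from this chapter; the `estimator` parameter name assumes scikit-learn 1.2 or later). Shallow decision trees act as the weak learners, each trained on its own bootstrap sample, and their predictions are aggregated by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative imbalanced dataset: roughly 90% majority / 10% minority class.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Bagging: each shallow tree (a weak learner) is trained on a bootstrap
# sample of the training set; their votes are aggregated at prediction time.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),  # the weak learner
    n_estimators=50,   # number of bootstrap subsets / weak learners
    bootstrap=True,    # sample each subset with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print(f"Test accuracy: {bagging.score(X_test, y_test):.3f}")
```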
Why use weak learners instead of strong learners? The rationale applies to both bagging and boosting methods (discussed later in this chapter). There are several reasons:
- Speed: Weak learners are computationally cheap to train and evaluate.
- Diversity: Weak learners trained on different random subsets of the data tend to make different errors, so aggregating their predictions cancels out many individual mistakes and reduces variance.