Chapter 2 – Oversampling Methods
- This is left as an exercise for you.
- One approach is to oversample the minority class by 20x so that both classes are roughly balanced. Note that a perfect 1:1 balance is not always necessary; a slight residual imbalance may be acceptable, depending on the specific requirements and constraints. Importantly, oversampling is applied only to the training data, never at test time, since the test data should remain representative of the class distribution we expect to encounter in the real world (see the first sketch after this list).
- The primary concern with oversampling before splitting the data into training, validation, and test sets is data leakage. This occurs when duplicated samples end up in both the training set and the test/validation sets, leading to overly optimistic performance metrics: the model appears to do well during evaluation because it has already seen the same examples during training, yet it may generalize poorly to new, unseen data. To mitigate this risk, it is crucial to first split the data and only then oversample the training set (see the second sketch after this list).
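
Concretely, here is a minimal sketch of how oversampling the minority class on the training data only might look. The helper name `oversample_minority`, the 20x `factor` default, and the assumption of non-negative integer class labels are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.utils import resample

def oversample_minority(X_train, y_train, factor=20, random_state=0):
    """Duplicate minority-class rows `factor` times; leave the majority class untouched."""
    # Assumes class labels are non-negative integers (e.g., 0 and 1).
    minority_label = np.argmin(np.bincount(y_train))
    minority_mask = y_train == minority_label

    # Sample the minority rows with replacement until they are `factor` times larger.
    X_minority_upsampled, y_minority_upsampled = resample(
        X_train[minority_mask],
        y_train[minority_mask],
        replace=True,
        n_samples=int(minority_mask.sum()) * factor,
        random_state=random_state,
    )

    # Stack the untouched majority rows with the upsampled minority rows.
    X_balanced = np.vstack([X_train[~minority_mask], X_minority_upsampled])
    y_balanced = np.concatenate([y_train[~minority_mask], y_minority_upsampled])
    return X_balanced, y_balanced
```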
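
And here is a sketch of the leakage-safe ordering described above: split first, then oversample only the training split. The synthetic dataset, the split parameters, and the reuse of the `oversample_minority` helper from the previous sketch are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data purely for illustration (roughly a 20:1 class ratio).
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)

# 1) Split first, so the test set keeps the real-world class distribution
#    and no duplicated row can leak into it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2) Oversample only the training split (reusing the helper sketched above).
X_train_bal, y_train_bal = oversample_minority(X_train, y_train)

# 3) Fit on the balanced training data; evaluate on the untouched test set, e.g.:
# model.fit(X_train_bal, y_train_bal)
# model.score(X_test, y_test)
```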