Introduction
In the previous chapter, Chapter 12, Feature Engineering, we worked with data points related to dates and addressed scenarios pertaining to features. In this chapter, we will deal with scenarios where the proportions of examples in the overall dataset pose challenges.
Let's revisit the dataset we dealt with in Chapter 3, Binary Classification, in which the examples labeled 'No' for term deposits far outnumbered those labeled 'Yes', at a ratio of 88% to 12%. We also determined that one reason for the suboptimal results of the logistic regression model on that dataset was this skewed proportion of examples. Datasets like this one are called imbalanced datasets, and they are very common in real-world use cases.
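Before modeling, it can help to confirm how skewed the classes are. The following is a minimal sketch of such a check, not the code used in Chapter 3: the helper `class_proportions` and the list `toy_labels` are hypothetical names, and the labels are synthetic values constructed only to mirror the 88% to 12% split described above.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class label's share of the total number of examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy labels that imitate the 88% / 12% split;
# this is synthetic data, not the dataset from Chapter 3.
toy_labels = ["no"] * 88 + ["yes"] * 12

proportions = class_proportions(toy_labels)
for label, share in proportions.items():
    print(f"{label}: {share:.0%}")
```

With a real dataset, the same function could be applied to the target column after converting it to a list. A large gap between the shares is the signal that the dataset is imbalanced.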
Some of the use cases where we encounter imbalanced datasets include the following:
- Fraud detection for credit cards or insurance claims
- Medical diagnoses where we...