Dealing with an imbalanced dataset
When building a logistic regression model on a dataset with a binary target, the two classes are often not equally represented. Typically, we observe far more non-events (y = 0) than events (y = 1), as is often the case in applications such as detecting fraudulent bank transactions, filtering spam or phishing emails sent to corporate employees, diagnosing diseases such as cancer, and predicting natural disasters such as earthquakes. In these situations, the classification performance may be dominated by the majority class.
Such domination can produce misleadingly high accuracy scores that mask poor predictive performance on the minority class. To see this, suppose we are developing a default prediction model using a dataset that consists of 1,000 observations, where only 10 (or 1%) of them are default cases. A naive model that simply predicts every observation as non-default would achieve 99% accuracy, yet it would fail to identify a single default case.
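The arithmetic behind the 99% figure can be checked with a minimal sketch. The labels below are synthetic and constructed only to match the 1,000-observation example; the variable names are illustrative assumptions. Recall (the share of actual defaults that the model catches) is computed alongside accuracy as one possible way to expose the problem.

```python
# Synthetic labels mirroring the example: 1,000 observations, 10 defaults.
# (Illustrative data only; not taken from a real dataset.)
n_total = 1000
n_default = 10
y_true = [1] * n_default + [0] * (n_total - n_default)

# Naive model: predict non-default (0) for every observation.
y_pred = [0] * n_total

# Accuracy: share of observations predicted correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall: share of actual defaults that were predicted as default.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_default

print(f"accuracy = {accuracy:.2%}")
print(f"recall   = {recall:.2%}")
```

Accuracy is high only because 990 of the 1,000 observations belong to the majority class. Recall is zero because none of the 10 defaults is ever predicted.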
When we encounter an imbalanced...