Machine Learning for Imbalanced Data

You're reading from Machine Learning for Imbalanced Data: Tackle imbalanced datasets using machine learning and deep learning techniques

Product type: Paperback
Published in: Nov 2023
Publisher: Packt
ISBN-13: 9781801070836
Length: 344 pages
Edition: 1st Edition
Authors (2): Dr. Mounir Abdelaziz, Kumar Abhishek
Table of Contents

  • Preface
  • Chapter 1: Introduction to Data Imbalance in Machine Learning
  • Chapter 2: Oversampling Methods
  • Chapter 3: Undersampling Methods
  • Chapter 4: Ensemble Methods
  • Chapter 5: Cost-Sensitive Learning
  • Chapter 6: Data Imbalance in Deep Learning
  • Chapter 7: Data-Level Deep Learning Methods
  • Chapter 8: Algorithm-Level Deep Learning Techniques
  • Chapter 9: Hybrid Deep Learning Methods
  • Chapter 10: Model Calibration
  • Assessments
  • Index
  • Other Books You May Enjoy
  • Appendix: Machine Learning Pipeline in Production

Why can imbalanced data be a challenge?

Let’s delve into the difficulties posed by imbalanced data on model predictions and their impact on model performance:

  • Failure of metrics such as accuracy: As we discussed previously, conventional metrics such as accuracy can be misleading in the context of imbalanced data (on a dataset where 99% of the examples belong to the majority class, a model that always predicts the majority class would still achieve 99% accuracy). Threshold-invariant metrics such as the PR curve or ROC curve attempt to expose the performance of the model over a wide range of thresholds. The real challenge lies in the disproportionate influence of the “true negative” cell in the confusion matrix. Metrics that focus less on “true negatives,” such as precision, recall, or F1 score, are more appropriate for evaluating model performance. It’s important to note that these metrics have a hidden hyperparameter, the classification threshold, that should not be ignored but optimized for real-world applications (refer to Chapter 5, Cost-Sensitive Learning, to learn more about threshold tuning).
  • Imbalanced data can be a challenge for a model’s loss function: The loss function is typically designed to minimize the errors between the predicted outputs and the true labels of the training data. When the data is imbalanced, there are more instances of one class than another, so the majority class contributes most of the total loss and the model may become biased toward it. We will discuss solutions to this issue in more detail in Chapter 5, Cost-Sensitive Learning, and Chapter 8, Algorithm-Level Deep Learning Techniques.
  • Different misclassification costs for different classes: Often, it is more expensive to misclassify positive examples than to misclassify negative examples; in other words, false negatives may be more expensive than false positives. For example, the cost of misclassifying a patient with cancer as healthy (false negative) will usually be much higher than that of misclassifying a healthy patient as having cancer (false positive). Why? Because it’s much cheaper to run some extra tests to revalidate the results in the second case than to detect the disease much later in the first case. This is called the cost of misclassification, and it can differ between the majority and minority classes, which complicates things for imbalanced datasets. We will discuss more about this in Chapter 5, Cost-Sensitive Learning.
  • Constraints on computational resources: In sectors such as finance, healthcare, and retail, handling big data is a common challenge. Training on these large datasets is not only time-consuming but also costly due to the computational power needed. In such scenarios, downsampling or undersampling the majority class becomes essential, as will be discussed in Chapter 3, Undersampling Methods. Additionally, acquiring more samples for the minority class can further increase dataset size and computational costs. Memory limitations may also restrict the amount of data that can be processed.
  • Not enough variation in the minority class examples to sufficiently represent its distribution: Often, the absolute number of minority class samples is less of a problem than the lack of variation among them. The dataset might look large, but the minority class samples might not vary enough to adequately represent the distribution of the minority class. This can prevent the model from learning the classification boundary properly, which leads to poor performance (Figure 1.10). This often happens in computer vision problems, such as object detection, where we may have very few samples of certain classes. In such cases, data augmentation techniques (discussed in Chapter 7, Data-Level Deep Learning Methods) can help significantly:

Figure 1.10 – Change in decision boundary with a different distribution of minority class examples – the crosses denote the majority class, and the circles denote the minority class

  • Poor performance of uncalibrated models: Imbalanced data can be a challenge for uncalibrated models, that is, models whose predicted probabilities do not reflect the true likelihood of the predicted classes:
    • When dealing with imbalanced data, uncalibrated models can be particularly susceptible to producing biased predictions toward the majority class as they may not be able to effectively differentiate between the minority and majority classes. This can lead to poor performance in the minority class, where the model may produce overly confident predictions or predictions that are too conservative.
    • For example, an uncalibrated model that is trained on imbalanced data may incorrectly classify instances that belong to the minority class as majority class examples, often with high confidence. This is because the model may not have learned to adjust its predictions based on the imbalance in the data and may not have a good understanding of the minority class examples.
    • To address this challenge, it is important to use well-calibrated models [4] that can output probabilities that reflect the true likelihood of the predicted classes. This can be achieved through techniques such as Platt scaling or isotonic regression, which can calibrate the predicted probabilities of an uncalibrated model to produce more accurate and reliable probabilities. Model calibration will be discussed in detail in Chapter 10, Model Calibration.
  • Poor performance of models because of non-adjusted thresholds: It’s important to use intelligent thresholding when making predictions using models trained on imbalanced datasets. Simply predicting 1 when the model probability is over 0.5 may not always be the best approach. Instead, we should consider other thresholds that may be more effective. This can be achieved by examining the PR curve of the model rather than relying solely on its success rate with a default probability threshold of 0.5. Threshold adjustment can be quite important, even for models trained on naturally or artificially balanced datasets. We will discuss threshold adjustment in detail in Chapter 5, Cost-Sensitive Learning.
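
The first and last points above can be illustrated with a minimal sketch. It is not the book's own code: the dataset is synthetic (100 positives and 9,900 negatives), the model scores are hypothetical values drawn from two beta distributions, and the helper names (`confusion_counts`, `precision_recall_f1`, `f1_at`) are made up for illustration. It uses only the Python standard library, and it picks the threshold by maximizing F1 over a simple grid, which is just one of several reasonable selection criteria.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1; none of them uses the true-negative count."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Synthetic labels (an assumption for illustration): 1% positive class.
y_true = [1] * 100 + [0] * 9_900

# A "model" that always predicts the majority class.
baseline_pred = [0] * len(y_true)
baseline_accuracy = accuracy(y_true, baseline_pred)  # 0.99 by construction
_, baseline_recall, _ = precision_recall_f1(y_true, baseline_pred)

# Hypothetical model scores: positives tend to score higher than negatives,
# but many positives still fall below the default 0.5 cut-off.
rng = random.Random(0)
scores = [
    rng.betavariate(3, 4) if t == 1 else rng.betavariate(1, 8)
    for t in y_true
]


def f1_at(threshold):
    """F1 score when predicting 1 for every score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return precision_recall_f1(y_true, y_pred)[2]


f1_default = f1_at(0.5)

# Scan a grid of candidate thresholds (the grid includes 0.5).
thresholds = [i / 100 for i in range(1, 100)]
best_threshold = max(thresholds, key=f1_at)
best_f1 = f1_at(best_threshold)

print(f"Majority-class baseline: accuracy={baseline_accuracy:.2f}, "
      f"recall={baseline_recall:.2f}")
print(f"F1 at threshold 0.50: {f1_default:.3f}")
print(f"Best F1 on the grid: {best_f1:.3f} at threshold {best_threshold:.2f}")
```

The baseline reaches 99% accuracy while finding none of the positives, which is why recall exposes what accuracy hides. Because the grid contains 0.5, the selected threshold can never score worse on F1 than the default. In practice, the threshold would be tuned on a held-out validation set rather than on the data used for the final evaluation.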

Next, let’s try to see when we shouldn’t do anything about data imbalance.
