Machine Learning for Imbalanced Data

Machine Learning for Imbalanced Data: Tackle imbalanced datasets using machine learning and deep learning techniques

By Kumar Abhishek and Dr. Mounir Abdelaziz
5.0 (17 Ratings)
Paperback | Nov 2023 | 344 pages | 1st Edition
eBook: $27.98 (discounted from $39.99)
Paperback: $49.99
Subscription: free trial, renews at $19.99 p/m

What do you get with a Packt Subscription?

Free for the first 7 days. $19.99 p/m after that. Cancel any time!

  • Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
  • 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
  • Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
  • Thousands of reference materials covering every tech concept you need to stay up to date.

Machine Learning for Imbalanced Data

Oversampling Methods

In machine learning, we often don’t have enough samples of the minority class. One solution might be to gather more samples of that class. For example, when detecting whether a patient has cancer, if we don’t have enough samples of the cancer class, we could simply wait to collect more. However, such a strategy is not always feasible or sensible, and it can be time-consuming. In such cases, we can augment our data using various techniques, one of which is oversampling.

In this chapter, we will introduce the concept of oversampling, discuss when to use it, and cover the various techniques for performing it. We will also demonstrate how to utilize these techniques through the imbalanced-learn library APIs and compare their performance using some classical machine learning models. Finally, we will conclude with practical advice on which techniques tend to work best under specific real-world conditions.

In this...

Technical requirements

In this chapter, we will utilize common libraries such as numpy, scikit-learn, and imbalanced-learn. The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/master/chapter02. You can fire up the GitHub notebook in Google Colab by clicking the Open in Colab icon at the top of this chapter’s notebook, or launch it from https://colab.research.google.com using the notebook’s GitHub URL.
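
If you prefer to run the notebooks locally rather than on Colab, the dependencies can be installed with pip; this one-liner is a sketch, and the book may use different versions:

    pip install numpy scikit-learn imbalanced-learn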

What is oversampling?

Sampling involves selecting a subset of observations from a larger set of observations. In this chapter, we’ll initially focus on binary classification problems with two classes: the positive class and the negative class. The minority class has significantly fewer instances than the majority class. Toward the end of this chapter, we will look into oversampling for multi-class classification problems.

Oversampling is a data balancing technique that generates more samples of the minority class. While we describe it here for the binary case, it scales readily to problems where several classes are imbalanced. Figure 2.1 shows how the minority and majority classes are imbalanced (a) initially and balanced (b) after applying an oversampling technique:

Figure 2.1 – An increase in the number of minority class samples after oversampling
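
To make the setup concrete, here is a minimal sketch (not from the book) that builds a synthetic imbalanced binary dataset with scikit-learn; the 9:1 class ratio is an illustrative choice:

    from collections import Counter
    from sklearn.datasets import make_classification

    # Binary dataset where class 1 is the minority (~10% of the samples)
    X, y = make_classification(
        n_samples=1000, n_features=4, n_informative=2,
        weights=[0.9, 0.1], random_state=42,
    )
    print(Counter(y))  # roughly Counter({0: 900, 1: 100})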

...

Random oversampling

The simplest strategy for balancing an imbalanced dataset is to randomly choose samples of the minority class and repeat or duplicate them. This is also called random oversampling with replacement.

To increase the number of minority class observations, we can replicate them enough times to balance the two classes. Does this sound too trivial? Yes, but it works. By increasing the number of minority class samples, random oversampling reduces the bias toward the majority class, which helps the model learn the patterns and characteristics of the minority class more effectively.

We will use the random oversampling implementation from the imbalanced-learn library. The fit_resample API of the RandomOverSampler class resamples the original dataset and balances it. The sampling_strategy parameter specifies the desired ratio between classes; for example, setting sampling_strategy=1.0 yields an equal number of samples in the two classes.
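
Here is a minimal usage sketch of that API (the dataset construction is illustrative):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # sampling_strategy=1.0 requests a 1:1 minority-to-majority ratio
    ros = RandomOverSampler(sampling_strategy=1.0, random_state=42)
    X_res, y_res = ros.fit_resample(X, y)
    print(Counter(y_res))  # both classes now have the same count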

There are...

SMOTE

The main problem with random oversampling is that it duplicates the observations from the minority class. This can often cause overfitting. Synthetic Minority Oversampling Technique (SMOTE) [2] solves this problem of duplication by using a technique called interpolation.

Interpolation involves creating new data points in the range of known data points. Think of interpolation as being similar to the process of reproduction in biology. In reproduction, two individuals come together to produce a new individual with traits of both of them. Similarly, in interpolation, we pick two observations from the dataset and create a new observation by choosing a random point on the line joining the two selected points.

We oversample the minority class by interpolating synthetic examples. That prevents the duplication of minority samples while generating new synthetic observations similar to the known points. Figure 2.5 depicts how SMOTE works:

Figure 2.5 –...
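
In code, the interpolation the figure depicts can be exercised through the imbalanced-learn SMOTE class; this is a sketch with illustrative parameter values, not the book’s exact listing:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # For each synthetic point, SMOTE picks a minority sample x and one of its
    # k nearest minority neighbors x_nn, then emits x + lam * (x_nn - x),
    # with lam drawn uniformly from [0, 1]
    smote = SMOTE(k_neighbors=5, random_state=42)
    X_res, y_res = smote.fit_resample(X, y)
    print(Counter(y_res))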

SMOTE variants

Now, let’s look at some SMOTE variants, such as Borderline-SMOTE, SMOTE-NC, and SMOTEN. Each variant applies the SMOTE algorithm to a particular kind of sample or feature, so not every variant is applicable to every dataset.

Borderline-SMOTE

Borderline-SMOTE [4] is a variation of SMOTE that generates synthetic samples from the minority class samples that are near the classification boundary, which divides the majority class from the minority class.

Why consider samples on the classification boundary?

The idea is that examples near the classification boundary are more prone to misclassification than those far away from it. Producing more minority samples along the boundary helps the model learn the minority class better. Intuitively, points far from the classification boundary are unlikely to make the model a better classifier.

Here’s a step-by-step algorithm for Borderline-SMOTE:

  1. We run a...
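
However the remaining steps unfold, the imbalanced-learn implementation can be used directly; here is a minimal sketch with illustrative parameter values, where m_neighbors controls the borderline test and k_neighbors the SMOTE-style interpolation:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import BorderlineSMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # kind="borderline-1" interpolates only between minority samples;
    # "borderline-2" may also interpolate toward majority neighbors
    bsmote = BorderlineSMOTE(kind="borderline-1", k_neighbors=5, m_neighbors=10,
                             random_state=42)
    X_res, y_res = bsmote.fit_resample(X, y)
    print(Counter(y_res))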

ADASYN

While SMOTE doesn’t take the density distribution of minority class samples into account, Adaptive Synthetic Sampling (ADASYN) [6] focuses on the harder-to-classify minority class samples, which tend to lie in low-density areas. ADASYN uses a weighted distribution over the minority class based on how difficult each observation is to classify, so more synthetic data is generated from the harder samples:

Figure 2.11 – Illustration of how ADASYN works

Here, we can see the following:

  • a) The majority and minority class samples are plotted
  • b) Synthetic samples are generated depending on the hardness factor (explained later)

While SMOTE uses all minority class samples uniformly for oversampling, ADASYN uses the observations that are harder to classify more often.

Another difference between the two techniques is that, unlike SMOTE, ADASYN also uses the majority class observations when fitting the k-nearest neighbors (KNN) model. It then...
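
Usage-wise, ADASYN follows the same fit_resample pattern as the other samplers; a minimal sketch with illustrative parameters:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import ADASYN

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # n_neighbors sets the neighborhood used to score how hard each minority
    # sample is: the more majority-class neighbors, the more synthetic points
    adasyn = ADASYN(n_neighbors=5, random_state=42)
    X_res, y_res = adasyn.fit_resample(X, y)
    print(Counter(y_res))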

Model performance comparison of various oversampling methods

Let’s examine how some popular models perform with the different oversampling techniques we’ve discussed. We’ll use two datasets for this comparison: one synthetic and one real-world dataset. We’ll evaluate the performance of four oversampling techniques, as well as no sampling, using logistic regression and random forest models.

You can find all the related code in this book’s GitHub repository. In Figure 2.15 and Figure 2.16, we can see the average precision score values for both models on the two datasets:

Figure 2.15 – Performance comparison of various oversampling techniques on a synthetic dataset

Figure 2.16 – Performance comparison of various oversampling techniques on the thyroid_sick dataset
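
A comparison of this shape can be sketched as follows; this is an illustrative harness, not the book’s exact experiment. Note that resampling is applied only to the training split, and each model is scored with average precision on an untouched test split:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import ADASYN, SMOTE, BorderlineSMOTE, RandomOverSampler

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

    samplers = {
        "none": None,
        "random": RandomOverSampler(random_state=42),
        "smote": SMOTE(random_state=42),
        "borderline": BorderlineSMOTE(random_state=42),
        "adasyn": ADASYN(random_state=42),
    }
    for name, sampler in samplers.items():
        if sampler is None:
            X_bal, y_bal = X_tr, y_tr  # baseline: no resampling
        else:
            X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
        for model in (LogisticRegression(max_iter=1000),
                      RandomForestClassifier(random_state=42)):
            model.fit(X_bal, y_bal)
            ap = average_precision_score(y_te, model.predict_proba(X_te)[:, 1])
            print(f"{name:>10}  {type(model).__name__:<22}  AP={ap:.3f}")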

Based on these plots, we can draw some useful conclusions:

  • Effectiveness of oversampling: In general, using...

Guidance for using various oversampling techniques

Now, let’s review some guidelines for navigating the various oversampling techniques we have covered, and how these techniques differ from each other:

  1. Train a model without applying any sampling techniques. This will be our model with baseline performance. Any oversampling technique we apply is expected to give a boost to this performance.
  2. Start with random oversampling, and add some shrinkage too. We may have to try a few values of shrinkage to see whether the model’s performance improves (see the sketch after this list).
  3. When we have categorical features, we have a couple of options:
    1. Convert all categorical features into numerical features first using one-hot encoding, label encoding, feature hashing, or other feature transformation techniques.
    2. (Only for nominal categorical features) Use SMOTENC and SMOTEN directly on the data.
  4. Apply various oversampling techniques – random oversampling, SMOTE, Borderline-SMOTE, and...
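
As referenced in step 2, here is a minimal sketch of random oversampling with shrinkage (a smoothed bootstrap): with shrinkage > 0, the duplicated points are jittered with noise rather than repeated exactly. The value 0.2 is purely illustrative:

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # With shrinkage set, new samples are noisy copies drawn around the
    # original minority points rather than exact duplicates
    ros = RandomOverSampler(shrinkage=0.2, random_state=42)
    X_res, y_res = ros.fit_resample(X, y)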

Oversampling in multi-class classification

In multi-class classification problems, we have more than two classes or labels to predict, and hence more than one class may be imbalanced. This adds some complexity to the problem, but we can apply the same techniques to multi-class classification problems as well. The imbalanced-learn library supports multi-class classification in almost all of its methods, and we can choose among various sampling strategies through the sampling_strategy parameter. In the SMOTE API, this parameter accepts some fixed string values (called built-in strategies). We can also pass a dictionary with the following:

  • Keys as the class labels
  • Values as the number of samples of that class

Here are the built-in strategies for sampling_strategy when using the parameter as a string:

  • The minority strategy resamples only the minority class.
  • The not...
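
Here is a minimal sketch of both options; the three-class dataset and the per-class counts in the dictionary are illustrative choices:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Three classes with roughly an 80/15/5 split
    X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                               weights=[0.80, 0.15, 0.05], random_state=42)

    # Built-in string strategy: oversample every class except the majority one
    X_a, y_a = SMOTE(sampling_strategy="not majority",
                     random_state=42).fit_resample(X, y)

    # Dictionary strategy: keys are class labels, values are target sample counts
    X_b, y_b = SMOTE(sampling_strategy={1: 1000, 2: 800},
                     random_state=42).fit_resample(X, y)
    print(Counter(y_a), Counter(y_b))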

Summary

In this chapter, we went through various oversampling techniques for dealing with imbalanced datasets and applied them using Python’s imbalanced-learn library (also called imblearn). We also saw the internal workings of some of the techniques by implementing them from scratch. While random oversampling generates new minority class samples by duplicating existing ones, SMOTE-based techniques generate synthetic samples along the line segments toward the nearest neighbors of minority class samples. Though oversampling can cause the model to overfit your data, it usually has more pros than cons, depending on the data and model.

We applied these techniques to synthetic and publicly available datasets and benchmarked their performance and effectiveness. We saw how different oversampling techniques can lead to widely varying model performance, so it is crucial to try a few of them and pick the one that works best for our data.

...

Exercises

  1. Explore two SMOTE variants not discussed in this chapter, KMeans-SMOTE and SVM-SMOTE, from the imbalanced-learn library. Compare their performance with vanilla SMOTE, Borderline-SMOTE, and ADASYN using the logistic regression and random forest models.
  2. For a classification problem with two classes, let’s say the minority class to majority class ratio is 1:20. How should we balance this dataset? Should we apply the balancing technique at test or evaluation time? Please provide a reason for your answer.
  3. Let’s say we are trying to build a model that can estimate whether a person should be granted a bank loan. Out of the 5,000 observations we have, only 500 people got their loan approved. To balance the dataset, we duplicate the data of the approved applicants and then split it into train, test, and validation datasets. Are there any issues with this approach?
  4. Data normalization helps in dealing with data imbalance. Is this true? Why...

References

  1. Protecting Personal Data in Grab’s Imagery (2021), https://engineering.grab.com/protecting-personal-data-in-grabs-imagery.
  2. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, Jun. 2002, doi: 10.1613/jair.953.
  3. Live Site Incident escalation forecast (2023), https://medium.com/data-science-at-microsoft/live-site-incident-escalation-forecast-566763a2178.
  4. H. Han, W.-Y. Wang, and B.-H. Mao, Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, in Advances in Intelligent Computing, Lecture Notes in Computer Science, vol. 3644, Springer, Berlin, Heidelberg, 2005, pp. 878–887, doi: 10.1007/11538059_91.
  5. P. Meiyappan and M. Bales, Position Paper: Reducing Amazon’s packaging waste using multimodal deep learning, (2021), article: https://www.amazon.science...

Key benefits

  • Understand how to use modern machine learning frameworks with detailed explanations, illustrations, and code samples
  • Learn cutting-edge deep learning techniques to overcome data imbalance
  • Explore different methods for dealing with skewed data in ML and DL applications
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

As machine learning practitioners, we often encounter imbalanced datasets in which one class has considerably fewer instances than the other. Many machine learning algorithms assume an equilibrium between majority and minority classes, leading to suboptimal performance on imbalanced data. This comprehensive guide helps you address this class imbalance to significantly improve model performance. Machine Learning for Imbalanced Data begins by introducing you to the challenges posed by imbalanced datasets and the importance of addressing these issues. It then guides you through techniques that enhance the performance of classical machine learning models when using imbalanced data, including various sampling and cost-sensitive learning methods. As you progress, you’ll delve into similar and more advanced techniques for deep learning models, employing PyTorch as the primary framework. Throughout the book, hands-on examples will provide working and reproducible code that’ll demonstrate the practical implementation of each technique. By the end of this book, you’ll be adept at identifying and addressing class imbalances and confidently applying various techniques, including sampling, cost-sensitive techniques, and threshold adjustment, while using traditional machine learning or deep learning models.

Who is this book for?

This book is for machine learning practitioners who want to effectively address the challenges of imbalanced datasets in their projects. Data scientists, machine learning engineers, and research scientists will find this book helpful. Though complete beginners are welcome to read this book, some familiarity with core machine learning concepts will help readers maximize the benefits and insights gained from this comprehensive resource.

What you will learn

  • Use imbalanced data in your machine learning models effectively
  • Explore the metrics used when classes are imbalanced
  • Understand how and when to apply various sampling methods such as over-sampling and under-sampling
  • Apply data-based, algorithm-based, and hybrid approaches to deal with class imbalance
  • Combine and choose from various options for data balancing while avoiding common pitfalls
  • Understand the concepts of model calibration and threshold adjustment in the context of dealing with imbalanced datasets

Product Details

Publication date : Nov 30, 2023
Length : 344 pages
Edition : 1st
Language : English
ISBN-13 : 9781801070836



Frequently bought together

Causal Inference and Discovery in Python – $39.99
Machine Learning for Imbalanced Data – $49.99
Python Deep Learning – $49.99
Total: $139.97

Table of Contents

13 Chapters
Chapter 1: Introduction to Data Imbalance in Machine Learning
Chapter 2: Oversampling Methods
Chapter 3: Undersampling Methods
Chapter 4: Ensemble Methods
Chapter 5: Cost-Sensitive Learning
Chapter 6: Data Imbalance in Deep Learning
Chapter 7: Data-Level Deep Learning Methods
Chapter 8: Algorithm-Level Deep Learning Techniques
Chapter 9: Hybrid Deep Learning Methods
Chapter 10: Model Calibration
Assessments
Index
Other Books You May Enjoy

Customer reviews

Rating distribution: 5.0 (17 Ratings)
5 star: 100% | 4 star: 0% | 3 star: 0% | 2 star: 0% | 1 star: 0%
Ranja – Feb 06, 2024 – 5 stars
This book on tackling real imbalanced datasets in machine learning is a detailed and comprehensive guide. The chapters ‘cost-sensitive learning’ and ‘model calibration’ deserve special mention, and they blend in well with the other chapters on over-sampling, under-sampling, and ensemble techniques for handling data imbalance. While some essential concepts have in-depth explanations, and rightfully so, the authors have managed to keep the book intriguing throughout, which makes it a prized resource for all machine learning practitioners.
Amazon Verified review

Advitya Gemawat – Jan 07, 2024 – 5 stars
The book covers various methods to address the class imbalance problem, and covers usage with popular Python libraries and typical evaluation metrics from the lens of class imbalance. Here are some of my top takeaways from the book:
🎲 Sampling methods, such as over-sampling, under-sampling, and hybrid sampling, to balance the data distribution
📊 Cost-sensitive learning, which assigns different weights or costs to different classes, to make the model more sensitive to the minority class
📈 Threshold adjustment, which modifies the decision threshold of the model, to improve the performance metrics
🗂 Model calibration, which adjusts the predicted probabilities of the model, to make them more reliable and interpretable
🚀 My favorite part of the book: how several big tech companies are solving data imbalance challenges in different contexts
🗃 There's a Python library, imbalanced-learn, that offers out-of-the-box techniques to deal with data imbalance and can also be used to create corresponding synthetic datasets
Having read several books from Packt, it's so interesting to go through these books as they deal with very specific subtopics within ML and provide an entire landscape of practical techniques, real-world use cases, and top takeaways for practitioners based on research findings.
Amazon Verified review

H2N – Dec 14, 2023 – 5 stars
Machine Learning for Imbalanced Data is a helpful guide to dealing with imbalanced data in machine learning. The authors discuss various strategies and best practices for addressing the complexity of data imbalance, underscoring the importance of context. The book covers many techniques, from oversampling methods to deep learning approaches, with real-world applications. A nice book for anyone learning and working in machine learning.
Amazon Verified review

Ashish Tiwari – Dec 14, 2023 – 5 stars
"Machine Learning for Imbalanced Data" is an insightful, 300+ page journey into the complexities of machine learning, especially tailored for those with some prior experience. It's a well-crafted guide that demystifies topics like oversampling, undersampling, deep learning techniques, and model calibration in rich detail. The book excels in blending theoretical concepts with practical Python code examples, making it a valuable reference for real-world applications. Its approachable style, coupled with comprehensive content, makes it an indispensable resource for anyone looking to master the intricacies of machine learning in the context of imbalanced data.
Amazon Verified review

Snigdha – Dec 31, 2023 – 5 stars
This book provided a great overview, in a concise and clear format, of dealing with imbalanced datasets and what techniques to use. The text contains helpful examples and insights from the authors' industry experience. I enjoyed the cartoon strips added to chapters for easy understanding. The Colab notebooks provided in the GitHub repo offer the coding practice needed to apply the theory in the book. I would recommend this to anyone learning more about machine learning, as most datasets in real life are imbalanced.
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use toward owning content.

How can I cancel my subscription?

To cancel your subscription, simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription. From here, you will see the ‘cancel subscription’ button in the grey box containing your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage - subscription.packtpub.com - by clicking on the ‘My Library’ dropdown and selecting ‘Credits’.

What happens if an Early Access Course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need a paid or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access, you'll receive each chapter as it's written and get regular updates throughout the product's development, as well as the final course as soon as it's ready. We created Early Access as a means of giving you the information you need as soon as it's available. As we go through the process of developing a course, 99% of it can be ready, but we can't publish until that last 1% falls into place. Early Access helps unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.