Machine Learning for Imbalanced Data

Machine Learning for Imbalanced Data: Tackle imbalanced datasets using machine learning and deep learning techniques

By Kumar Abhishek and Dr. Mounir Abdelaziz
eBook, Nov 2023, 344 pages, 1st Edition

Oversampling Methods

In machine learning, we often don’t have enough samples of the minority class. One solution might be to gather more samples of such a class. For example, in the problem of detecting whether a patient has cancer or not, if we don’t have enough samples of the cancer class, we can wait for some time to gather more samples. However, such a strategy is not always feasible or sensible and can be time-consuming. In such cases, we can augment our data by using various techniques. One such technique is oversampling.

In this chapter, we will introduce the concept of oversampling, discuss when to use it, and cover the various techniques for performing it. We will also demonstrate how to use these techniques through the imbalanced-learn library APIs and compare their performance using some classical machine learning models. Finally, we will conclude with some practical advice on which techniques tend to work best under specific real-world conditions.

In this...

Technical requirements

In this chapter, we will use common libraries such as numpy, scikit-learn, and imbalanced-learn. The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/master/chapter02. You can open the notebook in Google Colab by clicking on the Open in Colab icon at the top of this chapter’s notebook or by launching it from https://colab.research.google.com using the GitHub URL of the notebook.

What is oversampling?

Sampling involves selecting a subset of observations from a larger set of observations. In this chapter, we’ll initially focus on binary classification problems with two classes: the positive class and the negative class. The minority class has significantly fewer instances than the majority class. Toward the end of this chapter, we will look into oversampling for multi-class classification problems.

Oversampling is a data balancing technique that generates more samples of the minority class. The same idea extends to any number of classes when more than one class is imbalanced. Figure 2.1 shows how samples of minority and majority classes are imbalanced (a) initially and balanced (b) after applying an oversampling technique:

Figure 2.1 – An increase in the number of minority class samples after oversampling

...

Random oversampling

The simplest strategy to balance an imbalanced dataset is to randomly choose samples of the minority class and repeat or duplicate them. This is also called random oversampling with replacement.

To increase the number of minority class observations, we can replicate the minority class data observations enough times to balance the two classes. Does this sound too trivial? Yes, but it works. By increasing the number of minority class samples, random oversampling reduces the bias toward the majority class. This helps the model learn the patterns and characteristics of the minority class more effectively.

We will use random oversampling from the imbalanced-learn library. The fit_resample API from the RandomOverSampler class resamples the original dataset and balances it. The sampling_strategy parameter is used to specify the new ratio of various classes. For example, we could say sampling_strategy=1.0 to have an equal number of the two classes.

There are...

SMOTE

The main problem with random oversampling is that it duplicates the observations from the minority class. This can often cause overfitting. Synthetic Minority Oversampling Technique (SMOTE) [2] solves this problem of duplication by using a technique called interpolation.

Interpolation involves creating new data points in the range of known data points. Think of interpolation as being similar to the process of reproduction in biology. In reproduction, two individuals come together to produce a new individual with traits of both of them. Similarly, in interpolation, we pick two observations from the dataset and create a new observation by choosing a random point on the line joining the two selected points.
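The interpolation step can be sketched in a few lines of numpy. This is a simplified illustration rather than the library's implementation: the minority samples are random placeholder data, and the helper name smote_like_sample is hypothetical. It assumes the second point is chosen among the k nearest minority neighbors of the first, which is how SMOTE restricts the pairs it interpolates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # hypothetical minority class samples


def smote_like_sample(X_min, k, rng):
    """Create one synthetic point between a minority sample and a neighbor."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    i = rng.integers(len(X_min))
    # The query point is its own nearest neighbor, so skip the first index
    neighbor_idx = nn.kneighbors(X_min[i : i + 1], return_distance=False)[0, 1:]
    j = rng.choice(neighbor_idx)
    lam = rng.uniform(0.0, 1.0)  # random position along the joining line
    return X_min[i] + lam * (X_min[j] - X_min[i]), i, j


new_point, i, j = smote_like_sample(X_min, k=5, rng=rng)
print("parents:", X_min[i], X_min[j])
print("synthetic point:", new_point)
```

The synthetic point always lies on the line segment joining its two parents, so it is new data rather than an exact copy of an existing observation.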

We oversample the minority class by interpolating synthetic examples. That prevents the duplication of minority samples while generating new synthetic observations similar to the known points. Figure 2.5 depicts how SMOTE works:

Figure 2.5 –...

SMOTE variants

Now, let’s look at some of the SMOTE variants, such as Borderline-SMOTE, SMOTE-NC, and SMOTEN. These variants apply the SMOTE algorithm to samples of a certain kind and may not always be applicable.

Borderline-SMOTE

Borderline-SMOTE [4] is a variation of SMOTE that generates synthetic samples from the minority class samples that are near the classification boundary, which divides the majority class from the minority class.

Why consider samples on the classification boundary?

The idea is that the examples near the classification boundary are more prone to misclassification than those far away from it. Producing more such minority samples along the boundary would help the model learn better about the minority class. Intuitively, points far from the classification boundary are unlikely to make the model a better classifier.

Here’s a step-by-step algorithm for Borderline-SMOTE:

  1. We run a...

ADASYN

While SMOTE treats all minority class samples alike regardless of their local density, Adaptive Synthetic Sampling (ADASYN) [6] focuses on the minority class samples that are harder to classify because they lie in low-density areas. ADASYN uses a weighted distribution of the minority class based on the difficulty of classifying the observations. This way, more synthetic data is generated from harder samples:

Figure 2.11 – Illustration of how ADASYN works

Here, we can see the following:

  • a) The majority and minority class samples are plotted
  • b) Synthetic samples are generated depending on the hardness factor (explained later)

While SMOTE uses all samples from the minority class for oversampling uniformly, in ADASYN, the observations that are harder to classify are used more often.

Another difference between the two techniques is that, unlike SMOTE, ADASYN also uses the majority class observations while training KNN. It then...

Model performance comparison of various oversampling methods

Let’s examine how some popular models perform with the different oversampling techniques we’ve discussed. We’ll use two datasets for this comparison: one synthetic and one real-world dataset. We’ll evaluate the performance of four oversampling techniques, as well as no sampling, using logistic regression and random forest models.

You can find all the related code in this book’s GitHub repository. In Figure 2.15 and Figure 2.16, we can see the average precision score values for both models on the two datasets:

Figure 2.15 – Performance comparison of various oversampling techniques on a synthetic dataset

Figure 2.16 – Performance comparison of various oversampling techniques on the thyroid_sick dataset

Based on these plots, we can draw some useful conclusions:

  • Effectiveness of oversampling: In general, using...

Guidance for using various oversampling techniques

Now, let’s review some guidelines on how to navigate through the various oversampling techniques we went over and how these techniques differ from each other:

  1. Train a model without applying any sampling techniques. This will be our model with baseline performance. Any oversampling technique we apply is expected to give a boost to this performance.
  2. Start with random oversampling and add some shrinkage too. We may have to play with some values of shrinkage to see if the model’s performance improves.
  3. When we have categorical features, we have a couple of options:
    1. Convert all categorical features into numerical features first using one-hot encoding, label encoding, feature hashing, or other feature transformation techniques.
    2. (Only for nominal categorical features) Use SMOTENC and SMOTEN directly on the data.
  4. Apply various oversampling techniques – random oversampling, SMOTE, Borderline-SMOTE, and...

Oversampling in multi-class classification

In multi-class classification problems, we have more than two classes or labels to be predicted, and hence more than one class may be imbalanced. This adds some more complexity to the problem. However, we can apply the same techniques to multi-class classification problems as well. The imbalanced-learn library provides the option to deal with multi-class classification in almost all the supported methods. We can choose from various sampling strategies using the sampling_strategy parameter. For multi-class classification, we can pass some fixed string values (called built-in strategies) to the sampling_strategy parameter in the SMOTE API. We can also pass a dictionary with the following:

  • Keys as the class labels
  • Values as the number of samples of that class

Here are the built-in strategies for sampling_strategy when using the parameter as a string:

  • The minority strategy resamples only the minority class.
  • The not...

Summary

In this chapter, we went through various oversampling techniques for dealing with imbalanced datasets and applied them using Python’s imbalanced-learn library (also called imblearn). We also saw the internal workings of some of the techniques by implementing them from scratch. While random oversampling generates new minority class samples by duplicating them, SMOTE-based techniques work by choosing random samples in the direction of nearest neighbors of the minority class samples. Though oversampling can potentially overfit the model on your data, it usually has more pros than cons, depending on the data and model.

We applied them to some synthesized and publicly available datasets and benchmarked their performance and effectiveness. We saw that different oversampling techniques can change model performance by varying amounts, so it is crucial to try a few different oversampling techniques and pick the one that works best for our data.

...

Exercises

  1. Explore the two variants of SMOTE, namely KMeans-SMOTE and SVM-SMOTE, from the imbalanced-learn library, not discussed in this chapter. Compare their performance with vanilla SMOTE, Borderline-SMOTE, and ADASYN using the logistic regression and random forest models.
  2. For a classification problem with two classes, let’s say the minority class to majority class ratio is 1:20. How should we balance this dataset? Should we apply the balancing technique at test or evaluation time? Please provide a reason for your answer.
  3. Let’s say we are trying to build a model that can estimate whether a person can be granted a bank loan or not. Out of the 5,000 observations we have, only 500 people got the loan approved. To balance the dataset, we duplicate the approved people data and then split it into train, test, and validation datasets. Are there any issues with using this approach?
  4. Data normalization helps in dealing with data imbalance. Is this true? Why...

References

  1. Protecting Personal Data in Grab’s Imagery (2021), https://engineering.grab.com/protecting-personal-data-in-grabs-imagery.
  2. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, Jun. 2002, doi: 10.1613/jair.953.
  3. Live Site Incident escalation forecast (2023), https://medium.com/data-science-at-microsoft/live-site-incident-escalation-forecast-566763a2178.
  4. H. Han, W.-Y. Wang, and B.-H. Mao, Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, in Advances in Intelligent Computing, D.-S. Huang, X.-P. Zhang, and G.-B. Huang, Eds., in Lecture Notes in Computer Science, vol. 3644. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 878–887. doi: 10.1007/11538059_91.
  5. P. Meiyappan and M. Bales, Position Paper: Reducing Amazon’s packaging waste using multimodal deep learning, (2021), article: https://www.amazon.science...

Key benefits

  • Understand how to use modern machine learning frameworks with detailed explanations, illustrations, and code samples
  • Learn cutting-edge deep learning techniques to overcome data imbalance
  • Explore different methods for dealing with skewed data in ML and DL applications
  • Purchase of the print or Kindle book includes a free eBook in the PDF format

Description

As machine learning practitioners, we often encounter imbalanced datasets in which one class has considerably fewer instances than the other. Many machine learning algorithms assume an equilibrium between majority and minority classes, leading to suboptimal performance on imbalanced data. This comprehensive guide helps you address this class imbalance to significantly improve model performance. Machine Learning for Imbalanced Data begins by introducing you to the challenges posed by imbalanced datasets and the importance of addressing these issues. It then guides you through techniques that enhance the performance of classical machine learning models when using imbalanced data, including various sampling and cost-sensitive learning methods. As you progress, you’ll delve into similar and more advanced techniques for deep learning models, employing PyTorch as the primary framework. Throughout the book, hands-on examples will provide working and reproducible code that’ll demonstrate the practical implementation of each technique. By the end of this book, you’ll be adept at identifying and addressing class imbalances and confidently applying various techniques, including sampling, cost-sensitive techniques, and threshold adjustment, while using traditional machine learning or deep learning models.

Who is this book for?

This book is for machine learning practitioners who want to effectively address the challenges of imbalanced datasets in their projects. Data scientists, machine learning engineers/scientists, research scientists/engineers, and data engineers will find this book helpful. Though complete beginners are welcome to read this book, some familiarity with core machine learning concepts will help readers maximize the benefits and insights gained from this comprehensive resource.

What you will learn

  • Use imbalanced data in your machine learning models effectively
  • Explore the metrics used when classes are imbalanced
  • Understand how and when to apply various sampling methods such as over-sampling and under-sampling
  • Apply data-based, algorithm-based, and hybrid approaches to deal with class imbalance
  • Combine and choose from various options for data balancing while avoiding common pitfalls
  • Understand the concepts of model calibration and threshold adjustment in the context of dealing with imbalanced datasets

Product Details

Publication date: Nov 30, 2023
Length: 344 pages
Edition: 1st
Language: English
ISBN-13: 9781801070881


Table of Contents

13 Chapters
Chapter 1: Introduction to Data Imbalance in Machine Learning
Chapter 2: Oversampling Methods
Chapter 3: Undersampling Methods
Chapter 4: Ensemble Methods
Chapter 5: Cost-Sensitive Learning
Chapter 6: Data Imbalance in Deep Learning
Chapter 7: Data-Level Deep Learning Methods
Chapter 8: Algorithm-Level Deep Learning Techniques
Chapter 9: Hybrid Deep Learning Methods
Chapter 10: Model Calibration
Assessments
Index
Other Books You May Enjoy

Customer reviews

Rating distribution: 5 out of 5 (17 ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Ranja Feb 06, 2024
Rating: 5 out of 5
This book on tackling real imbalanced datasets in machine learning is a detailed and comprehensive guide. The chapters ‘cost-sensitive learning’ and ‘model calibration’ require special mention, which were blended in well with other chapters on over-sampling, under-sampling and ensemble techniques for handling data imbalance. While some essential concepts have in-depth explanations and rightfully so, the authors have managed well to keep the book intriguing all along which makes it a prized resource for all machine learning practitioners.
Amazon verified review
Advitya Gemawat Jan 07, 2024
Rating: 5 out of 5
The book covers various methods to address the class imbalance problem, and covers usage with popular python libraries and typical evaluation metrics from the lens of class imbalance. Here're some of my top takeaways from the book:
  • 🎲 Sampling methods, such as over-sampling, under-sampling, and hybrid sampling, to balance the data distribution
  • 📊 Cost-sensitive learning, which assigns different weights or costs to different classes, to make the model more sensitive to the minority class
  • 📈 Threshold adjustment, which modifies the decision threshold of the model, to improve the performance metrics
  • 🗂 Model calibration, which adjusts the predicted probabilities of the model, to make them more reliable and interpretable
  • 🚀 My favorite part of the book: How several big tech companies are solving data imbalance challenges in different contexts
  • 🗃 There's a python library `imbalanced-learn` that offers out-of-the-box techniques to deal with data imbalance and can also be used to create corresponding synthetic datasets
Having read several books from Packt, it's so interesting to go through these books as they deal with very specific subtopics within ML and provide an entire landscape of practical techniques, real-world use-cases, and top takeaways for practitioners based on research findings.
Amazon verified review
H2N Dec 14, 2023
Rating: 5 out of 5
Machine Learning for Imbalanced Data is a helpful guide to deal with imbalanced data in machine learning. The authors talked about various strategies and best practices to address the complexity of data imbalance, underscoring the importance of context. Lots of techniques were covered in the book such as oversampling methods to deep learning approaches with real-world applications. A nice book for anyone to learn and work in machine learning.
Amazon verified review
Ashish Tiwari Dec 14, 2023
Rating: 5 out of 5
"Machine Learning for Imbalanced Data" is an insightful, 300+ page journey into the complexities of machine learning, especially tailored for those with some prior experience. It's a well-crafted guide that demystifies topics like oversampling, undersampling, deep learning techniques, and model calibration with rich details. The book excels in blending theoretical concepts with practical Python code examples, making it a valuable reference for real-world applications. Its approachable style, coupled with comprehensive content, makes it an indispensable resource for anyone looking to master the intricacies of machine learning in the context of imbalanced data.
Amazon verified review
Snigdha Dec 31, 2023
Rating: 5 out of 5
This book provided a great overview in a concise and clear format of dealing with imbalanced datasets and what techniques to use. The text contains helpful examples and insights from the author's industry experience. I enjoyed the cartoon strips added in chapters for easy understanding. The collab notebooks provided in the GitHub repo provide the coding practice needed to utilize the theory in the book. I would recommend this to anyone learning more about machine learning as most of the datasets in real life are imbalanced.
Amazon verified review
