You're reading from Machine Learning for Imbalanced Data Tackle imbalanced datasets using machine learning and deep learning techniques

Product type Paperback

Published in Nov 2023

Publisher Packt

ISBN-13 9781801070836

Length 344 pages

Edition 1st Edition

Languages

Rust

Tools

TensorFlow Lite

Concepts

Data Science

Authors (2):

Dr. Mounir Abdelaziz

Kumar Abhishek


Table of Contents (15) Chapters

Preface

1. Chapter 1: Introduction to Data Imbalance in Machine Learning

2. Chapter 2: Oversampling Methods

3. Chapter 3: Undersampling Methods

4. Chapter 4: Ensemble Methods

5. Chapter 5: Cost-Sensitive Learning

6. Chapter 6: Data Imbalance in Deep Learning

7. Chapter 7: Data-Level Deep Learning Methods

8. Chapter 8: Algorithm-Level Deep Learning Techniques

9. Chapter 9: Hybrid Deep Learning Methods

10. Chapter 10: Model Calibration

11. Assessments

12. Index

Why subscribe?

13. Other Books You May Enjoy

Appendix: Machine Learning Pipeline in Production

Introduction to imbalanced datasets

Machine learning algorithms learn from collections of examples that we call datasets. Each dataset contains multiple data points, which we refer to interchangeably as examples, samples, or instances throughout this book.

A dataset can be said to have a balanced distribution when all the target classes have a similar number of examples, as shown in Figure 1.1:

Figure 1.1 – Balanced distribution with an almost equal number of examples for each class

Imbalanced (or skewed) datasets are those in which some target classes (also called labels) outnumber the rest (Figure 1.2). Although imbalance is most often discussed for classification problems (for example, fraud detection), it also occurs in regression problems (for example, house price prediction):

Figure 1.2 – An imbalanced dataset with five classes and a varying number of samples
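To make the idea of a balanced versus an imbalanced distribution concrete, here is a minimal sketch that counts the examples per class and computes a simple imbalance ratio. The helper names (`class_distribution`, `imbalance_ratio`) and the toy labels are assumptions made up for illustration; they do not come from the book or from any library.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (count, fraction of dataset)} for a list of target labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total) for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Largest class count divided by the smallest (1.0 means perfectly balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels, invented for this example only.
balanced = ["cat"] * 50 + ["dog"] * 50
imbalanced = ["legit"] * 990 + ["fraud"] * 10

print(class_distribution(imbalanced))  # {'legit': (990, 0.99), 'fraud': (10, 0.01)}
print(imbalance_ratio(balanced))       # 1.0
print(imbalance_ratio(imbalanced))     # 99.0
```

A ratio close to 1 corresponds to the balanced case in Figure 1.1, while a large ratio indicates the kind of skew shown in Figure 1.2. The same functions work unchanged for more than two classes.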

We call the class with more instances the “majority” class and the one with fewer instances the “minority” class. Because our main interest usually lies in the minority class, we often refer to it as the “positive” class and to the majority class as the “negative” class:

Figure 1.3 – A visual guide to common terminology used in imbalanced classification
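The terminology above can be sketched in code as follows. This is only an illustration under the assumption of a binary problem: the function name `to_positive_negative` and the toy labels are hypothetical, and encoding the minority class as 1 is a common convention rather than a requirement.

```python
from collections import Counter


def to_positive_negative(labels):
    """Encode the minority class as 1 ("positive") and the majority class as 0 ("negative")."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch only handles binary labels")
    # The minority class is the label with the fewest examples.
    minority = min(counts, key=counts.get)
    return minority, [1 if y == minority else 0 for y in labels]


# Hypothetical toy labels, invented for this example only.
labels = ["legit"] * 8 + ["fraud"] * 2
minority, encoded = to_positive_negative(labels)

print(minority)  # fraud
print(encoded)   # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```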

This terminology extends to more than two classes; such problems are called multi-class classification. In the first half of this book, we focus only on binary classification to keep the material easier to grasp. Extending the concepts to multi-class classification is relatively easy.

Let’s look at a few examples of imbalanced datasets:

Fraud detection: fraudulent transactions must be detected among a much larger number of legitimate ones. This problem is common in the finance, healthcare, and e-commerce industries.
Network intrusion detection: machine learning is used to analyze large volumes of network traffic data to detect and prevent unauthorized access and misuse of computer systems.
Cancer detection: cancer is not rare, but we may still want to use machine learning to analyze medical data, identify potential cases of cancer earlier, and improve treatment outcomes.

In this book, we focus on the class imbalance problem in general and look at various solutions for cases where class imbalance hurts the performance of our model. A typical problem is that a model performs poorly on the minority classes, for which it has seen very few examples during training.
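One way to see why poor minority-class performance can go unnoticed is to look at an extreme case: a "model" that has learned nothing except the majority label. The sketch below is a deliberately simplified illustration, not a real training procedure; the function names and the 95/5 toy split are assumptions made up for this example.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]

    def predict(n_samples):
        return [majority] * n_samples

    return predict


def accuracy(y_true, y_pred):
    """Fraction of all examples that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def minority_recall(y_true, y_pred, positive=1):
    """Fraction of positive (minority) examples that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("y_true contains no positive examples")
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical toy data: 95 negative (majority) and 5 positive (minority) examples.
y_true = [0] * 95 + [1] * 5
predict = fit_majority_baseline(y_true)
y_pred = predict(len(y_true))

print(accuracy(y_true, y_pred))         # 0.95
print(minority_recall(y_true, y_pred))  # 0.0
```

The overall accuracy looks high even though not a single minority example is detected, which is why the minority class needs to be examined separately when the data is imbalanced.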
