Machine learning 101

Let’s do a quick overview of machine learning and its related fields:

  • Artificial intelligence is the superset of all intelligence-related problems. Classical machine learning encompasses problems that can be solved by training classical models (such as decision trees or logistic regression) to predict target values. These models typically work on tabular data, require extensive feature engineering (the manual development of features), and are less effective on text and image data. Deep learning tends to do far better on image, text, speech, and video data; typically, no manual feature engineering is needed because the layers of the neural network learn the features automatically.
  • In supervised learning, we have both inputs and outputs (labels) in the dataset, and the model learns to predict the output during training. Each input can be represented as a list of features. The output or labels can be a finite set of classes (classification), a real number (regression), or something more complex. A classic example of supervised classification is Iris flower classification. In this case, the dataset includes features such as petal length, petal width, sepal length, and sepal width, and the labels are the species of the Iris flowers (setosa, versicolor, or virginica). A model can be trained on this dataset and then used to classify new, unseen Iris flowers as one of these species (a minimal code sketch follows this list).
  • In unsupervised learning, models either don’t have access to labels or don’t use them, and instead try to find structure in the data – for example, by clustering the examples in the dataset into different groups.
  • In reinforcement learning, the model learns by trial and error, adjusting its behavior to maximize a reward. An example would be training a model to play chess and adjusting its strategy based on feedback received through rewards and penalties.
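
As a minimal, illustrative sketch of the supervised classification workflow (the train/test split and the choice of a decision tree are assumptions for demonstration, not prescribed by the text), the Iris example can be reproduced with scikit-learn as follows:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset: X holds the four flower measurements,
# y holds the species label (0 = setosa, 1 = versicolor, 2 = virginica).
X, y = load_iris(return_X_y=True)

# Hold out part of the data to check how the model handles unseen flowers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train a simple classifier on the labeled examples.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Classify new, unseen flowers and measure accuracy.
y_pred = clf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.2f}")
```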

In supervised learning (which is the focus of this book), there are two main types of problems: classification and regression. Classification problems involve categorizing data into predefined classes or labels, such as “fraud” or “non-fraud” and “spam” or “non-spam.” On the other hand, regression problems aim to predict a continuous variable, such as the price of a house.

While data imbalance can also affect regression problems, this book will concentrate solely on classification problems. This focus is due to several factors, such as the limited scope of this book and the well-established techniques available for classification. In some cases, you might even be able to reframe a regression problem as a classification problem, making the methods discussed in this book still relevant.

When it comes to models that are popular for classification problems, there are quite a few categories of classical supervised machine learning algorithms:

  • Logistic regression: This is a supervised machine learning algorithm that’s used for binary classification problems. It predicts the probability of a binary target variable based on a set of predictor variables (features) by fitting a logistic function to the data, which outputs a value between 0 and 1.
  • Support Vector Machines (SVMs): These are supervised machine learning algorithms that are mainly used for classification and can be extended to regression problems. SVMs classify data by finding the optimal hyperplane that maximally separates the different classes in the input data, thus making it a powerful tool for binary and multiclass classification tasks.
  • K-Nearest Neighbors (KNN): This is a supervised machine learning algorithm that’s used for classification and regression analysis. It predicts the target variable based on the k-nearest neighbors in the training dataset. The value of k determines the number of neighbors to consider when making a prediction, and it can be tuned to optimize the model’s performance.
  • Tree models: These are a type of supervised machine learning algorithm that’s used for classification and regression analysis. They recursively split the data into smaller subsets based on the most important features to create a decision tree that predicts the target variable based on the input features.
  • Ensemble models: These combine multiple individual models to improve predictive accuracy and reduce overfitting (explained later in this chapter). Ensemble techniques include bagging (for example, random forest), boosting (for example, XGBoost), and stacking. They are commonly used for classification as well as regression analysis.
  • Neural networks: These models are inspired by the human brain, consist of multiple layers with numerous neurons in each, and are capable of learning complex functions. We will discuss these in more detail in Chapter 6, Data Imbalance in Deep Learning.

Figure 1.4 displays the decision boundaries of various classifiers we have reviewed so far. It shows that logistic regression has a linear decision boundary, while tree-based models such as decision trees, random forests, and XGBoost work by dividing examples into axis-parallel rectangles to form their decision boundary. SVM, on the other hand, transforms the data to a different space so that it can plot its non-linear decision boundary. Neural networks have a non-linear decision boundary:

Figure 1.4 – The decision boundaries of popular machine learning algorithms on an imbalanced dataset
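
If you want to experiment with these models yourself, the following sketch fits one representative classifier from each family above on a small synthetic imbalanced dataset; the 90%/10% class weights and all hyperparameters are illustrative assumptions rather than settings from the book:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-feature dataset with a 90%/10% class imbalance (illustrative).
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# One representative model from each family discussed above.
models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

With only two features, each fitted model's decision boundary can also be visualized (for example, with scikit-learn's DecisionBoundaryDisplay) to produce plots similar to Figure 1.4.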

Next, we’ll delve into the principles underlying the process of model training.

What happens during model training?

In the training phase of a machine learning model, we provide a dataset consisting of examples, each with input features and a corresponding label, to the model. Let $X$ represent the list of features used for training, and $y$ be the list of labels in the training dataset. The goal of the model is to learn a function, $f$, such that $f(X) \approx y$.

The model has adjustable parameters, denoted as $\theta$, which are fine-tuned during the training process. The error function, commonly referred to as the loss function, is defined as $L(f(X; \theta), y)$. This error function needs to be minimized by a learning algorithm, which finds the optimal setting of these parameters, $\theta$.
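
To make this concrete, here is a minimal NumPy sketch (not taken from the book) of what a learning algorithm does: it repeatedly adjusts the parameters $\theta$ (here, a weight $w$ and a bias $b$) in the direction that reduces the loss. The toy data, the squared-error loss, the learning rate, and the number of steps are all illustrative assumptions.

```python
import numpy as np

# Toy data: y is roughly 2*x + 1, so a good model should recover w ≈ 2, b ≈ 1.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0   # theta: the adjustable parameters
lr = 0.02         # learning rate (illustrative)

for step in range(2000):
    pred = w * X + b              # f(X; theta)
    error = pred - y
    # Gradient descent: nudge theta in the direction that lowers
    # L(f(X; theta), y), here the mean squared error.
    w -= lr * np.mean(2 * error * X)
    b -= lr * np.mean(2 * error)

final_loss = np.mean((w * X + b - y) ** 2)
print(f"Learned w={w:.2f}, b={b:.2f}, final loss={final_loss:.4f}")
```

The same loop structure applies to classification; only the model $f$ and the loss function change.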

In classification problems, a typical loss function is the cross-entropy loss (also called the log loss):

$$\mathrm{CrossEntropyLoss}(p) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$

Here, $p$ is the model’s predicted probability that $y = 1$.

When the model’s prediction closely agrees with the target label, the loss function will approach zero. However, when the prediction deviates significantly from the target, the loss can become arbitrarily large, indicating a poor model fit.
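
As a quick numeric illustration of this behavior (an assumed example, not from the book), the following snippet evaluates the cross-entropy loss for a positive example ($y = 1$) at a few predicted probabilities:

```python
import numpy as np

def cross_entropy(p, y):
    """Cross-entropy loss for a single prediction p against the true label y."""
    return -np.log(p) if y == 1 else -np.log(1 - p)

print(cross_entropy(0.99, y=1))  # confident and correct: ~0.01 (near zero)
print(cross_entropy(0.60, y=1))  # unsure: ~0.51 (moderate loss)
print(cross_entropy(0.01, y=1))  # confident but wrong: ~4.61 (loss blows up)
```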

As training progresses, the training loss keeps going down (Figure 1.5):

Figure 1.5 – Rate of change of the loss function as training progresses

This brings us to the concept of the fit of a model:

  • A model is said to underfit if it is too simple and can’t capture the data’s complexity. It performs poorly on both training and new data.
  • A model has the right fit if it accurately captures the data’s patterns without learning noise. It performs well on both training and new data.
  • An overfit model is too complex and learns noise along with data patterns. It performs well on training data but poorly on new data (Figure 1.6):

Figure 1.6 – Underfit, right fit, and overfit models for a classification task
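
One quick way to see this contrast in practice is to vary a model’s capacity and compare training accuracy against test accuracy. The sketch below (an illustrative assumption, not the book’s code) uses the maximum depth of a decision tree as the capacity knob: a depth of 1 tends to underfit, while an unbounded depth tends to overfit noisy data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 10% label noise so that overfitting is visible.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# depth=1 tends to underfit, depth=None (unbounded) tends to overfit.
for depth in [1, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```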

Next, let’s briefly look at two important concepts in machine learning:

  • Regularization is a set of techniques used to prevent a model from overfitting to the training data. One type of regularization (namely, L1 or L2) adds a penalty term to the loss function, which encourages the model to have smaller weights and reduces its complexity. This helps prevent the model from fitting too closely to the training data and helps it generalize better to unseen data.
  • Feature engineering is the process of selecting and transforming the input features of a model to improve its performance. It involves selecting the most relevant features for the problem, transforming them to make them more informative, and creating new features from the existing ones. Good feature engineering can make a huge difference in the performance of a model and can often matter more than the choice of algorithm or hyperparameters (a brief sketch of both ideas follows this list).
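
Here is a minimal sketch of both ideas in scikit-learn; the dataset, the specific C values (in scikit-learn, C is the inverse of the L2 penalty strength), and the choice of polynomial features are illustrative assumptions rather than recommendations from the book.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Regularization: smaller C means a stronger L2 penalty on the weights,
# so C=0.01 constrains the model far more than C=100.
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    print(f"C={C}: test accuracy = {clf.score(X_test, y_test):.2f}")

# Feature engineering: derive new features (here, squares and pairwise
# interactions of the originals) before fitting the model.
engineered = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
engineered.fit(X_train, y_train)
print(f"With engineered features: test accuracy = "
      f"{engineered.score(X_test, y_test):.2f}")
```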