Machine learning 101
Let’s do a quick overview of machine learning and its related fields:
- Artificial intelligence is the superset of all intelligence-related problems. Classical machine learning encompasses problems that can be solved by training traditional models (such as decision trees or logistic regression) to predict target values. These models typically work on tabular data, require extensive feature engineering (the manual development of features), and are less effective on text and image data. Deep learning tends to do far better on image, text, speech, and video data; typically, no manual feature engineering is needed because the layers of a neural network learn the features automatically.
- In supervised learning, we have both inputs and outputs (labels) in the dataset, and the model learns to predict the output during training. Each input can be represented as a list of features. The output, or label, can be one of a finite set of classes (classification), a real number (regression), or something more complex. A classic example of supervised classification is Iris flower classification. In this case, the dataset includes features such as petal length, petal width, sepal length, and sepal width, and the labels are the species of the Iris flowers (setosa, versicolor, or virginica). A model can be trained on this dataset and then used to classify new, unseen Iris flowers as one of these species (a minimal code sketch follows this list).
- In unsupervised learning, models either don’t have access to labels or don’t use them; instead, they try to find structure in the data – for example, by clustering the examples in the dataset into different groups.
- In reinforcement learning, the model learns by trial and error, optimizing a reward signal. An example would be training a model to play chess and adjusting its strategy based on the feedback it receives through rewards and penalties.
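As a minimal sketch of the Iris example mentioned above, the following code trains a classifier on the Iris dataset and classifies unseen flowers. It assumes scikit-learn is installed; the choice of logistic regression here is purely illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset: 4 features (petal/sepal length and width), 3 species
X, y = load_iris(return_X_y=True)

# Hold out a test set of "unseen" flowers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train a simple classifier on the labeled examples
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Classify new, unseen Iris flowers and check accuracy
y_pred = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
```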
In supervised learning (which is the focus of this book), there are two main types of problems: classification and regression. Classification problems involve categorizing data into predefined classes or labels, such as “fraud” or “non-fraud” and “spam” or “non-spam.” On the other hand, regression problems aim to predict a continuous variable, such as the price of a house.
While data imbalance can also affect regression problems, this book will concentrate solely on classification problems. This focus is due to several factors, such as the limited scope of this book and the well-established techniques available for classification. In some cases, you might even be able to reframe a regression problem as a classification problem, making the methods discussed in this book still relevant.
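As one possible illustration of such reframing, a continuous target such as a house price can be binned into discrete classes. The prices and bin edges below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical continuous target: house prices in thousands of dollars
prices = pd.Series([120, 340, 95, 510, 275, 430, 180])

# Reframe regression as classification by binning prices into price bands
# (the bin edges are arbitrary, chosen only for illustration)
price_band = pd.cut(
    prices,
    bins=[0, 200, 400, np.inf],
    labels=["low", "mid", "high"],
)
print(price_band.value_counts())
```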
When it comes to classification problems, we have quite a few categories of popular classical supervised machine learning models:
- Logistic regression: This is a supervised machine learning algorithm that’s used for binary classification problems. It predicts the probability of a binary target variable based on a set of predictor variables (features) by fitting a logistic function to the data, which outputs a value between 0 and 1.
- Support Vector Machines (SVMs): These are supervised machine learning algorithms that are mainly used for classification and can be extended to regression problems. SVMs classify data by finding the optimal hyperplane that maximally separates the different classes in the input data, thus making them a powerful tool for binary and multiclass classification tasks.
- K-Nearest Neighbors (KNN): This is a supervised machine learning algorithm that’s used for classification and regression analysis. It predicts the target variable based on the k-nearest neighbors in the training dataset. The value of k determines the number of neighbors to consider when making a prediction, and it can be tuned to optimize the model’s performance.
- Tree models: These are a type of supervised machine learning algorithm that’s used for classification and regression analysis. They recursively split the data into smaller subsets based on the most important features to create a decision tree that predicts the target variable based on the input features.
- Ensemble models: These combine multiple individual models to improve predictive accuracy and reduce overfitting (explained later in this chapter). Ensemble techniques include bagging (for example, random forest), boosting (for example, XGBoost), and stacking. They are commonly used for classification as well as regression analysis.
- Neural networks: These models are inspired by the human brain, consist of multiple layers with numerous neurons in each, and are capable of learning complex functions. We will discuss these in more detail in Chapter 6, Data Imbalance in Deep Learning.
Figure 1.4 displays the decision boundaries of the various classifiers we have reviewed so far. It shows that logistic regression has a linear decision boundary, while tree-based models such as decision trees, random forests, and XGBoost divide the examples into axis-parallel rectangles to form their decision boundary. SVMs, on the other hand, transform the data into a different space so that they can form a non-linear decision boundary. Neural networks also have a non-linear decision boundary:
Figure 1.4 – The decision boundaries of popular machine learning algorithms on an imbalanced dataset
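To make the comparison concrete, here is a rough sketch (not the code used to produce Figure 1.4) that fits several of the classifiers listed above on a synthetic imbalanced dataset and reports their test accuracy. The dataset parameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced dataset: roughly 90% negatives, 10% positives
X, y = make_classification(
    n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

classifiers = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
}

# Fit each classifier and compare accuracy on the held-out test set
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```

Note that, on imbalanced data like this, plain accuracy can be misleading – a theme we return to throughout the book.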
Next, we’ll delve into the principles underlying the process of model training.
What happens during model training?
In the training phase of a machine learning model, we provide a dataset consisting of examples, each with input features and a corresponding label, to the model. Let $X$ represent the list of features used for training, and $y$ be the list of labels in the training dataset. The goal of the model is to learn a function, $f$, such that $y = f(X)$.
The model has adjustable parameters, denoted as $\theta$, which are fine-tuned during the training process. The error function, commonly referred to as the loss function, is defined as $L(y, f(X; \theta))$. This error function needs to be minimized by a learning algorithm, which finds the optimal setting of these parameters, $\theta^* = \arg\min_{\theta} L(y, f(X; \theta))$.
In classification problems, our typical loss function is the cross-entropy loss (also called the log loss). For a binary problem, it is as follows:

$$L(y, p) = -\big(y \log(p) + (1 - y) \log(1 - p)\big)$$

Here, $p$ is the predicted probability from the model that $y = 1$.
When the model’s prediction closely agrees with the target label, the loss function will approach zero. However, when the prediction deviates significantly from the target, the loss can become arbitrarily large, indicating a poor model fit.
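As a quick numeric sketch of this behavior, the binary cross-entropy above can be computed directly and cross-checked against scikit-learn's log_loss; the probabilities below are made up:

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy: -(y*log(p) + (1-y)*log(1-p)), averaged."""
    y_true, p = np.asarray(y_true, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = [1, 1, 0, 0]

# Confident, mostly correct predictions -> loss close to zero
good_p = [0.95, 0.90, 0.05, 0.10]
# Confident but wrong predictions -> very large loss
bad_p = [0.05, 0.10, 0.95, 0.90]

print(binary_cross_entropy(y_true, good_p))  # small value
print(binary_cross_entropy(y_true, bad_p))   # large value
print(log_loss(y_true, good_p))              # matches the manual computation
```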
As training progresses, the training loss keeps going down (Figure 1.5):
Figure 1.5 – Rate of change of the loss function as training progresses
This brings us to the concept of the fit of a model:
- A model is said to underfit if it is too simple and can’t capture the data’s complexity. It performs poorly on both training and new data.
- A model has the right fit if it accurately captures the data’s patterns without learning the noise. It performs well on both training and new data.
- An overfit model is too complex and learns the noise along with the data’s patterns. It performs well on training data but poorly on new data (Figure 1.6); a short code sketch illustrating these regimes follows the figure:
Figure 1.6 – Underfit, right fit, and overfit models for classification task
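Here is a rough sketch of these three regimes (the dataset and depth values are illustrative assumptions): varying a decision tree’s maximum depth exposes the gap between training and test performance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic dataset, so memorizing the training data hurts on new data
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Shallow tree -> underfit; moderate depth -> closer to the right fit;
# unbounded depth -> overfit (high train score, lower test score)
for depth, regime in [(1, "underfit"), (4, "right fit"), (None, "overfit")]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth} ({regime}): "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```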
Next, let’s briefly look at two important concepts in machine learning:
- Regularization is a set of techniques used to prevent a model from overfitting the training data. One type of regularization (namely, L1 or L2) adds a penalty term to the loss function, which encourages the model to have smaller weights and reduces its complexity. This helps prevent the model from fitting too closely to the training data so that it generalizes better to unseen data (a brief sketch follows this list).
- Feature engineering is the process of selecting and transforming the input features of a model to improve its performance. Feature engineering involves selecting the most relevant features for the problem, transforming them to make them more informative, and creating new features from the existing ones. Good feature engineering can make a huge difference in the performance of a model and can often be more important than the choice of algorithm or hyperparameters.
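To make the regularization point concrete, here is a small sketch comparing weak and strong L2 regularization with scikit-learn’s logistic regression (in scikit-learn, a smaller C means a stronger penalty; the dataset is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with many features, several of them uninformative
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=42)

# C is the inverse of regularization strength:
# small C -> strong L2 penalty -> smaller weights, simpler model
for C in [100.0, 0.01]:
    model = LogisticRegression(penalty="l2", C=C, max_iter=1000)
    model.fit(X, y)
    print(f"C={C}: mean |weight| = {np.mean(np.abs(model.coef_)):.3f}")
```

The stronger penalty shrinks the learned weights toward zero, which is exactly the effect described in the bullet above.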