In this section, we review popular ML terms. This non-exhaustive review serves as a quick refresher and will enable us to follow the projects covered in this book without any hiccups.
ML terminology – a quick review
Deep learning
This is a revolutionary trend and has become a super-hot topic in recent times in the ML world. It is a category of ML algorithms that use artificial neural networks (ANNs) with multiple hidden layers of neurons to address problems.
Applying deep learning to several real-world problems has produced superior results. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), autoencoders (AEs), generative adversarial networks (GANs), and deep belief networks (DBNs) are some of the popular deep learning methods.
Big data
The term refers to large volumes of data that combine both structured data types (rows and columns similar to a table) and unstructured data types (text documents, voice recordings, image data, and so on). Due to the volume of data, it typically does not fit into the main memory of the hardware where ML algorithms need to be executed. Separate strategies are needed to work on these large volumes of data. Distributed processing of the data and combining the results (typically called MapReduce) is one strategy. It is also possible to process the data sequentially in chunks, each just small enough to fit in main memory, and store the intermediate results on a hard drive; this process is repeated until the entirety of the data has been processed. After the data processing, the intermediate results need to be combined to obtain the final results for the whole dataset.
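The chunk-by-chunk strategy described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the "dataset" is a generator of numbers, the quantity of interest is the overall mean, and the function names (`read_in_chunks`, `summarize_chunk`, `combine`) are made up for this example rather than taken from Hadoop, Spark, or any other framework.

```python
from typing import Iterable, Iterator, List, Tuple


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks that are small enough to fit in memory."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # last, possibly shorter, chunk
        yield chunk


def summarize_chunk(chunk: List[float]) -> Tuple[float, int]:
    """Reduce one chunk to a small partial result: (sum, count)."""
    return sum(chunk), len(chunk)


def combine(partials: Iterable[Tuple[float, int]]) -> float:
    """Combine the partial results of all chunks into the overall mean."""
    total, count = 0.0, 0
    for partial_sum, partial_count in partials:
        total += partial_sum
        count += partial_count
    return total / count


# A generator stands in for a dataset that is too large to load at once.
data_stream = (float(i) for i in range(1, 1001))
partials = [summarize_chunk(c) for c in read_in_chunks(data_stream, chunk_size=100)]
overall_mean = combine(partials)
print(overall_mean)  # 500.5
```

Only the small `(sum, count)` pairs are kept between chunks, which is why the approach works even when the full data cannot be held in memory.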
Special technologies such as Hadoop and Spark are required to perform ML on big data. Needless to say, you will need to hone specialized skills in order to apply ML algorithms successfully using these technologies on big data.
Natural language processing
Natural language processing (NLP) is an application area of ML that aims for computers to comprehend human languages such as English, French, and Mandarin. NLP applications enable users to interact with computers using natural language, whether spoken or written.
Chatbots, speech synthesis, machine translation, text classification and clustering, text generation, and text summarization are some of the popular applications of NLP.
Computer vision
This field of ML tries to mimic human vision. The aim is to enable computers to see, process, and determine the objects in images or videos. Deep learning and the availability of powerful hardware have led to the rise of very powerful applications in this area of ML.
Autonomous vehicles such as self-driving cars, object recognition, object tracking, motion analysis, and the restoration of images are some of the applications of computer vision.
Cost function
The terms cost function, loss function, and error function are used interchangeably by practitioners. Each is used to define and measure the error of a model. The objective of the ML algorithm is to minimize this loss over the training dataset.
Some examples of cost functions are square loss, which is used in linear regression; hinge loss, which is used in support vector machines; and 0/1 loss, which is used to measure accuracy in classification algorithms.
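The three losses named above are short enough to write out directly. The following is a minimal sketch using their standard per-sample definitions; the function names and the toy numbers are chosen only for illustration. For hinge loss, labels are assumed to be encoded as -1 and +1 and the model output is a raw score.

```python
def square_loss(y_true: float, y_pred: float) -> float:
    """Squared difference between the target and the prediction."""
    return (y_true - y_pred) ** 2


def hinge_loss(y_true: int, score: float) -> float:
    """Hinge loss; y_true must be -1 or +1, score is the raw model output."""
    return max(0.0, 1.0 - y_true * score)


def zero_one_loss(y_true, y_pred) -> int:
    """0 for a correct prediction, 1 for an incorrect one."""
    return 0 if y_true == y_pred else 1


def mean_loss(loss_fn, targets, outputs) -> float:
    """Average a per-sample loss over a dataset."""
    return sum(loss_fn(t, o) for t, o in zip(targets, outputs)) / len(targets)


print(mean_loss(square_loss, [3.0, 5.0], [2.0, 7.0]))  # 2.5
print(hinge_loss(1, 0.5))   # 0.5 (correct side, but inside the margin)
print(hinge_loss(1, 2.0))   # 0.0 (correct and outside the margin)
print(hinge_loss(-1, 0.5))  # 1.5 (wrong side)
print(mean_loss(zero_one_loss, ["cat", "dog"], ["cat", "cat"]))  # 0.5
```

Note that the mean 0/1 loss is simply one minus the accuracy.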
Model accuracy
Accuracy is one of the popular metrics used to measure the performance of ML models. The measurement is easy to understand and helps the practitioner to communicate the goodness of a model very easily to its business users.
Generally, this metric is used for classification problems. Accuracy is measured as the number of correct predictions divided by the total number of predictions.
Confusion matrix
This is a table that describes the classification model's performance. It is an n × n matrix, where n represents the number of classes that are predicted by the classification model. It is formed by counting the correct and incorrect predictions made by the model when compared to the actual labels.
Confusion matrices are better explained with an example. Assume that a dataset contains 100 images: 50 dog images and 50 cat images. A model that is built to classify images as cat images or dog images is given this dataset. The output from the model shows that 40 dog images and 20 cat images are classified correctly. The following table is the confusion matrix constructed from the prediction output of the model, with the actual labels in the rows and the model predicted labels in the columns:
| Actual labels | Predicted: cats | Predicted: dogs |
|---|---|---|
| cats | 20 | 30 |
| dogs | 10 | 40 |
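The counts in this example can be reproduced from raw labels. The sketch below rebuilds the 100 actual and predicted labels implied by the example (the ordering of the images is an assumption made for convenience), tallies them into a matrix whose rows are actual classes and whose columns are predicted classes, and derives accuracy from the diagonal.

```python
from collections import Counter

# 50 actual cats followed by 50 actual dogs.
actual = ["cat"] * 50 + ["dog"] * 50
# 20 cats predicted correctly, 30 cats predicted as dogs,
# 10 dogs predicted as cats, 40 dogs predicted correctly.
predicted = ["cat"] * 20 + ["dog"] * 30 + ["cat"] * 10 + ["dog"] * 40

classes = ["cat", "dog"]
counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes.
matrix = [[counts[(a, p)] for p in classes] for a in classes]
print(matrix)  # [[20, 30], [10, 40]]

# Correct predictions sit on the diagonal of the matrix.
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / len(actual)
print(accuracy)  # 0.6
```

Each row sums to 50, the number of actual images of that class, which is a quick sanity check on any confusion matrix laid out this way.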
Predictor variables
These variables are otherwise called independent variables or x-values. These are the input variables that help to predict the dependent or target or response variable.
In a house rent prediction use case, the size of the house in square feet, the number of bedrooms, the number of unoccupied houses available in the region, the proximity to public transport, and the accessibility of facilities such as hospitals and schools are all examples of predictor variables that help determine the rental cost of the house.
Response variable
The terms dependent variable, target, and y-value are used interchangeably by practitioners as alternatives for the term response variable. This is the variable the model predicts as output based on the independent variables that are provided as input to the model.
In the house rent prediction use case, the rent predicted is the response variable.
Dimensionality reduction
Dimensionality reduction, also referred to as feature reduction (or feature selection, when a subset of the original variables is kept), is the process of reducing the input set of independent variables to a smaller number of variables that are really required by the model to predict the target.
In certain cases, it is possible to represent multiple independent variables by combining them together without losing much information. For example, instead of having two independent variables such as the length of a rectangle and the breadth of a rectangle, the dimensions can be represented by only one variable called the area that represents both the length and breadth of the rectangle.
There are several reasons to perform dimensionality reduction on a given input dataset:
- It aids data compression, so the data can be accommodated in a smaller amount of disk space.
- The time to process the data is reduced as fewer dimensions are used to represent the data.
- It removes redundant features from datasets. Redundancy among features is typically known as multicollinearity in data.
- Reducing the data to fewer dimensions helps visualize the data through graphs and charts.
- Dimensionality reduction removes noisy features from the dataset which, in turn, can improve the model performance.
There are many ways by which dimensionality reduction can be attained in a dataset. The use of filters, such as information gain filters, and symmetric attribute evaluation filters, is one way. Genetic-algorithm-based selection and principal component analysis (PCA) are other popular techniques used to achieve dimensionality reduction. Hybrid methods do exist to attain feature selection.
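As a small illustration of the filter idea, the sketch below removes redundant (highly correlated) features. It is a plain correlation filter, chosen here because it is short and directly targets multicollinearity; it is not an information gain filter, a genetic algorithm, or PCA. The feature names, values, and the 0.95 threshold are all hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_redundant(features: Dict[str, List[float]], threshold: float = 0.95) -> List[str]:
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept: List[str] = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical house data: size in square metres duplicates size in square feet.
features = {
    "size_sqft": [500, 750, 1000, 1250, 1500],
    "size_sqm": [46.5, 69.7, 92.9, 116.1, 139.4],
    "bedrooms": [1, 3, 2, 2, 4],
}
kept = drop_redundant(features)
print(kept)  # ['size_sqft', 'bedrooms']
```

The result depends on the order in which features are visited: the first feature of a correlated group is the one that survives.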
Class imbalance problem
Let's assume that one needs to build a classifier that identifies cat and dog images. The problem has two classes namely cat and dog. If one were to train a classification model, training data is required. The training data in this case is based on images of dogs and cats given as input so a supervised learning model can learn the features of dogs versus cats.
It may so happen that, of the 100 images available for training in the dataset, 95 are dog pictures and only five are cat pictures. This kind of unequal representation of different classes in a training dataset is termed a class imbalance problem.
Most ML techniques work best when the number of examples in each class is roughly equal. One can employ certain techniques to counter class imbalance problems in data. One technique is to reduce the majority class (images of dogs) samples and make them equal to the minority class (images of cats). In this case, there is information loss as a lot of the dog images go unused. Another option is to generate synthetic data similar to the data for the minority class (images of cats) so as to make the number of data samples equal to the majority class. Synthetic minority over-sampling technique (SMOTE) is a very popular technique for generating synthetic data.
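The first technique, reducing the majority class, can be sketched with random undersampling as follows. The file names are placeholders, the function name `undersample` is made up for this example, and a fixed seed is used only to make the run repeatable. SMOTE is not shown, since it needs numeric feature vectors and a nearest-neighbor search.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, labels, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    minority_size = min(len(items) for items in by_class.values())
    rng = random.Random(seed)

    balanced = []
    for label, items in by_class.items():
        # rng.sample draws without replacement.
        for item in rng.sample(items, minority_size):
            balanced.append((item, label))
    rng.shuffle(balanced)
    return balanced


# Placeholder file names standing in for 95 dog and 5 cat images.
images = [f"dog_{i}.jpg" for i in range(95)] + [f"cat_{i}.jpg" for i in range(5)]
labels = ["dog"] * 95 + ["cat"] * 5

balanced = undersample(images, labels)
class_counts = Counter(label for _, label in balanced)
print(len(balanced))          # 10
print(class_counts["dog"], class_counts["cat"])  # 5 5
```

Ninety of the dog images are discarded here, which is exactly the information loss mentioned above.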
It may be noted that accuracy is not a good metric for evaluating the performance of models where the dataset suffers from class imbalance. Consider a model built on a class-imbalanced dataset that predicts the majority class for every test sample it is given. If roughly 95% of the images in the test dataset are dog images, this model gets 95% accuracy. But this performance is illusory, as the model does not have any discriminative power: it just predicts dog as the class for any image it needs to predict about. The model gets away with a very high accuracy, which suggests that it is a great model when in reality it is not!
Several other performance metrics are better suited to situations where class imbalance is a problem. The F1 score and the area under the receiver operating characteristic curve (AUROC) are some of the popular ones.
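The gap between accuracy and the F1 score for a majority-class predictor is easy to verify numerically. The sketch below assumes a test set of 95 dog and 5 cat images and a model that always answers dog; the F1 score is computed for the minority class (cat) using its standard definition as the harmonic mean of precision and recall, and is taken to be 0 when there are no true positives.

```python
def accuracy(actual, predicted) -> float:
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def f1_score(actual, predicted, positive) -> float:
    """F1 score for one class, treated as the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


actual = ["dog"] * 95 + ["cat"] * 5
predicted = ["dog"] * 100  # the model always predicts the majority class

print(accuracy(actual, predicted))              # 0.95
print(f1_score(actual, predicted, "cat"))       # 0.0
```

The F1 score of zero for the cat class exposes what the 95% accuracy hides: the model never identifies a single cat.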
Model bias and variance
While several ML algorithms are available to build models, model selection can be done on the basis of the bias and variance errors that the models produce.
Bias error occurs when the model has a limited capability to learn the true signals from a dataset provided as input to it. Having a highly biased model essentially means the model is consistent but inaccurate on average.
Variance error occurs when a model is too sensitive to the training dataset with which it is trained. Having high variance in a model essentially means that its predictions are accurate on average but inconsistent: they change considerably when the model is trained on a different sample of the data.
Underfitting and overfitting
Underfitting and overfitting are concepts closely associated with bias and variance. They are among the biggest causes of poor model performance, so a practitioner has to pay very close attention to these issues while building ML models.
A situation where the model performs well on neither the training data nor the test data is termed underfitting. This situation can be detected by observing high training errors and high test errors. Having an underfitting problem usually means that the ML algorithm chosen to fit the model is not suitable to model the features of the training data. A common remedy is to try other, typically more flexible, kinds of ML algorithms to model the data.
Overfitting is a situation where the model has learned the features of the training data so well that it fails to generalize to other unseen data. An overfitting model treats noise or random fluctuations in the training data as true signals and looks for these patterns in unseen data as well, which results in poor model performance.
Overfitting is more prevalent in non-parametric and non-linear models such as decision trees and neural networks. Pruning the trees is one remedy for decision trees. For neural networks, another remedial measure is a technique called dropout, where some of the units (neurons) of the network are dropped randomly during training, making the model more generalizable to unseen data. Regularization is yet another technique to resolve overfitting problems. This is attained by penalizing the coefficients of the model so that the model generalizes better. The L1 penalty and the L2 penalty are the types of penalties through which regularization can be performed in regression scenarios.
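The effect of penalizing coefficients can be seen in the simplest possible regression: one feature, one weight, and no intercept. Under those assumptions, minimizing the sum of squared errors plus an L2 penalty `l2_penalty * w**2` has the closed-form solution used below. The function name and the data are made up for this sketch.

```python
def fit_slope(x, y, l2_penalty=0.0):
    """Weight w minimizing sum((y_i - w * x_i)**2) + l2_penalty * w**2.

    Setting the derivative with respect to w to zero gives
    w = sum(x_i * y_i) / (sum(x_i**2) + l2_penalty).
    """
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    return sum_xy / (sum_xx + l2_penalty)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x

print(fit_slope(x, y))                   # 2.0 (no penalty)
print(fit_slope(x, y, l2_penalty=10.0))  # 1.5 (weight shrunk toward zero)
```

A larger penalty pulls the weight further toward zero, trading a little training error for a simpler model.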
The goal for a practitioner is to ensure that the model neither overfits nor underfits. To achieve this, it is essential to learn when to stop training the ML model. One could plot the training error and the validation error (an error that is measured on a small portion of the training dataset that is kept aside) on a chart and identify the point where the training error keeps decreasing but the validation error starts to rise.
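Instead of reading the stopping point off a chart, it can be located programmatically. The sketch below assumes that the validation error has been recorded once per training epoch and uses a simple patience rule: stop once the validation error has failed to improve for `patience` consecutive epochs. The error values and the function name are invented for illustration.

```python
def best_stopping_epoch(validation_errors, patience=2):
    """Return the epoch with the lowest validation error seen before
    `patience` consecutive epochs pass without any improvement."""
    best_epoch, best_error = 0, validation_errors[0]
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch


# Made-up error curves: training error keeps falling,
# validation error turns upward after epoch 3.
training_errors = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
validation_errors = [0.95, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.67]

stop_epoch = best_stopping_epoch(validation_errors)
print(stop_epoch)                     # 3
print(training_errors[stop_epoch])    # 0.35
print(validation_errors[stop_epoch])  # 0.5
```

Training past epoch 3 would keep lowering the training error while the validation error rises, which is the signature of overfitting.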
A performance measurement obtained on the training data often does not carry over to unseen data. A more realistic performance estimate can be obtained by adopting a data-resampling technique called k-fold cross validation. The k in k-fold cross validation refers to a number; examples include 3-fold cross validation, 5-fold cross validation, and 10-fold cross validation. The k-fold cross validation technique involves dividing the training data into k parts and running the training process k times. In each iteration, the training is performed on k - 1 partitions of the data and the remaining partition is used exclusively for testing. It may be noted that the partition held out for testing changes in each iteration, so the training data and testing data do not stay constant across iterations, and every partition is used for testing exactly once. The performance measurements from the k iterations are then averaged. This approach gives a realistic, slightly pessimistic measurement of the performance that can be expected from the model on unseen data in the future.
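The partitioning logic of k-fold cross validation can be sketched as a generator of index lists, with one iteration per held-out partition. This is a bare-bones illustration: the function name is made up, the data is assumed to have been shuffled beforehand, and the actual model training and scoring are left out.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once for each of the k partitions."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size

    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=5))
print(len(splits))  # 5
for train, test in splits:
    print(test, train)
# The first line printed by the loop is: [0, 1] [2, 3, 4, 5, 6, 7, 8, 9]
```

Every sample appears in a test partition exactly once and is never in the training and test partitions of the same iteration.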
10-fold cross validation repeated 10 times is considered to be a gold standard estimate of a model's performance among practitioners. Estimating the model's performance in this way is recommended in industrial setups and for critical ML applications.
Data preprocessing
This is essentially a step that is adopted in the early stages of an ML project pipeline. Data preprocessing involves transforming the raw data into a format that is acceptable as input by ML algorithms.
Feature hashing, missing value imputation, and transforming variables from numeric to nominal and vice versa are a few of the numerous steps that can be applied to data during preprocessing.
Transforming raw text documents into word vectors is an example of data preprocessing. The word vectors thus obtained can be fed to an ML algorithm to achieve document classification or document clustering.
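One simple way to turn text into vectors is a bag-of-words count representation, sketched below. This is only one of several possible meanings of "word vectors" and is chosen here as an assumption because it needs nothing beyond the standard library; tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter
from typing import List


def build_vocabulary(documents: List[str]) -> List[str]:
    """Sorted list of the distinct words found in all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_count_vector(document: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


documents = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(doc, vocabulary) for doc in documents]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

Each document becomes a fixed-length numeric vector, which is the form most ML algorithms expect as input.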
Holdout sample
While working on a training dataset, a small portion of the data is kept aside for testing the performance of the models. This small portion is unseen data (not used in training), so one can rely on the measurements obtained on it. The measurements obtained can be used to tune the parameters of the model or just to report the performance of the model, so as to set expectations about the level of performance that can be expected from it.
It may be noted that a performance measurement reported on the basis of a holdout sample is not as robust an estimate as a k-fold cross validation estimate. This is because some unknown biases could have crept in during the random split of the holdout set from the original dataset. Also, there is no guarantee that the holdout dataset has a representation of all the classes involved in the training dataset. If we need representation of all classes in the holdout dataset, then a special technique called a stratified holdout sample needs to be applied. This ensures that there is representation for all classes in the holdout dataset. A performance measurement obtained from a stratified holdout sample is therefore generally a better estimate of performance than one obtained from a nonstratified holdout sample.
Training data to holdout data splits of 70%-30%, 80%-20%, and 90%-10% are commonly observed in ML projects.
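A stratified holdout split can be sketched by splitting each class separately and then pooling the results. The function name, the 80%-20% ratio, and the dog and cat counts below are illustrative assumptions; the rule of always holding out at least one sample per class is also a choice made for this sketch.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(labels, holdout_fraction=0.2, seed=0):
    """Return (train_indices, holdout_indices) with every class
    represented in the holdout set in roughly the same proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, holdout = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one sample of every class.
        n_holdout = max(1, round(len(indices) * holdout_fraction))
        holdout.extend(indices[:n_holdout])
        train.extend(indices[n_holdout:])
    return sorted(train), sorted(holdout)


labels = ["dog"] * 80 + ["cat"] * 20
train_idx, holdout_idx = stratified_holdout(labels, holdout_fraction=0.2)

holdout_counts = Counter(labels[i] for i in holdout_idx)
print(len(train_idx), len(holdout_idx))               # 80 20
print(holdout_counts["dog"], holdout_counts["cat"])  # 16 4
```

The holdout set keeps the 80:20 class ratio of the full dataset, which a purely random split does not guarantee.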
Hyperparameter tuning
ML or deep learning algorithms take hyperparameters as input prior to training the model. Each algorithm comes with its own set of hyperparameters and some algorithms may have zero hyperparameters.
Hyperparameter tuning is an important step in model building. Each of the ML algorithms comes with some default hyperparameter values that are generally used to build an initial model, unless the practitioner manually overrides the hyperparameters. Setting the right combination of hyperparameters and the right hyperparameter values for the model greatly improves the performance of the model in most cases. Hence, it is strongly recommended that one does hyperparameter tuning as part of ML model building. Searching through the possible universe of hyperparameter values is a very time-consuming task.
The k in k-means clustering and k-nearest neighbors classification, the number of trees and the depth of trees in random forest, and eta in XGBoost are all examples of hyperparameters.
Grid search and Bayesian optimization-based hyperparameter tuning are two popular methods of hyperparameter tuning among practitioners.
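Grid search amounts to trying every combination of candidate values and keeping the best one, as the sketch below shows. The scoring function is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score"; the hyperparameter names `n_trees` and `max_depth` and their candidate values are likewise chosen only for illustration.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid and
    return the best (params, score) pair; higher scores are better."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(n_trees, max_depth):
    """Hypothetical stand-in for training a model and scoring it on
    validation data; it peaks at n_trees=200, max_depth=6."""
    return -((n_trees - 200) ** 2) / 10000 - (max_depth - 6) ** 2


param_grid = {"n_trees": [50, 100, 200, 400], "max_depth": [2, 4, 6, 8]}
best_params, best_score = grid_search(param_grid, toy_validation_score)
print(best_params)  # {'n_trees': 200, 'max_depth': 6}
```

With 4 values for each of 2 hyperparameters, 16 models would need to be trained; the count multiplies with every hyperparameter added, which is why the search is so time-consuming.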
Performance metrics
A model needs to be evaluated on unseen data to assess its goodness. The term goodness may be expressed in several ways and these ways are termed as model performance metrics.
Several metrics exist to report the performance of models. Accuracy, precision, recall, F-score, sensitivity, specificity, area under the receiver operating characteristic curve (AUROC), root mean squared error (RMSE), Hamming loss, and mean squared error (MSE) are some of the popular model performance metrics.
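A few of these metrics can be computed directly from their standard definitions. In the sketch below, the classification counts reuse the cat and dog example from the confusion matrix entry, with cat treated as the positive class (an assumption made here), and the regression values are made up.

```python
from math import sqrt


def precision(tp: int, fp: int) -> float:
    """Of everything predicted positive, the fraction that is truly positive."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Of everything truly positive, the fraction found (also called sensitivity)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Of everything truly negative, the fraction correctly rejected."""
    return tn / (tn + fp)


def mse(actual, predicted) -> float:
    """Mean squared error for regression predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted) -> float:
    """Root mean squared error: the square root of the MSE."""
    return sqrt(mse(actual, predicted))


# Cat as the positive class: 20 cats found, 30 cats missed,
# 10 dogs wrongly called cats, 40 dogs correctly called dogs.
tp, fn, fp, tn = 20, 30, 10, 40
print(recall(tp, fn))       # 0.4
print(specificity(tn, fp))  # 0.8
print(round(precision(tp, fp), 3))  # 0.667

print(mse([10.0, 20.0], [13.0, 16.0]))  # 12.5
```

RMSE is expressed in the same units as the target variable, which often makes it easier to communicate than MSE.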
Feature engineering
Feature engineering is the art of creating new features either from existing data in the dataset or by procuring additional data from an external data source. It is done with the intent that adding additional features improves the model performance. Feature engineering generally requires domain expertise and in-depth business problem understanding.
Let's take a look at an example of feature engineering—for a bank that is working on a loan defaulter prediction project, sourcing and supplementing the training dataset with information on the unemployment trends of the region for the past few months might improve the performance of the model.
Model interpretability
Often, in a business environment when ML models are built, just reporting the performance measurements obtained to confirm the goodness of the model may not be enough. The stakeholders generally are inquisitive to understand the whys of the model, that is, what are the factors contributing to the model's performance? In other words, the stakeholders want to understand the causes of the effects. Essentially, the expectation from the stakeholders is to understand the importance of various features in the model and the direction in which each of the variables impacts the model.
For example, does a feature of time spent on exercising every day in the dataset for a cancer prediction model have any impact on the model predictions at all? If so, does time spent on exercising every day push the prediction in a negative direction or positive direction?
While the example might sound simple to generate an answer for, in real-world ML projects, model interpretability is not so simple due to the complex relationships between variables. It is seldom that one feature, in isolation, impacts the prediction in any one direction. It is usually a combination of features that impacts the prediction outcome. Thus, it is even more difficult to explain to what extent an individual feature is impacting the prediction.
Linear models are generally easier to explain, even to business users. This is because we obtain weights for various features as a result of model training with linear algorithms. These weights are direct indicators of how a feature is contributing to model prediction. After all, in a linear model, a prediction is the linear combination of model weights and features passed through a function. It should be noted that interactions between variables in the real world are not necessarily linear. So, a linear model trying to model underlying data that has non-linear relationships may not have good predictive power. So, while linear models' interpretability is great, it can come at the cost of model performance.
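The way weights explain a linear prediction can be made concrete with the house rent use case. All weights, the intercept, and the feature values below are hypothetical numbers invented for this sketch, and the function applied to the linear combination is assumed to be the identity, as in plain linear regression.

```python
# Hypothetical weights of an already-trained linear rent model.
weights = {
    "size_sqft": 1.5,                 # rent rises with size
    "bedrooms": 150.0,                # rent rises with bedroom count
    "distance_to_transit_km": -80.0,  # rent falls with distance
}
intercept = 300.0

# One house to explain.
house = {"size_sqft": 1000, "bedrooms": 2, "distance_to_transit_km": 1.5}

# Each feature's contribution is simply weight * value.
contributions = {name: weights[name] * house[name] for name in weights}
predicted_rent = intercept + sum(contributions.values())

for name, value in contributions.items():
    direction = "raises" if value > 0 else "lowers"
    print(f"{name} {direction} the prediction by {abs(value)}")
print(predicted_rent)  # 1980.0
```

The sign of each weight gives the direction of a feature's effect, and the size of each contribution shows how much of this particular prediction it accounts for.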
In contrast, non-linear and non-parametric models tend to be very difficult to interpret. In most cases, it may not be apparent even to the person building the models what exactly the factors driving the prediction are, and in which direction they act. This is simply because the prediction outcome is a complex non-linear combination of variables. Non-linear models often perform better than linear models. Therefore, a trade-off is needed between model interpretability and model performance.
While the goal of model interpretability is difficult to achieve, there is some merit in accomplishing this goal. It helps with the retrospection of a model that is deemed as being a good performing model and confirming that no noise inadvertently existed in the data that is used for model building and testing. It is obvious that models with noise as features fail to generalize on unseen data. Model interpretability helps with making sure that no noise crept into the models as features. Also, it helps build trust with business users that are eventually consumers of the model output. After all, there is no point in building a model whose output is not going to be consumed!
Non-parametric, non-linear models are difficult, if not impossible, to interpret. Specialized ML methods are now available to aid the interpretability of black box models. Partial dependence plots (PDP), local interpretable model-agnostic explanations (LIME), and Shapley additive explanations (SHAP), which are based on Shapley values, are some of the popular methods used by practitioners to decipher the internals of a black box model.
Now that there is a good understanding of the various fundamental terms of ML, our next journey is to explore the details of the ML project pipeline. This journey discussed in the next section helps us understand the process of building an ML project, deploying it, and obtaining predictions to use in a business.