A soft introduction to machine learning
ML is about using a set of statistical and mathematical algorithms to perform tasks such as concept learning, predictive modeling, clustering, and mining useful patterns. The ultimate goal is to improve the learning in such a way that it becomes automatic, so that no further human interaction is needed, or at least so that the level of human interaction is reduced as much as possible.
We now refer to a famous definition of ML by Tom M. Mitchell (Machine Learning, Tom Mitchell, McGraw Hill), where he explained what learning really means from a computer science perspective:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Based on this definition, we can conclude that a computer program or machine can do the following:
- Learn from data and past observations, called training data
- Improve with experience
- Iteratively enhance a model that can be used to predict outcomes of questions
Almost every machine-learning algorithm we use can be treated as an optimization problem: finding parameters that minimize some objective function, typically a weighted sum of two terms, a cost function and a regularization term (the negative log-likelihood and the negative log-prior, respectively, in statistics).
Typically, an objective function has two components: a regularizer, which controls the complexity of the model, and the loss, which measures the error of the model on the training data (we'll look into the details later).
On the other hand, the regularization parameter defines the trade-off between the two goals of minimizing the loss on the training data and minimizing the model's complexity, in an effort to avoid overfitting. Now, if both of these components are convex, then their sum is also convex; otherwise, it is not guaranteed to be convex.
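To make this decomposition concrete, here is a minimal sketch in Python (with NumPy) of a ridge-style objective: a squared-error loss term plus an L2 regularizer weighted by a regularization parameter. The function name and arguments are illustrative, not taken from any particular library.

```python
import numpy as np

def regularized_objective(w, X, y, lam):
    """Ridge-style objective: squared-error loss plus an L2 regularizer.

    A minimal illustration of the loss + regularization decomposition
    discussed above; squared error and an L2 penalty are just one common
    (convex) choice.
    """
    residuals = X @ w - y                    # prediction error on the training data
    loss = 0.5 * np.mean(residuals ** 2)     # loss term: how well we fit the data
    penalty = 0.5 * lam * np.dot(w, w)       # regularizer: penalizes model complexity
    return loss + penalty                    # both terms are convex, so their sum is convex
```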
Note
In machine learning, overfitting is when the predictor model fits perfectly on the training examples, but does badly on the test examples. This often happens when the model is too complex and trivially fits the data (too many parameters), or when there is not enough data to accurately estimate the parameters. When the ratio of model complexity to training set size is too high, overfitting will typically occur.
More elaborately, when using an ML algorithm, our goal is to find the values of the parameters of a function that return the minimum error when making predictions. The error (loss) function typically has a U-shaped curve when visualized on a two-dimensional plane, and there exists a point that gives the minimum error.
Therefore, using a convex optimization technique, we can minimize the function until it converges toward the minimum error (that is, it tries to reach the middle region of the curve). When a problem is convex, it is usually easier to analyze the asymptotic behavior of the algorithm, which shows how fast it converges as the model observes more and more training data.
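As a toy illustration of converging toward the bottom of such a U-shaped curve, here is a minimal gradient-descent sketch on the convex function f(w) = (w - 3)^2; the learning rate and starting point are arbitrary illustrative choices:

```python
# A minimal gradient-descent sketch on a convex, U-shaped loss f(w) = (w - 3)^2.
def f(w):
    return (w - 3.0) ** 2

def grad_f(w):
    return 2.0 * (w - 3.0)

w = 10.0          # arbitrary starting point on the curve
lr = 0.1          # learning rate (step size)
for step in range(100):
    w -= lr * grad_f(w)   # move downhill along the negative gradient

print(w, f(w))    # w converges toward 3.0, the minimum of the U-shaped curve
```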
The challenge of ML is to allow a computer to learn how to automatically recognize complex patterns and make decisions as intelligently as possible. The entire learning process requires a dataset, as follows:
- Training set: This is the knowledge base used to fit the parameters of the machine-learning algorithm. During this phase, we use the training set to find the optimal weights (for example, with the back-prop rule); the parameters that have to be set before the learning process begins are called hyperparameters.
- Validation set: This is a set of examples used to tune the hyperparameters of an ML model. For example, we would use the validation set to find the optimal number of hidden units, or to determine a stopping point for the back-propagation algorithm. Some ML practitioners refer to it as the development set, or dev set.
- Test set: This is used for evaluating the performance of the model on unseen data, which is called model inferencing. After assessing the final model on the test set, the model should not be tuned any further.
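The three sets just described can be produced with a simple two-stage split. The following is a minimal sketch assuming scikit-learn is available; the synthetic data and the 60/20/20 proportions are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out the test set first, then split the remainder into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%

print(len(X_train), len(X_val), len(X_test))  # roughly 600 / 200 / 200
```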
Learning theory uses mathematical tools that derive from probability theory and information theory. Three learning paradigms will be briefly discussed:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
The following diagram summarizes the three types of learning, along with the problems they address:
Supervised learning is the simplest and most well-known automatic learning task. It is based on a number of pre-defined examples, in which the category to which each input belongs is already known. In this case, the crucial issue is the problem of generalization: after the analysis of a typically small sample of examples, the system should produce a model that works well for all possible inputs.
The following figure shows a typical workflow of supervised learning. An actor (for example, an ML practitioner, data scientist, data engineer, or ML engineer) performs ETL (Extraction, Transformation, and Load) and necessary feature engineering (including feature extraction, selection) to get the appropriate data, with features and labels.
Then he does the following:
- Splits the data into the training, development, and test sets
- Uses the training set to train an ML model
- Uses the validation set to validate the training against the overfitting problem and to tune regularization
- Evaluates the model's performance on the test set (that is, unseen data)
- If the performance is not satisfactory, he performs additional tuning to get the best model, based on hyperparameter optimization
- Finally, he deploys the best model into a production-ready environment
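The steps in this list could be sketched as follows in Python, assuming scikit-learn is available; the synthetic data, the logistic regression model, and the grid of C values are illustrative assumptions, not a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data stands in for the output of the ETL/feature-engineering step.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 1. Split the data (here, validation is handled via cross-validation).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2-4. Train and validate, tuning the hyperparameter C by grid search.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluate the best model on the unseen test set.
best_model = search.best_estimator_
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
# 6. If the performance is satisfactory, best_model would be persisted and deployed.
```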
In the overall lifecycle, there might be many actors involved (for example, a data engineer, data scientist, or ML engineer) who perform each step independently or collaboratively:
In supervised ML, the set consists of labeled data, that is, objects and their associated output values (class labels for classification, or numeric values for regression). This set of labeled examples, therefore, constitutes the training set. Most supervised learning algorithms share one characteristic: the training is performed by minimizing a particular loss or cost function, representing the output error produced by the system with respect to the desired output.
The supervised learning context includes classification and regression tasks: classification is used to predict which class a data point belongs to (a discrete value), while regression is used to predict continuous values:
In other words, the classification task predicts the label of the class attribute, while the regression task makes a numeric prediction of the class attribute.
In the context of supervised learning, unbalanced data refers to classification problems where we have unequal instances for different classes. For example, if we have a classification task for only two classes, balanced data would mean 50% preclassified examples for each of the classes.
If the input dataset is slightly unbalanced (for example, 60% of the data points for one class and 40% for the other), the learning process will require the input dataset to be split randomly into three sets, for example, 50% for the training set, 20% for the validation set, and the remaining 30% for the test set; ideally, the split should preserve the class proportions (stratified sampling).
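The following minimal sketch, assuming scikit-learn, shows such a stratified 50/20/30 split on a synthetic, slightly unbalanced dataset; the proportions and data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, slightly unbalanced data: roughly 60% of one class, 40% of the other.
X, y = make_classification(n_samples=1000, weights=[0.6, 0.4], random_state=1)

# stratify preserves the 60/40 class proportions in every split.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=1)           # 50% training
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.4, stratify=y_rest, random_state=1)  # 20% / 30%

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, np.bincount(labels) / len(labels))  # class ratios stay ~0.6/0.4
```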
In unsupervised learning, an input set is supplied to the system during the training phase. In contrast with supervised learning, the input objects are not labeled with their class. This type of learning is important because, in the human brain, it is probably far more common than supervised learning.
For classification, we assume that we are given a training dataset of correctly labeled data. Unfortunately, we do not always have that luxury when we collect data in the real world. In that case, the only object in the domain of the learning model is the observed data input, which is often assumed to be a set of independent samples from an unknown underlying probability distribution.
For example, suppose that you have a large collection of non-pirated and totally legal MP3s in a crowded and massive folder on your hard drive. How could you possibly group together songs without direct access to their metadata? One possible approach could be a mixture of various ML techniques, but clustering is often at the heart of the solution.
Now, what if you could build a clustering predictive model that could automatically group together similar songs, and organize them into your favorite categories such as "country", "rap" and "rock"? The MP3 would be added to the respective playlist in an unsupervised way. In short, unsupervised learning algorithms are commonly used in clustering problems:
See the preceding diagram to get an idea of a clustering technique being applied to solve this kind of problem. Although the data points are not labeled, we can still do the necessary feature engineering, and group a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other, than to those in other groups (clusters).
This is not an easy task, because a standard approach is to define a similarity measure between two objects and then look for clusters of objects that are more similar to each other than they are to the objects in other clusters. Once the clustering is done, the data points (that is, MP3 files) have been validated and we know the pattern of the data (that is, which type of MP3 file falls into which group).
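A minimal sketch of this idea follows, assuming scikit-learn. The song feature matrix here is hypothetical (random numbers stand in for audio features such as tempo or spectral centroid that would be extracted beforehand), and the choice of three clusters is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per MP3, columns such as tempo,
# spectral centroid, zero-crossing rate, and so on, extracted beforehand.
# Random numbers stand in here for real audio features.
rng = np.random.default_rng(7)
song_features = rng.normal(size=(500, 8))

# Scale the features so that no single feature dominates the similarity measure.
scaled = StandardScaler().fit_transform(song_features)

# Group the songs into three clusters (say, the hoped-for country/rap/rock split).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
playlist_ids = kmeans.fit_predict(scaled)

print(playlist_ids[:10])  # cluster index assigned to the first ten songs
```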
Reinforcement learning is an artificial intelligence approach that focuses on a system learning through its interactions with the environment. With reinforcement learning, the system adapts its parameters based on feedback received from the environment, which, in turn, evaluates the decisions the system makes. The following diagram shows a person making decisions in order to arrive at their destination. Suppose that, on your drive from home to work, you always choose the same route. However, one day your curiosity takes over and you decide to try a different route, in the hope of finding a shorter commute. This dilemma of trying out new routes, or sticking to the best-known route, is an example of exploration versus exploitation:
Another example is a system that models a chess player and uses the results of its previous moves to improve its performance. This is a system that learns through reinforcement.
Current research on reinforcement learning is highly interdisciplinary, including researchers specializing in genetic algorithms, neural networks, psychology, and control engineering.
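The commuting example can be sketched with a tiny epsilon-greedy strategy, a standard way to balance exploration and exploitation. The route names, commute times, and epsilon value below are illustrative assumptions.

```python
import random

# Two "routes" with unknown average commute times; the agent mostly exploits
# the best-known route but occasionally explores the other one.
true_times = {"usual_route": 30.0, "new_route": 25.0}   # unknown to the agent
estimates = {"usual_route": 0.0, "new_route": 0.0}
counts = {"usual_route": 0, "new_route": 0}
epsilon = 0.1   # probability of exploring a random route

for day in range(1000):
    if random.random() < epsilon:
        route = random.choice(list(true_times))           # explore
    else:
        route = min(estimates, key=estimates.get)         # exploit the best estimate
    observed = random.gauss(true_times[route], 3.0)       # noisy feedback from the environment
    counts[route] += 1
    estimates[route] += (observed - estimates[route]) / counts[route]  # running average

print(estimates)  # the agent learns that new_route is faster on average
```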
Simple ML methods that were used for normal-sized data analysis are no longer effective and should be replaced by more robust ML methods. Although classical ML techniques allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of these methods diminish with large and high-dimensional datasets.
This is where DL comes in; it is one of the most important developments in artificial intelligence in the last few years. DL is a branch of ML based on a set of algorithms that attempt to model high-level abstractions in data.
The development of DL occurred in parallel with the study of artificial intelligence, and especially with the study of neural networks. It was mainly in the 1980s that this area grew, thanks largely to Geoff Hinton and the ML specialists who collaborated with him. At that time, computer technology was not sufficiently advanced to allow a real improvement in this direction, so we had to wait for a greater availability of data and vastly improved computing power to see significant developments.
In short, DL algorithms are a set of Artificial Neural Networks (ANNs), which we will explore later, that can make better representations of large-scale datasets, in order to build models that learn these representations extensively. In this regard, Ian Goodfellow and others defined DL as follows:
"Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones".
Let's give an example. Suppose we want to develop a predictive analytics model, such as an animal recognizer, where our system has to resolve two problems:
- Classify if an image represents a cat or a dog
- Cluster dog and cat images
If we solve the first problem using a typical ML method, we must define the facial features (ears, eyes, whiskers, and so on), and write a method to identify which features (typically non-linear) are more important when classifying a particular animal.
However, at the same time, we cannot address the second problem, because classical ML algorithms for clustering images (such as K-means) cannot handle non-linear features.
DL algorithms take these two problems one step further: the most important features are extracted automatically, and the algorithm determines which of them matter most for classification or clustering. In contrast, when using a classic ML algorithm, we would have to provide the features manually.
In summary, the DL workflow would be as follows:
- A DL algorithm would first identify the edges that are most relevant when clustering cats or dogs
- It would then build on this hierarchically to find the various combinations of shapes and edges
- After consecutive hierarchical identification of complex concepts and features, it decides which of these features can be used to classify the animal, then takes out the label column and performs unsupervised training using an autoencoder, before doing the clustering.
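The unsupervised part of this workflow could be sketched as follows, assuming TensorFlow/Keras and scikit-learn are available. The flattened image vectors are stand-in random data, and the layer sizes, number of epochs, and two-cluster choice are illustrative assumptions, not a prescribed architecture.

```python
import numpy as np
from tensorflow import keras
from sklearn.cluster import KMeans

# Hypothetical image features flattened to vectors; random data stands in
# for the (unlabeled) cat/dog images.
X = np.random.rand(1000, 64).astype("float32")

# A small autoencoder: compress to an 8-dimensional code, then reconstruct.
inputs = keras.Input(shape=(64,))
code = keras.layers.Dense(32, activation="relu")(inputs)
code = keras.layers.Dense(8, activation="relu")(code)
decoded = keras.layers.Dense(32, activation="relu")(code)
decoded = keras.layers.Dense(64, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=64, verbose=0)  # unsupervised: target = input

# Use the learned (more abstract) codes as features for clustering.
encoder = keras.Model(inputs, code)
codes = encoder.predict(X, verbose=0)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(codes)
print(clusters[:10])  # cluster assignment for the first ten images
```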
Up to this point, we have seen that DL systems are able to recognize what an image represents. A computer does not see an image as we see it because it only knows the position of each pixel and its color. Using DL techniques, the image is divided into various layers of analysis. At a lower level, the software analyzes, for example, a grid of a few pixels, with the task of detecting a type of color or various nuances. If it finds something, it informs the next level, which at this point verifies whether that given color belongs to a larger form, such as a line.
The process continues at the upper levels until the software understands what is shown in the image. Software capable of doing these things is now widespread and is found, for example, in systems for recognizing faces or searching for an image on Google. In many cases, these are hybrid systems that combine more traditional IT solutions with artificial intelligence components.
The following diagram shows what we have discussed in the case of an image classification system. Each block gradually extracts features from the input image, processing the data that has already been processed by the previous blocks and extracting increasingly abstract features of the image, thus building the hierarchical representation of data that comes with a DL-based system.
More precisely, it builds the layers as follows:
- Layer 1: The system starts identifying the dark and light pixels
- Layer 2: The system identifies edges and shapes
- Layer 3: The system learns more complex shapes and objects
- Layer 4: The system learns which objects define a human face
This is shown in the following diagram:
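A minimal convolutional stack mirroring this layer-by-layer idea could look like the following, assuming TensorFlow/Keras is available; the input size, layer widths, and the face/not-face output are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Early layers respond to pixel intensities and edges, deeper layers to
# shapes and object parts; the final layer makes the face/not-face decision.
model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),
    layers.Conv2D(16, 3, activation="relu"),   # layer 1: light/dark pixel patterns
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),   # layer 2: edges and simple shapes
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),   # layer 3: more complex shapes and objects
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),     # layer 4: does this look like a human face?
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```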
In the previous section, we saw that when using a linear ML algorithm, we typically handle only a few hyperparameters.
However, when neural networks come into the picture, things become much more complex. In each layer, there are many parameters and hyperparameters, and the cost function becomes nonconvex.
Another reason is that the activation functions used in the hidden layers are nonlinear, so the cost is nonconvex. We'll discuss this phenomenon in more detail in later chapters.