Only learning matters

What does learning exactly mean? Simply, we can say that learning is the ability to change according to external stimuli and remember most of our previous experiences. So, machine learning is an engineering approach that gives maximum importance to every technique that increases or improves the propensity for changing adaptively. A mechanical watch, for example, is an extraordinary artifact, but its structure obeys stationary laws and becomes useless if something external is changed. This ability is peculiar to animals and, in particular, to human beings; according to Darwin’s theory, it's also a key success factor for the survival and evolution of all species. Machines, even if they don't evolve autonomously, seem to obey the same law.

Therefore, the main goal of machine learning is to study, engineer, and improve mathematical models that can be trained (once or continuously) with context-related data (provided by a generic environment) to infer the future and to make decisions without complete knowledge of all influencing elements (external factors). In other words, an agent (which is a software entity that receives information from an environment, picks the best action to reach a specific goal, and observes the results of it) adopts a statistical learning approach, trying to determine the right probability distributions, and use them to compute the action (value or decision) that is most likely to be successful (with the fewest errors).

I do prefer using the term inference instead of prediction, but only to avoid the weird (but not so uncommon) idea that machine learning is a sort of modern magic. Moreover, it's possible to introduce a fundamental statement: an algorithm can extrapolate general laws and learn their structure with relatively high precision, but only if they affect the actual data. So, the term prediction can be freely used, but with the same meaning adopted in physics or system theory. Even in the most complex scenarios, such as image classification with convolutional neural networks, every piece of information (geometry, color, peculiar features, contrast, and so on) is already present in the data and the model has to be flexible enough to extract and learn it permanently.

In the following sections, we will give you a brief description of some common approaches to machine learning. Mathematical models, algorithms, and practical examples will be discussed in later chapters.

Supervised learning

A supervised scenario is characterized by the concept of a teacher or supervisor, whose main task is to provide the agent with a precise measure of its error (directly comparable with output values). With actual algorithms, this function is provided by a training set made up of couples (input and expected output). Starting from this information, the agent can correct its parameters so as to reduce the magnitude of a global loss function. After each iteration, if the algorithm is flexible enough and data elements are coherent, the overall accuracy increases and the difference between the predicted and expected values becomes close to zero. Of course, in a supervised scenario, the goal is training a system that must also work with samples that have never been seen before. So, it's necessary to allow the model to develop a generalization ability and avoid a common problem called overfitting, which causes overlearning due to an excessive capacity (we're going to discuss this in more detail in the following chapters, however, we can say that one of the main effects of such a problem is the ability to only correctly predict the samples used for training, while the error rate for the remaining ones is always very high).

In the following graph, a few training points are marked with circles, and the thin blue line represents a perfect generalization (in this case, the connection is a simple segment):

Example of regression of a stock price with different interpolating curves

Two different models are trained with the same datasets (corresponding to the two larger lines). The former is unacceptable because it cannot generalize and capture the fastest dynamics (in terms of frequency), while the latter seems a very good compromise between the original trend, and has a residual ability to generalize correctly in a predictive analysis.

Formally, the previous example is called regression because it's based on continuous output values. Instead, if there is only a discrete number of possible outcomes (called categories), the process becomes a classification. Sometimes, instead of predicting the actual category, it's better to determine its probability distribution. For example, an algorithm can be trained to recognize a handwritten alphabetical letter, so its output is categorical (in English, there'll be 26 allowed symbols). On the other hand, even for human beings, such a process can lead to more than one probable outcome when the visual representation of a letter isn't clear enough to belong to a single category. This means that the actual output is better described by a discrete probability distribution (for example, with 26 continuous values normalized so that they always sum up to 1).

In the following graph, there's an example of classification of elements with two features. The majority of algorithms try to find the best separating hyperplane (in this case, it's a linear problem) by imposing different conditions. However, the goal is always the same: reducing the number of misclassifications and increasing the noise-robustness. For example, look at the triangular point that is closest to the plane (its coordinates are about [5.1 - 3.0]). If the magnitude of the second feature were affected by noise and so the value were quite smaller than 3.0, a slightly higher hyperplane could wrongly classify it. We're going to discuss some powerful techniques to solve these problems in later chapters:

Example of linear classification

Common supervised learning applications include the following:

Predictive analysis based on regression or categorical classification
Spam detection
Pattern detection
NLP
Sentiment analysis
Automatic image classification
Automatic sequence processing (for example, music or speech)

Unsupervised learning

This approach is based on the absence of any supervisor and therefore of absolute error measures. It's useful when it's necessary to learn how a set of elements can be grouped (clustered) according to their similarity (or distance measure). For example, looking at the previous graph, a human being can immediately identify two sets without considering the colors or the shapes. In fact, the circular dots (as well as the triangular ones) determine a coherent set; it is separate from the other one much more than how its points are internally separated. Using a metaphor, an ideal scenario is a sea with a few islands that can be separated from each other, considering only their mutual position and internal cohesion. Clearly, unsupervised learning provides an implicit descriptive analysis because all the pieces of information discovered by the clustering algorithm can be used to obtain a complete insight of the dataset. In fact, all objects share a subset of features, while they are different under other viewpoints. The aggregation process is also aimed to extend the characteristics of some points to their neighbors, assuming that the similarity is not limited to some specific features. For example, in a recommendation engine, a group of users can be clustered according to the preference expressed for some books. If the chosen criteria detected some analogies between users A and B, we can share the non-overlapping elements between the users. Therefore, if A has read a book that can be suitable for B, we are implicitly authorized to recommend it. In this case, the decision is made by considering a goal (sharing the features) and a descriptive analysis. However, as the model can (and should) manage unknown users too, its purpose is also predictive.

In the following graph, each ellipse represents a cluster and all the points inside its area can be labeled in the same way. There are also boundary points (such as the triangles overlapping the circle area) that need a specific criterion (normally a trade-off distance measure) to determine the corresponding cluster. Just as for classification with ambiguities (P and malformed R), a good clustering approach should consider the presence of outliers and treat them so as to increase both the internal coherence (visually, this means picking a subdivision that maximizes the local density) and the separation among clusters.

For example, it's possible to give priority to the distance between a single point and a centroid, or the average distance among points belonging to the same cluster and different ones. In this graph, all boundary triangles are close to each other, so the nearest neighbor is another triangle. However, in real-life problems, there are often boundary areas where there's a partial overlap, meaning that some points have a high degree of uncertainty due to their feature values:

Example of clustering with a bidimensional dataset split into four natural clusters

Another interpretation can be expressed by using probability distributions. If you look at the ellipses, they represent the area of multivariate Gaussians bound between a minimum and maximum variance. Considering the whole domain, a point (for example, a blue star) could potentially belong to all clusters, but the probability given by the first one (lower-left corner) is the highest, and so this determines the membership. Once the variance and mean (in other words, the shape) of all Gaussians become stable, each boundary point is automatically captured by a single Gaussian distribution (except in the case of equal probabilities). Technically, we say that such an approach maximizes the likelihood of a Gaussian mixture given a certain dataset. This is a very important statistical learning concept that spans many different applications, so it will be examined in more depth in the next chapter, Chapter 2, Important Elements in Machine Learning. Moreover, we're going to discuss some common clustering methodologies, considering both strong and weak points, and compare their performances for various test distributions.

Other important techniques involve the use of both labeled and unlabeled data. This approach is therefore called semi-supervised and can be adopted when it's necessary to categorize a large amount of data with a few complete (labeled) examples or when there's the need to impose some constraints to a clustering algorithm (for example, assigning some elements to a specific cluster or excluding others).

Commons unsupervised applications include the following:

Object segmentation (for example, users, products, movies, songs, and so on)
Similarity detection
Automatic labeling
Recommendation engines

Semi-supervised learning

There are many problems where the amount of labeled samples is very small compared with the potential number of elements. A direct supervised approach is infeasible because the data used to train the model couldn't be representative of the whole distribution, so therefore it's necessary to find a trade-off between a supervised and an unsupervised strategy. Semi-supervised learning has been mainly studied in order to solve these kinds of problems. The topic is a little bit more advanced and won't be covered in this book (the reader who is interested can check out Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing). However, the main goals that a semi-supervised learning approach pursues are as follows:

The propagation of labels to unlabeled samples considering the graph of the whole dataset. The samples with labels become attractors that extend their influence to the neighbors until an equilibrium point is reached.
Performing a classification training model (in general, Support Vector Machines (SVM); see Chapter 7, Support Vector Machines, for further information) using the labeled samples to enforce the conditions necessary for a good separation while trying to exploit the unlabeled samples as balancers, whose influence must be mediated by the labeled ones. Semi-supervised SVMs can perform extremely well when the dataset contains only a few labeled samples and dramatically reduce the burden of building and managing very large datasets.
Non-linear dimensionality reduction considering the graph structure of the dataset. This is one of most challenging problems due to the constraints existing in high-dimensional datasets (that is, images). Finding a low-dimensional distribution that represents the original one minimizing the discrepancy is a fundamental task necessary to visualize structures with more than three dimensions. Moreover, the ability to reduce the dimensionality without a significant information loss is a key element whenever it's necessary to work with simpler models. In this book, we are going to discuss some common linear techniques (such as Principal Component Analysis (PCA) that the reader will be able to understand when some features can be removed without impacting the final accuracy but with a training speed gain.

It should now be clear that semi-supervised learning exploits the ability of finding out separating hyperplanes (classification) together with the auto-discovery of structural relationships (clustering). Without a loss of generality, we could say that the real supervisor, in this case, is the data graph (representing the relationships) that corrects the decisions according to the underlying informational layer. To better understand the logic, we can imagine that we have a set of users, but only 1% of them have been labeled (for simplicity, let's suppose that they are uniformly distributed). Our goal is to find the most accurate labels for the remaining part. A clustering algorithm can rearrange the structure according to the similarities (as the labeled samples are uniform, we can expect to find unlabeled neighbors whose center is a labeled one). Under some assumptions, we can propagate the center's label to the neighbors, repeating this process until every sample becomes stable. At this point, the whole dataset is labeled and it's possible to employ other algorithms to perform specific operations. Clearly, this is only an example, but in real life, it's extremely common to find scenarios where the cost of labeling millions of samples is not justified considering the accuracy achieved by semi-supervised methods.

Reinforcement learning

Even if there are no actual supervisors, reinforcement learning is also based on feedback provided by the environment. However, in this case, the information is more qualitative and doesn't help the agent in determining a precise measure of its error. In reinforcement learning, this feedback is usually called reward (sometimes, a negative one is defined as a penalty), and it's useful to understand whether a certain action performed in a state is positive or not. The sequence of most useful actions is a policy that the agent has to learn in order to be able to always make the best decision in terms of the highest immediate and cumulative (future) reward. In other words, an action can also be imperfect, but in terms of a global policy, it has to offer the highest total reward. This concept is based on the idea that a rational agent always pursues the objectives that can increase his/her wealth. The ability to see over a distant horizon is a distinctive mark for advanced agents, while short-sighted ones are often unable to correctly evaluate the consequences of their immediate actions and so their strategies are always sub-optimal.

Reinforcement learning is particularly efficient when the environment is not completely deterministic, when it's often very dynamic, and when it's impossible to have a precise error measure. During the last few years, many classical algorithms have been applied to deep neural networks to learn the best policy for playing Atari video games and to teach an agent how to associate the right action with an input representing the state (usually, this is a screenshot or a memory dump).

In the following diagram, there's a schematic representation of a deep neural network that's been trained to play a famous Atari game:

The generic structure of a deep reinforcement learning architecture

As input, there are one or more subsequent screenshots (this can often be enough to capture the temporal dynamics as well). They are processed using different layers (discussed briefly later) to produce an output that represents the policy for a specific state transition. After applying this policy, the game produces a feedback (as a reward-penalty), and this result is used to refine the output until it becomes stable (so the states are correctly recognized and the suggested action is always the best one) and the total reward overcomes a predefined threshold.

We're going to discuss some examples of reinforcement learning in the Chapter 15, Introducing Neural Networks, and Chapter 16, Advanced Deep Learning Models, dedicated to introducing deep learning and TensorFlow. However, some common examples are as follows:

Automatic robot control
Game solving
Stock trade analysis based on feedback signals

Computational neuroscience

It's not surprising that many machine learning algorithms have been defined and refined thanks to the contribution of research in the field of computational neuroscience. On the other hand, the most diffused adaptive systems are the animals, thanks to their nervous systems, that allow an effective interaction with the environment. From a mechanistic viewpoint, we need to assume that all the processes working inside the gigantic network of neurons are responsible for all computational features, starting from low-level perception and progressing until the highest abstractions, like language, logic reasoning, artistic creation, and so forth.

At the beginning of 1900, Ramón y Cajal and Golgi discovered the structure of the nerve cells, the neurons, but it was necessary to fully understand that their behavior was purely computational. Both scientists drew sketches representing the input units (dendrites), the body (soma), the main channel (axon) and the output gates (synapses), however neither the dynamic nor the learning mechanism of a cell group was fully comprehended. The neuroscientific community was convinced that learning was equivalent to a continuous and structural change, but they weren't able to define exactly what was changing during a learning process. In 1949, the Canadian psychologist Donald Hebb proposed his famous rule (a broader discussion can be found in Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing, 2018 and Theoretical Neuroscience, Dayan P., Abbott L. F., The MIT Press, 2005) that is focused on the synaptic plasticity of the neurons. In other words, the changing element is the number and the nature of the output gates which connect a unit to a large number of other neurons. Hebb understood that if a neuron produces a spike and a synapse propagates it to another neuron which behaves in the same way, the connection is strengthened, otherwise, it's weakened. This can seem like a very simplistic explanation, but it allows you to understand how elementary neural aggregates can perform operations such as detecting the borders of an object, denoising a signal, or even finding the dimensions with maximum variance (PCA).

The research in this field has continued until today, and many companies, together with high-level universities, are currently involved in studying the computational behavior of the brain using the most advanced neuroimaging technologies available. The discoveries are sometimes surprising, as they confirm what was only imagined and never observed. In particular, some areas of the brain can easily manage supervised and unsupervised problems, while others exploit reinforcement learning to predict the most likely future perceptions. For example, an animal quickly learns to associate the sound of steps to the possibility of facing a predator, and learns how to behave consequently. In the same way, the input coming from its eyes are manipulated so as to extract all those pieces of information that are useful to detect the objects. This denoising procedure is extremely common in machine learning and, surprisingly, many algorithms achieve the same goal that an animal brain does! Of course, the complexity of human minds is beyond any complete explanation, but the possibility to double-check these intuitions using computer software has dramatically increased the research speed. At the end of this book, we are going to discuss the basics of deep learning, which is the most advanced branch of machine learning. However, I invite the reader to try to understand all the dynamics (even when they seem very abstract) because the underlying logic is always based on very simple and natural mechanisms, and your brain is very likely to perform the same operations that you're learning while you read!