Machine learning algorithms work with data. They create associations, find out relationships, discover patterns, generate new samples, and more, working with well-defined datasets. Unfortunately, sometimes the assumptions or the conditions imposed on them are not clear, and a lengthy training process can result in a complete validation failure. Even if this condition is stronger in deep learning contexts, we can think of a model as a gray box (some transparency is guaranteed by the simplicity of many common algorithms), where a vectorial input  is transformed into a vectorial output :
In the previous diagram, the model has been represented by a pseudo-function that depends on a set of parameters defined by the vector θ. In this section, we are only considering parametric models, although there's a family of algorithms that are called non-parametric, because they are based only on the structure of the data. We're going to discuss some of them in upcoming chapters.
The task of a parametric learning process is therefore to find the best parameter set that maximizes a target function whose value is proportional to the accuracy (or the error, if we are trying to minimize them) of the model given a specific input X and output Y. This definition is not very rigorous, and it will be improved in the following sections; however, it's useful as a way to understand the context we're working in.
Then, the first question to ask is: What is the nature of X? A machine learning problem is focused on learning abstract relationships that allow a consistent generalization when new samples are provided. More specifically, we can define a stochastic data generating process with an associated joint probability distribution:
Sometimes, it's useful to express the joint probability p(x, y) as a product of the conditional p(y|x), which expresses the probability of a label given a sample, and the marginal probability of the samples p(x). This expression is particularly useful when the prior probability p(x) is known in semi-supervised contexts, or when we are interested in solving problems using the Expectation Maximization (EM) algorithm. We're going to discuss this approach in upcoming chapters.
In many cases, we are not able to derive a precise distribution; however, when considering a dataset, we always assume that it's drawn from the original data-generating distribution. This condition isn't a purely theoretical assumption, because, as we're going to see, whenever our data points are drawn from different distributions, the accuracy of the model can dramatically decrease.
If we sample NÂ independent and identically distributed (i.i.d.) values from pdata, we can create a finite dataset X made up of k-dimensional real vectors:
In a supervised scenario, we also need the corresponding labels (with t output values):
When the output has more than two classes, there are different possible strategies to manage the problem. In classical machine learning, one of the most common approaches is One-vs-All, which is based on training N different binary classifiers where each label is evaluated against all the remaining ones. In this way, N-1Â is performed to determine the right class. With shallow and deep neural networks, instead, it's preferable to use a softmax function to represent the output probability distribution for all classes:
This kind of output (zi represents the intermediate values, and the sum of the terms is normalized to 1) can be easily managed using the cross-entropy cost function (see the corresponding paragraph in the Loss and cost functions section).Â