Understanding predictive analytics

Data Science, predictive analytics, machine learning -- these terms are used in many ways and sometimes overlap each other. What they actually refer to is not always obvious.

Data science is one of the most popular technical domains whose trend erupted after the publication of the often cited Harvard Business Review article of October 2012, Data Scientist: The Sexiest Job of the 21st Century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). Data science can be seen as an evolution from data mining and data analytics. Data mining is about exploring data to discover patterns that potentially lead to decisions and actions at the business level. Data science englobes data analytics and regroups a wider scope of domains, such as statistics, data visualization, predictive analytics, software engineering, and so on, under one very large umbrella.

Predictive analytics is the art of predicting future events based on past observations. It requires your data to be organized in a certain way with predictor variables and outcomes well identified. As the Danish politician Karl Kristian Steincke once said, "Making predictions is difficult especially about the future." (This quote has also been attributed to Niels Bohr, Yogi Berra and others by http://quoteinvestigator.com/2013/10/20/no-predict/). Predictive analytics applications are diverse and far ranging: predicting consumer behavior, natural events (weather, earthquakes, and so on), people's behavior or health, financial markets, industrial applications, and so on. Predictive analytics relies on supervised learning, where data and labels are given to train the model.

Machine learning comprises the tools, methods, and concepts for computers to optimize models used for predictive analytics or other goals.

Machine learning's scope is much larger than predictive analytics. Three different types of machine learning are usually considered:

Supervised learning: Assumes that a certain amount of training data with known outcomes is available and can be used to train the model. Predictive analytics is part of supervised learning.
Unsupervised learning: Is about finding patterns in existing data without knowing the outcome. Clustering customer behavior or reducing the dimensions of the dataset for visualization purposes are examples of unsupervised learning.
Reinforcement learning: Is the third type of machine learning, where agents learn to act on their own when given a set of rules and a specific reward schema. Examples of reinforcement learning applications include AlphaGo, Google's world championship Go algorithm, self-driving cars, and semi-autonomous robots. AlphaGo learned from thousands of past games and was able to beat the world Go champion in March 2016 (https://www.wired.com/2016/03/go-grandmaster-lee-sedol-grabs-consolation-win-googles-ai/). A classic reinforcement learning implementation follows this schema, where an agent adapts its actions on an environment based on the resulting rewards:

The difference between supervised and unsupervised learning in the context of binary classification and clustering is illustrated in the following two figures:

For supervised learning, the original dataset is composed of two classes (squares and circles), and we know from the start to which class each sample belongs. Giving that information to a binary classification algorithm allows for a somewhat optimized separation of the two classes. Once that separating frontier is known, the model (the line) can be used to predict the class of new samples depending on which side the sample ends up being:

In unsupervised learning, the different classes are not known. There is no ground truth. The data is given to an algorithm along with some parameters, such as the number of classes to be found, and the algorithm finds the best set of clusters in the original dataset according to a defined criteria or metric. The results may be very dependent on the initialization parameters. There is no truth, no accuracy, just an interpretation of the data. The following figure shows the results obtained by a clustering algorithm asked to find three classes in the original data:

The reader will notice at this point that the book is titled Amazon Machine Learning and not Amazon Predictive Analytics. This is a bit misleading, as machine learning covers many applications and problems besides predictive analytics. However, calling the service machine learning leaves the door open for Amazon to roll out future services that are not focused on predictive analytics. The following figure maps out the relationships between data science terms:

Building the simplest predictive analytics algorithm

Predictive analytics can be very simple. We introduce a very simple example of a predictive model in the context of binary classification based on a simple threshold.

Imagine that a truck transporting small oranges and large grapefruits runs off the road; all the boxes of fruits open up, and all the fruits end up mixed together. Equipped with a simple weighing scale and a way to roll the fruits out of the truck, you want to be able to separate them automatically based on their weights. You have some information on the average weights of small oranges (96g) and large grapefruits (166g).

According to the USDA, the average weight of a medium-sized orange is 131 grams, while a larger orange weighs approximately 184 grams, and a smaller one around 96 grams.

Large grapefruit (approx 4-1/2'' dia) 166g
Medium grapefruit (approx 4'' dia) 128g
Small grapefruit (approx 3-1/2'' dia) 100g

Your predictive model is the following:

You arbitrarily set a threshold of 130g
You weigh each fruit
If the fruit weighs more than 130g, it's a grapefruit; otherwise it's an orange

There! You have a robust reliable, predictive model that can be applied to all your mixed up fruits to separate them. Note that in this case, you've set the threshold with an educated guess. There was no machine learning involved.

In machine learning, the models learn by themselves. Instead of setting the threshold yourself, you let your program evolve and calculate the weight separation threshold of fruits by itself.

For that, you would set aside a certain number of oranges and grapefruits. This is called the training dataset. It's important that this training dataset has roughly the same number of oranges and grapefruits.

And you let the machine decide the threshold value by itself. A possible algorithm could be along these lines:

Set original weight estimation at w_0 = 100g to initialize and a counter k = 0
For each new fruit in the training dataset, adjust the weight estimation according to the following:

        For each new fruit_weight:
            w(k+1) = (k*w(k) + fruit_weight)/ (k+1)
            k = k+1

Assuming that your training dataset is representative of all the remaining fruits and that you have enough fruits, the threshold would converge under certain conditions to the best average between all the fruit weights. A value which you use to separate all the other fruits depending on whether they weight more or less than the threshold you estimated. The following plot shows the convergence of this crude algorithm to estimate the average weight of the fruits:

This problem is a typical binary classification model. If we had not two but three types of fruits (lemons, oranges, and grapefruit), we would have a multiclass classification problem.

In this example, we only have one predictor: the weight of the fruit. We could add another predictor such as the diameter. This would result in what is called a multivariate classification problem.

In practice, machine learning uses more complex algorithms such as the SGD, the linear algorithm used by Amazon ML. Other classic prediction algorithms include Support Vector Machines, Bayes classifiers, Random forests and so on. Each algorithm has its strength and set of assumptions on the dataset.

Regression versus classification

Amazon ML does two types of predictive analytics: classification and regression.

As discussed in the preceding paragraph, classification is about predicting a finite set of labels or categories for a given set of samples.

In the case of two classes, the problem is called Binary classification
When there are more than two classes and the classes are mutually exclusive, the problem is a multiclass classification problem
If the samples can belong to several classes at once, we talk about a multilabel classification problem

In short, classification is the prediction of a finite set of classes, labels, categories.

Examples of Binary classification are: buying outcome (yes/no), survival outcome (yes/no), anomaly detection (spam, bots), and so on
Examples of multiclass classification are: classifying object in images (fruits, cars, and so on), identifying a music genre, or a movement based on smartphone sensors, document classification and so on

In regression problems, the outcome has continuous values. Predicting age, weight, stock prices, salaries, rainfall, temperature, and so forth are all regression problems. We talk about multiple regression when there are several predictors and multivariate regression when the predictions predict several values for each sample. Amazon ML does univariate regression and classification, both binary and multiclass, but not multilabel.

Expanding regression to classification with logistic regression

Amazon ML uses a linear regression model for regression, binary, and multiclass predictions. Using the logistic regression model extends continuous regression to classification problems.

A simple regression model with one predictor is modeled as follows:

Here, x is the predictor, y is the outcome, and (a, b) are the model parameters. Each predicted value y is continuous and not bounded. How can we use that model to predict classes which are by definition categorical values?

Take the example of binary predictions. The method is to transform the continuous predictions that are not bounded into probabilities, which are all between 0 and 1. We then associate these probabilities to one of the two classes using a predefined threshold. This model is called the logistic regression model–misleading name as logistic regression is a classification model and not a regression one.

To transform continuous not bounded values into probabilities, we use the sigmoid function defined as follows:

This function transforms any real number into a value within the [0,1] interval. Its output can, therefore, be interpreted as a probability:

In conclusion, the way to do binary classification with a regression model is as follows:

Build the regression model, and estimate the real valued outcomes y.
Use the predicted value y as the argument of the sigmoid function. The result f(y) is a probability measure of belonging to one of the two classes.
Set a threshold T in [0,1]. All predicted samples with a probability f(y) > T belong to one class, others belong to the other class. The default value for T = 0.5.

Logistic regression is, by nature, a Binary classifier. There are several strategies to transform a binary classifier into a multi class classifier.

The one versus all (OvA) technique consists in selecting one class as positive and all the others as negative to go back to a binary classification problem. Once the classification on the first class is carried out, a second class is selected as the positive versus all the others as negative. This process is repeated N-1 times when there are N classes to predict. The following set of plots shows:

The original datasets and the classes for all the samples
The result of the first Binary classification (circles versus all the others)
The result of the second classification that separates the squares and the triangles

Extracting features to predict outcomes

That available data needs to be accessible and meaningful in order for the algorithm to extract information.

Let's consider a simple example. Imagine that we want to predict the market price of a house in a given city. We can think of many variables that would be predictors of the price of a house: the number of rooms or bathrooms, the neighborhood, the surface, the heating system, and so on. These variables are called features, attributes, or predictors. The value that we want to predict is called the outcome or the target.

If we want our predictions to be reliable, we need several features. Predicting the price of a house based on its surface alone would not be very efficient. Many other factors influence the price of a house and our dataset should include as many as possible (with conditions).

It's often possible to add large numbers of attributes to a model to try to improve the predictions. For instance, in our housing pricing prediction, we could add all the characteristics of the house (bathroom, superficies, heating system, the number of windows). Some of these variables would bring more information to our pricing model and increase the accuracy of our predictions, while others would just add noise and confuse the algorithm. Adding new variables to a predicting model does not always improve the predictions.

In order to make reliable predictions, each of the new features you bring to your model must bring some valuable piece of information. However, this is also not always the case. As we will see in Chapter 2, Machine Learning Definitions and Concepts, correlated predictors can hurt the performances of the model.

Predictive analytics is built on several assumptions and conditions:

The value you are trying to predict is predictable and not just some random noise.
You have access to data that has some degree of association to the target.
The available dataset is large enough. Reliable predictions cannot be inferred from a dataset that is too small. (For instance, you can define and therefore predict a line with two points but you cannot infer data that follows a sine curve from only two points.)
The new data you will base future predictions on is similar to the one you parameterized and trained your model on.

You may have a great dataset, but that does not mean it will be efficient for predictions.

These conditions on the data are very general. In the case of SGD, the conditions are more constrained.