Practical Machine Learning

Learn how to build Machine Learning applications to solve real-world data analysis challenges with this Machine Learning book – packed with practical tutorials

Product type: Paperback
Published in: Jan 2016
Publisher: Packt
ISBN-13: 9781784399689
Length: 468 pages
Edition: 1st Edition
Author: Sunila Gollapudi

Performance measures

Performance measures are used to evaluate learning algorithms and form an important aspect of machine learning. In some cases, these measures are also used as heuristics to build learning models.

Now let's explore the concept of Probably Approximately Correct (PAC) theory. When we describe the accuracy of a hypothesis, PAC theory distinguishes two types of uncertainty:

  • Approximate: the extent of error that is accepted for a hypothesis
  • Probability: the percentage certainty that the hypothesis is correct

The following graph shows how the number of samples grows with error, probability, and hypothesis:

[Figure: number of samples versus error, probability, and hypothesis]
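
As a rough illustration of how the sample count depends on these quantities, one standard PAC bound for a finite hypothesis class, with a learner that fits the training data perfectly, is m ≥ (1/ε)(ln|H| + ln(1/δ)). The sketch below evaluates it. The choice of this particular bound, the function name `pac_sample_bound`, and the example numbers are assumptions added for illustration; they are not taken from the book's graph. Here ε is the accepted error (the "approximate" part) and 1 − δ is the required certainty (the "probability" part).

```python
import math

def pac_sample_bound(epsilon, delta, hypothesis_count):
    """Samples sufficient under the assumed finite-hypothesis-class PAC bound.

    epsilon: accepted error of the hypothesis (0 < epsilon < 1)
    delta: allowed probability of failure, so certainty is 1 - delta
    hypothesis_count: number of hypotheses in the (finite) class
    """
    raw = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(raw)

# Illustrative values only: 10% error, 95% certainty, 1,000 hypotheses.
baseline = pac_sample_bound(0.1, 0.05, 1000)
tighter_error = pac_sample_bound(0.05, 0.05, 1000)   # halve the accepted error
more_certain = pac_sample_bound(0.1, 0.01, 1000)     # ask for 99% certainty
bigger_class = pac_sample_bound(0.1, 0.05, 10**6)    # richer hypothesis class

print(baseline, tighter_error, more_certain, bigger_class)
```

Under this bound, tightening the accepted error is the most expensive change because the sample count scales with 1/ε, while the certainty and the size of the hypothesis class enter only logarithmically.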

Is the solution good?

The error measures for classification problems and for prediction (forecasting) problems are different. In this section, we will cover some of these error measures, followed by how the errors can be addressed.

In a classification problem, you can have two different types of errors, which can be elegantly represented using the "confusion matrix". Let's say in our target marketing problem, we work on 10,000 customer records to predict which customers are likely to respond to our marketing effort.

After analyzing the campaign, you can construct the following table, where the columns are your predictions and the rows are the real observations:

| Action | Predicted (that there will be a buy) | Predicted (that there will be no buy) |
| --- | --- | --- |
| Actually bought | TP: 500 | FN: 400 |
| Actually did not buy | FP: 100 | TN: 9000 |

On the principal diagonal, we have the buyers and non-buyers for whom the prediction matched reality. These are correct predictions, called true positives and true negatives respectively. In the upper right-hand cell, we have those whom we predicted to be non-buyers, but who in reality are buyers. This error is known as a false negative. In the lower left-hand cell, we have those whom we predicted to be buyers, but who are non-buyers. This error is known as a false positive.

Are both errors equally expensive for the company? Actually no! If we predict that someone is a buyer and they turn out to be a non-buyer, the company at most loses the money spent on a mail or a call. However, if we predict that someone will not buy and they are in fact a buyer, the company will not call them based on this prediction and loses a customer. So, in this case, a false negative is much more expensive than a false positive.

The machine learning community commonly uses three error measures for classification problems:

  • Measure 1: Accuracy is the percent of predictions that were correct.

    Example: The "accuracy" was (9,000+500) out of 10,000 = 95%

  • Measure 2: Recall is the percent of positive cases that you were able to catch. If false negatives are low, recall will be high.

    Example: The "recall" was 500 out of 900 = 55.56%

  • Measure 3: Precision is the percent of positive predictions that were correct. If false positives are low, precision is high.

    Example: The "precision" was 500 out of 600 = 83.33%
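
The three measures can be checked directly from the confusion matrix counts. The following is a minimal sketch using the usual definitions, accuracy = (TP + TN) / total, recall = TP / (TP + FN), and precision = TP / (TP + FP); the variable names are chosen here for illustration.

```python
# Counts from the target marketing confusion matrix.
tp, fn = 500, 400    # actual buyers: predicted buy / predicted no buy
fp, tn = 100, 9000   # actual non-buyers: predicted buy / predicted no buy

total = tp + fn + fp + tn

accuracy = (tp + tn) / total     # share of all predictions that were correct
recall = tp / (tp + fn)          # share of actual buyers that were caught
precision = tp / (tp + fp)       # share of predicted buyers that really bought

print(f"records   = {total}")
print(f"accuracy  = {accuracy:.2%}")
print(f"recall    = {recall:.2%}")
print(f"precision = {precision:.2%}")
```

Note how the 95% accuracy hides the fact that 400 of the 900 actual buyers were missed, which is exactly the expensive false negative case.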

In forecasting, you are predicting a continuous variable. So, the error measures are fairly different here. As usual, the error metrics are obtained by comparing the predictions of the models with the real values of the target variables and calculating the average error. Here are a few metrics.

Mean squared error (MSE)

To compute the MSE, we first take the square of the difference between the actual and predicted values of every record. We then take the average of these squared errors. If the predicted value of the i-th record is $P_i$ and the actual value is $A_i$, then the MSE over n records is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - A_i\right)^2$$

It is also common to use the square root of this quantity, called the root mean square error (RMSE).
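
A minimal sketch of both quantities follows. The function names and the four sample records are made up for illustration.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between predictions and actuals."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)

def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the units of the target."""
    return math.sqrt(mean_squared_error(actual, predicted))

# Hypothetical records: errors are -1, 0, +2, +1.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

mse_value = mean_squared_error(actual, predicted)
rmse_value = root_mean_squared_error(actual, predicted)
print(mse_value, rmse_value)
```

The RMSE is often easier to read than the MSE because it is on the same scale as the variable being predicted.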

Mean absolute error (MAE)

To compute the MAE, we take the absolute difference between the predicted and actual values of every record. We then take the average of those absolute differences over the n records, where $P_i$ is the predicted value and $A_i$ the actual value of the i-th record:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - A_i\right|$$

The choice of performance metric depends on the application. The MSE is a good performance metric for many applications as it has more statistical grounding through its link with variance. On the other hand, the MAE is more intuitive and less sensitive to outliers.

Looking at the MAE and RMSE together gives us additional information about the distribution of the errors. The RMSE is never smaller than the MAE. In regression, if the RMSE is close to the MAE, the errors are all of similar size, so the model makes many relatively small errors. If the RMSE is much larger than the MAE, the model makes a few but large errors.
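
To see what the pair of metrics says about the error distribution, the sketch below compares two hypothetical sets of predictions that have the same MAE: one with evenly spread errors and one with a single large error. The data and function names are invented for illustration.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between predictions and actuals."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

actual = [10.0, 10.0, 10.0, 10.0]
even_predictions = [13.0, 13.0, 13.0, 13.0]    # every record is off by 3
spiky_predictions = [10.0, 10.0, 10.0, 22.0]   # one record is off by 12

mae_even = mean_absolute_error(actual, even_predictions)
rmse_even = root_mean_squared_error(actual, even_predictions)
mae_spiky = mean_absolute_error(actual, spiky_predictions)
rmse_spiky = root_mean_squared_error(actual, spiky_predictions)

print(f"even errors:  MAE={mae_even}, RMSE={rmse_even}")
print(f"spiky errors: MAE={mae_spiky}, RMSE={rmse_spiky}")
```

Both sets have the same MAE, but the RMSE of the second set is twice as large because squaring gives the single large error much more weight.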

Normalized MSE and MAE (NMSE and NMAE)

Neither the MSE nor the MAE indicates how big the error is in relative terms, because both are numeric values that depend on the scale of the target variable. Comparing against a benchmark gives better insight. The common practice is to take the mean of the primary attribute we are predicting and assume that our naïve prediction model always predicts that mean. Then we compute the MSE of both the naïve model and the original model. The ratio shows how good or bad our model is compared to the naïve model:

$$\mathrm{NMSE} = \frac{\mathrm{MSE}_{\text{model}}}{\mathrm{MSE}_{\text{naive}}} = \frac{\sum_{i=1}^{n}\left(P_i - A_i\right)^2}{\sum_{i=1}^{n}\left(\bar{A} - A_i\right)^2}$$

where $\bar{A}$ is the mean of the actual values.

A similar definition can also be used for the MAE.
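
The normalization can be sketched as follows, assuming the naïve model predicts the mean of the actual values for every record. The helper names and the sample data are hypothetical.

```python
def mse(actual, predicted):
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def normalized(metric, actual, predicted):
    """Ratio of the model's error to the error of a predict-the-mean baseline."""
    mean_actual = sum(actual) / len(actual)
    naive_predictions = [mean_actual] * len(actual)
    return metric(actual, predicted) / metric(actual, naive_predictions)

# Hypothetical records; the mean of the actual values is 5.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 9.0]

nmse = normalized(mse, actual, predicted)
nmae = normalized(mae, actual, predicted)
print(f"NMSE={nmse:.3f}, NMAE={nmae:.3f}")
```

A ratio below 1 means the model beats the naïve baseline, and a ratio near or above 1 means it adds little or nothing over always predicting the mean.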

Solving the errors: bias and variance

This trap of building highly customized, higher-order models is called over-fitting and is a critical concept. The resulting error is known as the variance of the model. Essentially, if we had taken a different training set, we would have obtained a very different model. Variance is a measure of how strongly the model depends on the training set. By the way, the model you see on the rightmost side (linear fit) is an example of under-fitting, and the error caused by under-fitting is called bias. In an under-fitting or high-bias situation, the model does not explain the relationship in the data. Essentially, we're trying to fit an overly simplistic hypothesis, for example, a linear one where we should be looking for a higher-order polynomial.

To avoid the trap of over-fitting and under-fitting, data scientists build the model on a training set and then measure the error on a test set. They keep refining the model as long as the error on the test set keeps coming down. Once the model starts getting customized to the training data, the error on the test set starts going up, and they stop refining the model at that point.
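
The procedure can be sketched with polynomial fits of increasing order. This is only an illustration under stated assumptions: the data is synthetic (a quadratic plus Gaussian noise with a fixed seed), and the least-squares fit is a small pure-Python solver written for this example (`fit_polynomial` is not a library function).

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (pure Python)."""
    n = degree + 1
    # Augmented system [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    rows = [[sum(x ** (i + j) for x in xs) for j in range(n)]
            + [sum(y * x ** i for x, y in zip(xs, ys))]
            for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution; coeffs[k] multiplies x**k.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(rows[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (rows[i][n] - tail) / rows[i][i]
    return coeffs

def mse(coeffs, xs, ys):
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable

def noisy_quadratic(x):
    # Assumed ground truth for this sketch: a quadratic plus Gaussian noise.
    return 1.0 + 2.0 * x - 3.0 * x * x + rng.gauss(0.0, 0.3)

train_xs = [-1.0 + 2.0 * i / 14 for i in range(15)]
test_xs = [-0.95 + 1.9 * i / 14 for i in range(15)]
train_ys = [noisy_quadratic(x) for x in train_xs]
test_ys = [noisy_quadratic(x) for x in test_xs]

train_err, test_err = {}, {}
for degree in range(6):
    coeffs = fit_polynomial(train_xs, train_ys, degree)
    train_err[degree] = mse(coeffs, train_xs, train_ys)
    test_err[degree] = mse(coeffs, test_xs, test_ys)
    print(f"degree {degree}: train={train_err[degree]:.4f} "
          f"test={test_err[degree]:.4f}")

# Keep the complexity at which the held-out error is lowest.
best_degree = min(test_err, key=test_err.get)
print("selected degree:", best_degree)
```

The training error can only shrink as the degree rises, so it cannot be used to pick the model; the selection is made on the test error alone.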

Let's analyze bias and variance a bit more in this chapter and learn a few practical ways of dealing with them. The error in any model can be represented as a combination of bias, variance, and random error:

$$\mathrm{Err}(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \text{Irreducible Error}$$

In less complex models, the bias term is high, and in models with higher complexity, the variance term is high, as shown in the following figure:

[Figure: bias and variance terms of the error versus model complexity]

To reduce bias or variance, let's first ask this question. If a model has a high bias, how does its error vary as a function of the amount of data?

At a very low data size, any model can fit the data well (any model fits a single point, any linear model can fit two points, a quadratic can fit three points, and so on). So, the error of a high-bias model on the training set starts out minuscule and goes up as data points are added. On the test set, however, the error is high initially because the model is highly customized to the small training set. As more data is used and the model gets more refined, the test error comes down and approaches that of the training set.

The following graph depicts the situation clearly:

[Figure: training and test error versus data size for a high-bias model]
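
The same behaviour can be reproduced numerically. The sketch below fits a straight line (a deliberately too-simple, high-bias model) to noise-free samples of an assumed quadratic target and records the training and test error at several training-set sizes. The target function, the sizes, and the helper names are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def line_mse(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def target(x):
    return x * x  # assumed ground truth: a line cannot represent it

# Fixed test grid on [0, 1].
test_xs = [i / 50 for i in range(51)]
test_ys = [target(x) for x in test_xs]

curve = {}
for n in (2, 3, 5, 10, 50):
    # n evenly spaced training points on [0, 1].
    train_xs = [i / (n - 1) for i in range(n)]
    train_ys = [target(x) for x in train_xs]
    model = fit_line(train_xs, train_ys)
    curve[n] = (line_mse(model, train_xs, train_ys),
                line_mse(model, test_xs, test_ys))
    print(f"n={n:2d}: train={curve[n][0]:.5f} test={curve[n][1]:.5f}")
```

With two points the line fits the training data exactly while the test error is at its highest; as points are added the two errors move toward each other and level off at a value that more data cannot reduce.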

The remedy for this situation could be one of the following:

  • Most likely, you are working with very few features, so you must find more features
  • Increase the complexity of the model by increasing polynomials and depth
  • Increasing the data size will not be of much help if the model has a high bias

[Figure: bias and variance, the high-variance case]

When you face the opposite, high-variance situation, you can try the following remedies (the reverse of the previous ones):

  • Most likely, you are working with too many features, so you must reduce the number of features
  • Decrease the complexity of the model
  • Increasing the data size will be of some help