Now that we know how to build a regressor, it's important to understand how to evaluate its quality as well. In this context, an error is defined as the difference between the actual value and the value predicted by the regressor.
Computing regression accuracy
Getting ready
Let's quickly take a look at the metrics that can be used to measure the quality of a regressor. The scikit-learn library provides the sklearn.metrics module, which includes score functions, performance metrics, pairwise metrics, and distance computations, and it covers all the metrics used in this recipe.
How to do it...
Let's see how to compute regression accuracy in Python:
- Now we will use the functions available in sklearn.metrics to evaluate the performance of the linear regression model we developed in the previous recipe:
import sklearn.metrics as sm
print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explain variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))
The following results are returned:
Mean absolute error = 241907.27
Mean squared error = 81974851872.13
Median absolute error = 240861.94
Explained variance score = 0.98
R2 score = 0.98
An R2 score near 1 means that the model is able to predict the data very well. Keeping track of every single metric can get tedious, so we pick one or two metrics to evaluate our model. A good practice is to make sure that the mean squared error is low and the explained variance score is high.
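If you want to try these evaluation calls outside of the previous recipe, the following is a minimal, self-contained sketch. The synthetic data, the LinearRegression model, and the train/test split are assumptions made purely for illustration, so the numbers it prints will not match the results shown above:
import numpy as np
import sklearn.metrics as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a noisy linear relationship y = 3x + 7
rng = np.random.RandomState(0)
X = rng.uniform(-10, 10, size=(200, 1))
y = 3 * X.ravel() + 7 + rng.normal(scale=2.0, size=200)

# Split, fit, and predict, mirroring the structure of the previous recipe
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)

# The same evaluation calls used above
print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))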
How it works...
A regressor can be evaluated using many different metrics, such as the following (a short sketch that computes each of them by hand appears after this list):
- Mean absolute error: This is the average of absolute errors of all the data points in the given dataset.
- Mean squared error: This is the average of the squares of the errors of all the data points in the given dataset. It is one of the most popular metrics out there!
- Median absolute error: This is the median of all the errors in the given dataset. The main advantage of this metric is that it's robust to outliers. A single bad point in the test dataset wouldn't skew the entire error metric, as opposed to a mean error metric.
- Explained variance score: This score measures how well our model can account for the variation in our dataset. A score of 1.0 indicates that our model is perfect.
- R2 score: This is pronounced as R-squared, and this score refers to the coefficient of determination. This tells us how well the unknown samples will be predicted by our model. The best possible score is 1.0, but the score can be negative as well, because a model can perform arbitrarily worse than simply predicting the mean of the data (which would score 0.0).
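To make these definitions concrete, here is a small NumPy sketch that computes each metric by hand and compares the results against the corresponding sklearn.metrics functions; the values in y_true and y_pred are invented for illustration only:
import numpy as np
import sklearn.metrics as sm

# Made-up ground truth and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5, 0.0, 2.1, 7.8, 4.0])
errors = y_true - y_pred

# Mean absolute error: average of |error|
mae = np.mean(np.abs(errors))
# Mean squared error: average of error^2
mse = np.mean(errors ** 2)
# Median absolute error: median of |error| (robust to outliers)
medae = np.median(np.abs(errors))
# Explained variance score: 1 - Var(error) / Var(y_true)
evs = 1 - np.var(errors) / np.var(y_true)
# R2 score: 1 - (sum of squared errors) / (total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

print(mae, "==", sm.mean_absolute_error(y_true, y_pred))
print(mse, "==", sm.mean_squared_error(y_true, y_pred))
print(medae, "==", sm.median_absolute_error(y_true, y_pred))
print(evs, "==", sm.explained_variance_score(y_true, y_pred))
print(r2, "==", sm.r2_score(y_true, y_pred))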
There's more...
The sklearn.metrics module contains a series of simple functions that measure prediction error, following a simple naming convention (illustrated in the sketch after this list):
- Functions ending with _score return a value to maximize; the higher the better
- Functions ending with _error or _loss return a value to minimize; the lower the better
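As a small illustration of this convention, the following sketch (with made-up arrays) compares two sets of predictions for the same targets: the _score function increases as the predictions improve, while the _error function decreases:
import numpy as np
import sklearn.metrics as sm

# Two candidate predictions for the same targets (made-up numbers);
# the second set is closer to the truth than the first.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
worse = np.array([0.0, 1.0, 5.0, 6.0])
better = np.array([1.1, 2.1, 2.9, 4.2])

# r2_score ends with _score: higher is better
print(sm.r2_score(y_true, worse), "<", sm.r2_score(y_true, better))

# mean_squared_error ends with _error: lower is better
print(sm.mean_squared_error(y_true, worse), ">", sm.mean_squared_error(y_true, better))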
See also
- Scikit-learn's official documentation of the sklearn.metrics module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
- Regression Analysis with R, Giuseppe Ciaburro, Packt Publishing.