Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Python: Deeper Insights into Machine Learning

You're reading from   Python: Deeper Insights into Machine Learning Deeper Insights into Machine Learning

Arrow left icon
Product type Course
Published in Aug 2016
Publisher Packt
ISBN-13 9781787128576
Length 901 pages
Edition 1st Edition
Languages
Arrow right icon
Authors (3):
Arrow left icon
John Hearty John Hearty
Author Profile Icon John Hearty
John Hearty
Sebastian Raschka Sebastian Raschka
Author Profile Icon Sebastian Raschka
Sebastian Raschka
David Julian David Julian
Author Profile Icon David Julian
David Julian
Arrow right icon
View More author details
Toc

Table of Contents (6) Chapters Close

Preface 1. Module 1 FREE CHAPTER 2. Module 2 3. Module 3 A. Biblography
Index

Chapter 10. Predicting Continuous Target Variables with Regression Analysis

Throughout the previous chapters, you learned a lot about the main concepts behind supervised learning and trained many different models for classification tasks to predict group memberships or categorical variables. In this chapter, we will take a dive into another subcategory of supervised learning: regression analysis.

Regression models are used to predict target variables on a continuous scale, which makes them attractive for addressing many questions in science as well as applications in industry, such as understanding relationships between variables, evaluating trends, or making forecasts. One example would be predicting the sales of a company in future months.

In this chapter, we will discuss the main concepts of regression models and cover the following topics:

  • Exploring and visualizing datasets
  • Looking at different approaches to implement linear regression models
  • Training regression models that are robust to outliers
  • Evaluating regression models and diagnosing common problems
  • Fitting regression models to nonlinear data

Introducing a simple linear regression model

The goal of simple (univariate) linear regression is to model the relationship between a single feature (explanatory variable x) and a continuous valued response (target variable y). The equation of a linear model with one explanatory variable is defined as follows:

Introducing a simple linear regression model

Here, the weight Introducing a simple linear regression model represents the y axis intercepts and Introducing a simple linear regression model is the coefficient of the explanatory variable. Our goal is to learn the weights of the linear equation to describe the relationship between the explanatory variable and the target variable, which can then be used to predict the responses of new explanatory variables that were not part of the training dataset.

Based on the linear equation that we defined previously, linear regression can be understood as finding the best-fitting straight line through the sample points, as shown in the following figure:

Introducing a simple linear regression model

This best-fitting line is also called the regression line, and the vertical lines from the regression line to the sample points are the so-called offsets or residuals—the errors of our prediction.

The special case of one explanatory variable is also called simple linear regression, but of course we can also generalize the linear regression model to multiple explanatory variables. Hence, this process is called multiple linear regression:

Introducing a simple linear regression model

Here, Introducing a simple linear regression model is the y axis intercept with Introducing a simple linear regression model.

Exploring the Housing Dataset

Before we implement our first linear regression model, we will introduce a new dataset, the Housing Dataset, which contains information about houses in the suburbs of Boston collected by D. Harrison and D.L. Rubinfeld in 1978. The Housing Dataset has been made freely available and can be downloaded from the UCI machine learning repository at https://archive.ics.uci.edu/ml/datasets/Housing.

The features of the 506 samples may be summarized as shown in the excerpt of the dataset description:

  • CRIM: This is the per capita crime rate by town
  • ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft.
  • INDUS: This is the proportion of non-retail business acres per town
  • CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
  • NOX: This is the nitric oxides concentration (parts per 10 million)
  • RM: This is the average number of rooms per dwelling
  • AGE: This is the proportion of owner-occupied units built prior to 1940
  • DIS: This is the weighted distances to five Boston employment centers
  • RAD: This is the index of accessibility to radial highways
  • TAX: This is the full-value property-tax rate per $10,000
  • PTRATIO: This is the pupil-teacher ratio by town
  • B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African American descent by town
  • LSTAT: This is the percentage lower status of the population
  • MEDV: This is the median value of owner-occupied homes in $1000s

For the rest of this chapter, we will regard the housing prices (MEDV) as our target variable—the variable that we want to predict using one or more of the 13 explanatory variables. Before we explore this dataset further, let's fetch it from the UCI repository into a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data', 
...                 header=None, sep='\s+')
>>> df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 
...              'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
...              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
>>> df.head()

To confirm that the dataset was loaded successfully, we displayed the first five lines of the dataset, as shown in the following screenshot:

Exploring the Housing Dataset

Visualizing the important characteristics of a dataset

Exploratory Data Analysis (EDA) is an important and recommended first step prior to the training of a machine learning model. In the rest of this section, we will use some simple yet useful techniques from the graphical EDA toolbox that may help us to visually detect the presence of outliers, the distribution of the data, and the relationships between features.

First, we will create a scatterplot matrix that allows us to visualize the pair-wise correlations between the different features in this dataset in one place. To plot the scatterplot matrix, we will use the pairplot function from the seaborn library (http://stanford.edu/~mwaskom/software/seaborn/), which is a Python library for drawing statistical plots based on matplotlib:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set(style='whitegrid', context='notebook')
>>> cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
>>> sns.pairplot(df[cols], size=2.5);
>>> plt.show()

As we can see in the following figure, the scatterplot matrix provides us with a useful graphical summary of the relationships in a dataset:

Visualizing the important characteristics of a dataset

Note

Importing the seaborn library modifies the default aesthetics of matplotlib for the current Python session. If you do not want to use seaborn's style settings, you can reset the matplotlib settings by executing the following command:

>>> sns.reset_orig()

Due to space constraints and for purposes of readability, we only plotted five columns from the dataset: LSTAT, INDUS, NOX, RM, and MEDV. However, you are encouraged to create a scatterplot matrix of the whole DataFrame to further explore the data.

Using this scatterplot matrix, we can now quickly eyeball how the data is distributed and whether it contains outliers. For example, we can see that there is a linear relationship between RM and the housing prices MEDV (the fifth column of the fourth row). Furthermore, we can see in the histogram (the lower right subplot in the scatter plot matrix) that the MEDV variable seems to be normally distributed but contains several outliers.

Note

Note that in contrast to common belief, training a linear regression model does not require that the explanatory or target variables are normally distributed. The normality assumption is only a requirement for certain statistical tests and hypothesis tests that are beyond the scope of this book (Montgomery, D. C., Peck, E. A., and Vining, G. G. Introduction to linear regression analysis. John Wiley and Sons, 2012, pp.318–319).

To quantify the linear relationship between the features, we will now create a correlation matrix. A correlation matrix is closely related to the covariance matrix that we have seen in the section about principal component analysis (PCA) in Chapter 4, Building Good Training Sets – Data Preprocessing. Intuitively, we can interpret the correlation matrix as a rescaled version of the covariance matrix. In fact, the correlation matrix is identical to a covariance matrix computed from standardized data.

The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson's r), which measure the linear dependence between pairs of features. The correlation coefficients are bounded to the range -1 and 1. Two features have a perfect positive correlation if Visualizing the important characteristics of a dataset, no correlation if Visualizing the important characteristics of a dataset, and a perfect negative correlation if Visualizing the important characteristics of a dataset, respectively. As mentioned previously, Pearson's correlation coefficient can simply be calculated as the covariance between two features Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset (numerator) divided by the product of their standard deviations (denominator):

Visualizing the important characteristics of a dataset

Here, Visualizing the important characteristics of a dataset denotes the sample mean of the corresponding feature, Visualizing the important characteristics of a dataset is the covariance between the features Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset, and Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset are the features' standard deviations, respectively.

Note

We can show that the covariance between standardized features is in fact equal to their linear correlation coefficient.

Let's first standardize the features Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset, to obtain their z-scores which we will denote as Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset, respectively:

Visualizing the important characteristics of a dataset

Remember that we calculate the (population) covariance between two features as follows:

Visualizing the important characteristics of a dataset

Since standardization centers a feature variable at mean 0, we can now calculate the covariance between the scaled features as follows:

Visualizing the important characteristics of a dataset

Through resubstitution, we get the following result:

Visualizing the important characteristics of a dataset
Visualizing the important characteristics of a dataset

We can simplify it as follows:

Visualizing the important characteristics of a dataset

In the following code example, we will use NumPy's corrcoef function on the five feature columns that we previously visualized in the scatterplot matrix, and we will use seaborn's heatmap function to plot the correlation matrix array as a heat map:

>>> import numpy as np
>>> cm = np.corrcoef(df[cols].values.T)
>>> sns.set(font_scale=1.5)
>>> hm = sns.heatmap(cm, 
...            cbar=True,
...            annot=True, 
...            square=True,
...            fmt='.2f',
...            annot_kws={'size': 15},
...            yticklabels=cols,
...            xticklabels=cols)
>>> plt.show()

As we can see in the resulting figure, the correlation matrix provides us with another useful summary graphic that can help us to select features based on their respective linear correlations:

Visualizing the important characteristics of a dataset

To fit a linear regression model, we are interested in those features that have a high correlation with our target variable MEDV. Looking at the preceding correlation matrix, we see that our target variable MEDV shows the largest correlation with the LSTAT variable (-0.74). However, as you might remember from the scatterplot matrix, there is a clear nonlinear relationship between LSTAT and MEDV. On the other hand, the correlation between RM and MEDV is also relatively high (0.70) and given the linear relationship between those two variables that we observed in the scatterplot, RM seems to be a good choice for an exploratory variable to introduce the concepts of a simple linear regression model in the following section.

Visualizing the important characteristics of a dataset

Exploratory Data Analysis (EDA) is an important and recommended first step prior to the training of a machine learning model. In the rest of this section, we will use some simple yet useful techniques from the graphical EDA toolbox that may help us to visually detect the presence of outliers, the distribution of the data, and the relationships between features.

First, we will create a scatterplot matrix that allows us to visualize the pair-wise correlations between the different features in this dataset in one place. To plot the scatterplot matrix, we will use the pairplot function from the seaborn library (http://stanford.edu/~mwaskom/software/seaborn/), which is a Python library for drawing statistical plots based on matplotlib:

>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set(style='whitegrid', context='notebook')
>>> cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
>>> sns.pairplot(df[cols], size=2.5);
>>> plt.show()

As we can see in the following figure, the scatterplot matrix provides us with a useful graphical summary of the relationships in a dataset:

Visualizing the important characteristics of a dataset

Note

Importing the seaborn library modifies the default aesthetics of matplotlib for the current Python session. If you do not want to use seaborn's style settings, you can reset the matplotlib settings by executing the following command:

>>> sns.reset_orig()

Due to space constraints and for purposes of readability, we only plotted five columns from the dataset: LSTAT, INDUS, NOX, RM, and MEDV. However, you are encouraged to create a scatterplot matrix of the whole DataFrame to further explore the data.

Using this scatterplot matrix, we can now quickly eyeball how the data is distributed and whether it contains outliers. For example, we can see that there is a linear relationship between RM and the housing prices MEDV (the fifth column of the fourth row). Furthermore, we can see in the histogram (the lower right subplot in the scatter plot matrix) that the MEDV variable seems to be normally distributed but contains several outliers.

Note

Note that in contrast to common belief, training a linear regression model does not require that the explanatory or target variables are normally distributed. The normality assumption is only a requirement for certain statistical tests and hypothesis tests that are beyond the scope of this book (Montgomery, D. C., Peck, E. A., and Vining, G. G. Introduction to linear regression analysis. John Wiley and Sons, 2012, pp.318–319).

To quantify the linear relationship between the features, we will now create a correlation matrix. A correlation matrix is closely related to the covariance matrix that we have seen in the section about principal component analysis (PCA) in Chapter 4, Building Good Training Sets – Data Preprocessing. Intuitively, we can interpret the correlation matrix as a rescaled version of the covariance matrix. In fact, the correlation matrix is identical to a covariance matrix computed from standardized data.

The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson's r), which measure the linear dependence between pairs of features. The correlation coefficients are bounded to the range -1 and 1. Two features have a perfect positive correlation if Visualizing the important characteristics of a dataset, no correlation if Visualizing the important characteristics of a dataset, and a perfect negative correlation if Visualizing the important characteristics of a dataset, respectively. As mentioned previously, Pearson's correlation coefficient can simply be calculated as the covariance between two features Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset (numerator) divided by the product of their standard deviations (denominator):

Visualizing the important characteristics of a dataset

Here, Visualizing the important characteristics of a dataset denotes the sample mean of the corresponding feature, Visualizing the important characteristics of a dataset is the covariance between the features Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset, and Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset are the features' standard deviations, respectively.

Note

We can show that the covariance between standardized features is in fact equal to their linear correlation coefficient.

Let's first standardize the features Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset, to obtain their z-scores which we will denote as Visualizing the important characteristics of a dataset and Visualizing the important characteristics of a dataset, respectively:

Visualizing the important characteristics of a dataset

Remember that we calculate the (population) covariance between two features as follows:

Visualizing the important characteristics of a dataset

Since standardization centers a feature variable at mean 0, we can now calculate the covariance between the scaled features as follows:

Visualizing the important characteristics of a dataset

Through resubstitution, we get the following result:

Visualizing the important characteristics of a dataset
Visualizing the important characteristics of a dataset

We can simplify it as follows:

Visualizing the important characteristics of a dataset

In the following code example, we will use NumPy's corrcoef function on the five feature columns that we previously visualized in the scatterplot matrix, and we will use seaborn's heatmap function to plot the correlation matrix array as a heat map:

>>> import numpy as np
>>> cm = np.corrcoef(df[cols].values.T)
>>> sns.set(font_scale=1.5)
>>> hm = sns.heatmap(cm, 
...            cbar=True,
...            annot=True, 
...            square=True,
...            fmt='.2f',
...            annot_kws={'size': 15},
...            yticklabels=cols,
...            xticklabels=cols)
>>> plt.show()

As we can see in the resulting figure, the correlation matrix provides us with another useful summary graphic that can help us to select features based on their respective linear correlations:

Visualizing the important characteristics of a dataset

To fit a linear regression model, we are interested in those features that have a high correlation with our target variable MEDV. Looking at the preceding correlation matrix, we see that our target variable MEDV shows the largest correlation with the LSTAT variable (-0.74). However, as you might remember from the scatterplot matrix, there is a clear nonlinear relationship between LSTAT and MEDV. On the other hand, the correlation between RM and MEDV is also relatively high (0.70) and given the linear relationship between those two variables that we observed in the scatterplot, RM seems to be a good choice for an exploratory variable to introduce the concepts of a simple linear regression model in the following section.

Implementing an ordinary least squares linear regression model

At the beginning of this chapter, we discussed that linear regression can be understood as finding the best-fitting straight line through the sample points of our training data. However, we have neither defined the term best-fitting nor have we discussed the different techniques of fitting such a model. In the following subsections, we will fill in the missing pieces of this puzzle using the Ordinary Least Squares (OLS) method to estimate the parameters of the regression line that minimizes the sum of the squared vertical distances (residuals or errors) to the sample points.

Solving regression for regression parameters with gradient descent

Consider our implementation of the ADAptive LInear NEuron (Adaline) from Chapter 2, Training Machine Learning Algorithms for Classification; we remember that the artificial neuron uses a linear activation function and we defined a cost function Solving regression for regression parameters with gradient descent, which we minimized to learn the weights via optimization algorithms, such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD). This cost function in Adaline is the Sum of Squared Errors (SSE). This is identical to the OLS cost function that we defined:

Solving regression for regression parameters with gradient descent

Here, Solving regression for regression parameters with gradient descent is the predicted value Solving regression for regression parameters with gradient descent (note that the term 1/2 is just used for convenience to derive the update rule of GD). Essentially, OLS linear regression can be understood as Adaline without the unit step function so that we obtain continuous target values instead of the class labels -1 and 1. To demonstrate the similarity, let's take the GD implementation of Adaline from Chapter 2, Training Machine Learning Algorithms for Classification, and remove the unit step function to implement our first linear regression model:

class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)

If you need a refresher about how the weights are being updated—taking a step in the opposite direction of the gradient—please revisit the Adaline section in Chapter 2, Training Machine Learning Algorithms for Classification.

To see our LinearRegressionGD regressor in action, let's use the RM (number of rooms) variable from the Housing Data Set as the explanatory variable to train a model that can predict MEDV (the housing prices). Furthermore, we will standardize the variables for better convergence of the GD algorithm. The code is as follows:

>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y)
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)

We discussed in Chapter 2, Training Machine Learning Algorithms for Classification, that it is always a good idea to plot the cost as a function of the number of epochs (passes over the training dataset) when we are using optimization algorithms, such as gradient descent, to check for convergence. To cut a long story short, let's plot the cost against the number of epochs to check if the linear regression has converged:

>>> plt.plot(range(1, lr.n_iter+1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()

As we can see in the following plot, the GD algorithm converged after the fifth epoch:

Solving regression for regression parameters with gradient descent

Next, let's visualize how well the linear regression line fits the training data. To do so, we will define a simple helper function that will plot a scatterplot of the training samples and add the regression line:

>>> def lin_regplot(X, y, model):
...     plt.scatter(X, y, c='blue')
...     plt.plot(X, model.predict(X), color='red')    
...     return None

Now, we will use this lin_regplot function to plot the number of rooms against house prices:

>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()

As we can see in the following plot, the linear regression line reflects the general trend that house prices tend to increase with the number of rooms:

Solving regression for regression parameters with gradient descent

Although this observation makes intuitive sense, the data also tells us that the number of rooms does not explain the house prices very well in many cases. Later in this chapter, we will discuss how to quantify the performance of a regression model. Interestingly, we also observe a curious line Solving regression for regression parameters with gradient descent, which suggests that the prices may have been clipped. In certain applications, it may also be important to report the predicted outcome variables on its original scale. To scale the predicted price outcome back on the Price in $1000's axes, we can simply apply the inverse_transform method of the StandardScaler:

>>> num_rooms_std = sc_x.transform([5.0]) 
>>> price_std = lr.predict(num_rooms_std)
>>> print("Price in $1000's: %.3f" % \
...       sc_y.inverse_transform(price_std))
Price in $1000's: 10.840

In the preceding code example, we used the previously trained linear regression model to predict the price of a house with five rooms. According to our model, such a house is worth $10,840.

On a side note, it is also worth mentioning that we technically don't have to update the weights of the intercept if we are working with standardized variables since the y axis intercept is always 0 in those cases. We can quickly confirm this by printing the weights:

>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000

Estimating the coefficient of a regression model via scikit-learn

In the previous section, we implemented a working model for regression analysis. However, in a real-world application, we may be interested in more efficient implementations, for example, scikit-learn's LinearRegression object that makes use of the LIBLINEAR library and advanced optimization algorithms that work better with unstandardized variables. This is sometimes desirable for certain applications:

>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671

As we can see by executing the preceding code, scikit-learn's LinearRegression model fitted with the unstandardized RM and MEDV variables yielded different model coefficients. Let's compare it to our own GD implementation by plotting MEDV against RM:

>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()

Now, when we plot the training data and our fitted model by executing the code above, we can see that the overall result looks identical to our GD implementation:

Estimating the coefficient of a regression model via scikit-learn

Note

As an alternative to using machine learning libraries, there is a closed-form solution for solving OLS involving a system of linear equations that can be found in most introductory statistics textbooks:

Estimating the coefficient of a regression model via scikit-learn
Estimating the coefficient of a regression model via scikit-learn

Here, Estimating the coefficient of a regression model via scikit-learn is the mean of the true target values and Estimating the coefficient of a regression model via scikit-learn is the mean of the predicted response.

The advantage of this method is that it is guaranteed to find the optimal solution analytically. However, if we are working with very large datasets, it can be computationally too expensive to invert the matrix in this formula (sometimes also called the normal equation) or the sample matrix may be singular (non-invertible), which is why we may prefer iterative methods in certain cases.

If you are interested in more information on how to obtain the normal equations, I recommend you take a look at Dr. Stephen Pollock's chapter, The Classical Linear Regression Model from his lectures at the University of Leicester, which are available for free at http://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/06mesmet.pdf.

Solving regression for regression parameters with gradient descent

Consider our implementation of the ADAptive LInear NEuron (Adaline) from Chapter 2, Training Machine Learning Algorithms for Classification; we remember that the artificial neuron uses a linear activation function and we defined a cost function Solving regression for regression parameters with gradient descent, which we minimized to learn the weights via optimization algorithms, such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD). This cost function in Adaline is the Sum of Squared Errors (SSE). This is identical to the OLS cost function that we defined:

Solving regression for regression parameters with gradient descent

Here, Solving regression for regression parameters with gradient descent is the predicted value Solving regression for regression parameters with gradient descent (note that the term 1/2 is just used for convenience to derive the update rule of GD). Essentially, OLS linear regression can be understood as Adaline without the unit step function so that we obtain continuous target values instead of the class labels -1 and 1. To demonstrate the similarity, let's take the GD implementation of Adaline from Chapter 2, Training Machine Learning Algorithms for Classification, and remove the unit step function to implement our first linear regression model:

class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)

If you need a refresher about how the weights are being updated—taking a step in the opposite direction of the gradient—please revisit the Adaline section in Chapter 2, Training Machine Learning Algorithms for Classification.

To see our LinearRegressionGD regressor in action, let's use the RM (number of rooms) variable from the Housing Data Set as the explanatory variable to train a model that can predict MEDV (the housing prices). Furthermore, we will standardize the variables for better convergence of the GD algorithm. The code is as follows:

>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y)
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)

We discussed in Chapter 2, Training Machine Learning Algorithms for Classification, that it is always a good idea to plot the cost as a function of the number of epochs (passes over the training dataset) when we are using optimization algorithms, such as gradient descent, to check for convergence. To cut a long story short, let's plot the cost against the number of epochs to check if the linear regression has converged:

>>> plt.plot(range(1, lr.n_iter+1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()

As we can see in the following plot, the GD algorithm converged after the fifth epoch:

Solving regression for regression parameters with gradient descent

Next, let's visualize how well the linear regression line fits the training data. To do so, we will define a simple helper function that will plot a scatterplot of the training samples and add the regression line:

>>> def lin_regplot(X, y, model):
...     plt.scatter(X, y, c='blue')
...     plt.plot(X, model.predict(X), color='red')    
...     return None

Now, we will use this lin_regplot function to plot the number of rooms against house prices:

>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()

As we can see in the following plot, the linear regression line reflects the general trend that house prices tend to increase with the number of rooms:

Solving regression for regression parameters with gradient descent

Although this observation makes intuitive sense, the data also tells us that the number of rooms does not explain the house prices very well in many cases. Later in this chapter, we will discuss how to quantify the performance of a regression model. Interestingly, we also observe a curious line Solving regression for regression parameters with gradient descent, which suggests that the prices may have been clipped. In certain applications, it may also be important to report the predicted outcome variables on its original scale. To scale the predicted price outcome back on the Price in $1000's axes, we can simply apply the inverse_transform method of the StandardScaler:

>>> num_rooms_std = sc_x.transform([5.0]) 
>>> price_std = lr.predict(num_rooms_std)
>>> print("Price in $1000's: %.3f" % \
...       sc_y.inverse_transform(price_std))
Price in $1000's: 10.840

In the preceding code example, we used the previously trained linear regression model to predict the price of a house with five rooms. According to our model, such a house is worth $10,840.

On a side note, it is also worth mentioning that we technically don't have to update the weights of the intercept if we are working with standardized variables since the y axis intercept is always 0 in those cases. We can quickly confirm this by printing the weights:

>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000

Estimating the coefficient of a regression model via scikit-learn

In the previous section, we implemented a working model for regression analysis. However, in a real-world application, we may be interested in more efficient implementations, for example, scikit-learn's LinearRegression object that makes use of the LIBLINEAR library and advanced optimization algorithms that work better with unstandardized variables. This is sometimes desirable for certain applications:

>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671

As we can see by executing the preceding code, scikit-learn's LinearRegression model fitted with the unstandardized RM and MEDV variables yielded different model coefficients. Let's compare it to our own GD implementation by plotting MEDV against RM:

>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()

Now, when we plot the training data and our fitted model by executing the code above, we can see that the overall result looks identical to our GD implementation:

Estimating the coefficient of a regression model via scikit-learn

Note

As an alternative to using machine learning libraries, there is a closed-form solution for solving OLS involving a system of linear equations that can be found in most introductory statistics textbooks:

Estimating the coefficient of a regression model via scikit-learn
Estimating the coefficient of a regression model via scikit-learn

Here, Estimating the coefficient of a regression model via scikit-learn is the mean of the true target values and Estimating the coefficient of a regression model via scikit-learn is the mean of the predicted response.

The advantage of this method is that it is guaranteed to find the optimal solution analytically. However, if we are working with very large datasets, it can be computationally too expensive to invert the matrix in this formula (sometimes also called the normal equation) or the sample matrix may be singular (non-invertible), which is why we may prefer iterative methods in certain cases.

If you are interested in more information on how to obtain the normal equations, I recommend you take a look at Dr. Stephen Pollock's chapter, The Classical Linear Regression Model from his lectures at the University of Leicester, which are available for free at http://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/06mesmet.pdf.

Estimating the coefficient of a regression model via scikit-learn

In the previous section, we implemented a working model for regression analysis. However, in a real-world application, we may be interested in more efficient implementations, for example, scikit-learn's LinearRegression object that makes use of the LIBLINEAR library and advanced optimization algorithms that work better with unstandardized variables. This is sometimes desirable for certain applications:

>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671

As we can see by executing the preceding code, scikit-learn's LinearRegression model fitted with the unstandardized RM and MEDV variables yielded different model coefficients. Let's compare it to our own GD implementation by plotting MEDV against RM:

>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()

Now, when we plot the training data and our fitted model by executing the code above, we can see that the overall result looks identical to our GD implementation:

Estimating the coefficient of a regression model via scikit-learn

Note

As an alternative to using machine learning libraries, there is a closed-form solution for solving OLS involving a system of linear equations that can be found in most introductory statistics textbooks:

Estimating the coefficient of a regression model via scikit-learn
Estimating the coefficient of a regression model via scikit-learn

Here, Estimating the coefficient of a regression model via scikit-learn is the mean of the true target values and Estimating the coefficient of a regression model via scikit-learn is the mean of the predicted response.

The advantage of this method is that it is guaranteed to find the optimal solution analytically. However, if we are working with very large datasets, it can be computationally too expensive to invert the matrix in this formula (sometimes also called the normal equation) or the sample matrix may be singular (non-invertible), which is why we may prefer iterative methods in certain cases.

If you are interested in more information on how to obtain the normal equations, I recommend you take a look at Dr. Stephen Pollock's chapter, The Classical Linear Regression Model from his lectures at the University of Leicester, which are available for free at http://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/06mesmet.pdf.

Fitting a robust regression model using RANSAC

Linear regression models can be heavily impacted by the presence of outliers. In certain situations, a very small subset of our data can have a big effect on the estimated model coefficients. There are many statistical tests that can be used to detect outliers, which are beyond the scope of the book. However, removing outliers always requires our own judgment as a data scientist, as well as our domain knowledge.

As an alternative to throwing out outliers, we will look at a robust method of regression using the RANdom SAmple Consensus (RANSAC) algorithm, which fits a regression model to a subset of the data, the so-called inliers.

We can summarize the iterative RANSAC algorithm as follows:

  1. Select a random number of samples to be inliers and fit the model.
  2. Test all other data points against the fitted model and add those points that fall within a user-given tolerance to the inliers.
  3. Refit the model using all inliers.
  4. Estimate the error of the fitted model versus the inliers.
  5. Terminate the algorithm if the performance meets a certain user-defined threshold or if a fixed number of iterations has been reached; go back to step 1 otherwise.

Let's now wrap our linear model in the RANSAC algorithm using scikit-learn's RANSACRegressor object:

>>> from sklearn.linear_model import RANSACRegressor
>>> ransac = RANSACRegressor(LinearRegression(), 
...            max_trials=100, 
...            min_samples=50, 
...            residual_metric=lambda x: np.sum(np.abs(x), axis=1), 
...            residual_threshold=5.0, 
...            random_state=0)
>>> ransac.fit(X, y)

We set the maximum number of iterations of the RANSACRegressor to 100, and using min_samples=50, we set the minimum number of the randomly chosen samples to be at least 50. Using the residual_metric parameter, we provided a callable lambda function that simply calculates the absolute vertical distances between the fitted line and the sample points. By setting the residual_threshold parameter to 5.0, we only allowed samples to be included in the inlier set if their vertical distance to the fitted line is within 5 distance units, which works well on this particular dataset. By default, scikit-learn uses the MAD estimate to select the inlier threshold, where MAD stands for the Median Absolute Deviation of the target values y. However, the choice of an appropriate value for the inlier threshold is problem-specific, which is one disadvantage of RANSAC. Many different approaches have been developed over the recent years to select a good inlier threshold automatically. You can find a detailed discussion in R. Toldo and A. Fusiello's. Automatic Estimation of the Inlier Threshold in Robust Multiple Structures Fitting (in Image Analysis and Processing–ICIAP 2009, pages 123–131. Springer, 2009).

After we have fitted the RANSAC model, let's obtain the inliers and outliers from the fitted RANSAC linear regression model and plot them together with the linear fit:

>>> inlier_mask = ransac.inlier_mask_
>>> outlier_mask = np.logical_not(inlier_mask)
>>> line_X = np.arange(3, 10, 1)
>>> line_y_ransac = ransac.predict(line_X[:, np.newaxis])
>>> plt.scatter(X[inlier_mask], y[inlier_mask], 
...             c='blue', marker='o', label='Inliers')
>>> plt.scatter(X[outlier_mask], y[outlier_mask],
...             c='lightgreen', marker='s', label='Outliers')
>>> plt.plot(line_X, line_y_ransac, color='red')
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper left')
>>> plt.show()

As we can see in the following scatterplot, the linear regression model was fitted on the detected set of inliers shown as circles:

Fitting a robust regression model using RANSAC

When we print the slope and intercept of the model executing the following code, we can see that the linear regression line is slightly different from the fit that we obtained in the previous section without RANSAC:

>>> print('Slope: %.3f' % ransac.estimator_.coef_[0])
Slope: 9.621
>>> print('Intercept: %.3f' % ransac.estimator_.intercept_)
Intercept: -37.137

Using RANSAC, we reduced the potential effect of the outliers in this dataset, but we don't know if this approach has a positive effect on the predictive performance for unseen data. Thus, in the next section we will discuss how to evaluate a regression model for different approaches, which is a crucial part of building systems for predictive modeling.

Evaluating the performance of linear regression models

In the previous section, we discussed how to fit a regression model on training data. However, you learned in previous chapters that it is crucial to test the model on data that it hasn't seen during training to obtain an unbiased estimate of its performance.

As we remember from Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we want to split our dataset into separate training and test datasets where we use the former to fit the model and the latter to evaluate its performance to generalize to unseen data. Instead of proceeding with the simple regression model, we will now use all variables in the dataset and train a multiple regression model:

>>> from sklearn.cross_validation import train_test_split
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
...       X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)

Since our model uses multiple explanatory variables, we can't visualize the linear regression line (or hyperplane to be precise) in a two-dimensional plot, but we can plot the residuals (the differences or vertical distances between the actual and predicted values) versus the predicted values to diagnose our regression model. Those residual plots are a commonly used graphical analysis for diagnosing regression models to detect nonlinearity and outliers, and to check if the errors are randomly distributed.

Using the following code, we will now plot a residual plot where we simply subtract the true target variables from our predicted responses:

>>> plt.scatter(y_train_pred, y_train_pred - y_train, 
...             c='blue', marker='o', label='Training data')
>>> plt.scatter(y_test_pred,  y_test_pred - y_test,
...             c='lightgreen', marker='s', label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

After executing the code, we should see a residual plot with a line passing through the x axis origin as shown here:

Evaluating the performance of linear regression models

In the case of a perfect prediction, the residuals would be exactly zero, which we will probably never encounter in realistic and practical applications. However, for a good regression model, we would expect that the errors are randomly distributed and the residuals should be randomly scattered around the centerline. If we see patterns in a residual plot, it means that our model is unable to capture some explanatory information, which is leaked into the residuals as we can slightly see in our preceding residual plot. Furthermore, we can also use residual plots to detect outliers, which are represented by the points with a large deviation from the centerline.

Another useful quantitative measure of a model's performance is the so-called Mean Squared Error (MSE), which is simply the average value of the SSE cost function that we minimize to fit the linear regression model. The MSE is useful to for comparing different regression models or for tuning their parameters via a grid search and cross-validation:

Evaluating the performance of linear regression models

Execute the following code:

>>> from sklearn.metrics import mean_squared_error
>>> print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))

We will see that the MSE on the training set is 19.96, and the MSE of the test set is much larger with a value of 27.20, which is an indicator that our model is overfitting the training data.

Sometimes it may be more useful to report the coefficient of determination (Evaluating the performance of linear regression models), which can be understood as a standardized version of the MSE, for better interpretability of the model performance. In other words, Evaluating the performance of linear regression models is the fraction of response variance that is captured by the model. The Evaluating the performance of linear regression models value is defined as follows:

Evaluating the performance of linear regression models

Here, SSE is the sum of squared errors and SST is the total sum of squares Evaluating the performance of linear regression models, or in other words, it is simply the variance of the response.

Let's quickly show that Evaluating the performance of linear regression models is indeed just a rescaled version of the MSE:

Evaluating the performance of linear regression models
Evaluating the performance of linear regression models
Evaluating the performance of linear regression models

For the training dataset, Evaluating the performance of linear regression models is bounded between 0 and 1, but it can become negative for the test set. If Evaluating the performance of linear regression models, the model fits the data perfectly with a corresponding Evaluating the performance of linear regression models.

Evaluated on the training data, the Evaluating the performance of linear regression models of our model is 0.765, which doesn't sound too bad. However, the Evaluating the performance of linear regression models on the test dataset is only 0.673, which we can compute by executing the following code:

>>> from sklearn.metrics import r2_score
>>> print('R^2 train: %.3f, test: %.3f' % 
...       (r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))

Using regularized methods for regression

As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, regularization is one approach to tackle the problem of overfitting by adding additional information, and thereby shrinking the parameter values of the model to induce a penalty against complexity. The most popular approaches to regularized linear regression are the so-called Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net method.

Ridge regression is an L2 penalized model where we simply add the squared sum of the weights to our least-squares cost function:

Using regularized methods for regression

Here:

Using regularized methods for regression

By increasing the value of the hyperparameter Using regularized methods for regression, we increase the regularization strength and shrink the weights of our model. Please note that we don't regularize the intercept term Using regularized methods for regression.

An alternative approach that can lead to sparse models is the LASSO. Depending on the regularization strength, certain weights can become zero, which makes the LASSO also useful as a supervised feature selection technique:

Using regularized methods for regression

Here:

Using regularized methods for regression

However, a limitation of the LASSO is that it selects at most Using regularized methods for regression variables if Using regularized methods for regression>Using regularized methods for regression. A compromise between Ridge regression and the LASSO is the Elastic Net, which has a L1 penalty to generate sparsity and a L2 penalty to overcome some of the limitations of the LASSO, such as the number of selected variables.

Using regularized methods for regression

Those regularized regression models are all available via scikit-learn, and the usage is similar to the regular regression model except that we have to specify the regularization strength via the parameter Using regularized methods for regression, for example, optimized via k-fold cross-validation.

A Ridge Regression model can be initialized as follows:

>>> from sklearn.linear_model import Ridge
>>> ridge = Ridge(alpha=1.0)

Note that the regularization strength is regulated alpha, which is similar to the parameter Using regularized methods for regression. Likewise, we can initialize a LASSO regressor from the linear_model submodule:

>>> from sklearn.linear_model import Lasso
>>> lasso = Lasso(alpha=1.0)

Lastly, the ElasticNet implementation allows us to vary the L1 to L2 ratio:

>>> from sklearn.linear_model import ElasticNet
>>> lasso = ElasticNet(alpha=1.0, l1_ratio=0.5)

For example, if we set l1_ratio to 1.0, the ElasticNet regressor would be equal to LASSO regression. For more detailed information about the different implementations of linear regression, please see the documentation at http://scikit-learn.org/stable/modules/linear_model.html.

Turning a linear regression model into a curve – polynomial regression

In the previous sections, we assumed a linear relationship between explanatory and response variables. One way to account for the violation of linearity assumption is to use a polynomial regression model by adding polynomial terms:

Turning a linear regression model into a curve – polynomial regression

Here, Turning a linear regression model into a curve – polynomial regression denotes the degree of the polynomial. Although we can use polynomial regression to model a nonlinear relationship, it is still considered a multiple linear regression model because of the linear regression coefficients Turning a linear regression model into a curve – polynomial regression.

We will now discuss how to use the PolynomialFeatures transformer class from scikit-learn to add a quadratic term (Turning a linear regression model into a curve – polynomial regression) to a simple regression problem with one explanatory variable, and compare the polynomial to the linear fit. The steps are as follows:

  1. Add a second degree polynomial term:
    from sklearn.preprocessing import PolynomialFeatures
    >>> X = np.array([258.0, 270.0, 294.0,
    …                          320.0, 342.0, 368.0,
    …                          396.0, 446.0, 480.0,
    …                          586.0])[:, np.newaxis]
    
    >>> y = np.array([236.4, 234.4, 252.8,
    …                         298.6, 314.2, 342.2,
    …                         360.8, 368.0, 391.2,
    …                         390.8])
    >>> lr = LinearRegression()
    >>> pr = LinearRegression()
    >>> quadratic = PolynomialFeatures(degree=2)
    >>> X_quad = quadratic.fit_transform(X)
  2. Fit a simple linear regression model for comparison:
    >>> lr.fit(X, y)
    >>> X_fit = np.arange(250,600,10)[:, np.newaxis]
    >>> y_lin_fit = lr.predict(X_fit)
  3. Fit a multiple regression model on the transformed features for polynomial regression:
    >>> pr.fit(X_quad, y)
    >>> y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))
    Plot the results:
    >>> plt.scatter(X, y, label='training points')
    >>> plt.plot(X_fit, y_lin_fit, 
    ...          label='linear fit', linestyle='--')
    >>> plt.plot(X_fit, y_quad_fit,
    ...          label='quadratic fit')
    >>> plt.legend(loc='upper left')
    >>> plt.show()

In the resulting plot, we can see that the polynomial fit captures the relationship between the response and explanatory variable much better than the linear fit:

Turning a linear regression model into a curve – polynomial regression
>>> y_lin_pred = lr.predict(X)
>>> y_quad_pred = pr.predict(X_quad)
>>> print('Training MSE linear: %.3f, quadratic: %.3f' % (
...         mean_squared_error(y, y_lin_pred),
...         mean_squared_error(y, y_quad_pred)))
Training MSE linear: 569.780, quadratic: 61.330
>>> print('Training  R^2 linear: %.3f, quadratic: %.3f' % (
...         r2_score(y, y_lin_pred),
...         r2_score(y, y_quad_pred)))
Training  R^2 linear: 0.832, quadratic: 0.982

As we can see after executing the preceding code, the MSE decreased from 570 (linear fit) to 61 (quadratic fit), and the coefficient of determination reflects a closer fit to the quadratic model (Turning a linear regression model into a curve – polynomial regression) as opposed to the linear fit (Turning a linear regression model into a curve – polynomial regression) in this particular toy problem.

Modeling nonlinear relationships in the Housing Dataset

After we discussed how to construct polynomial features to fit nonlinear relationships in a toy problem, let's now take a look at a more concrete example and apply those concepts to the data in the Housing Dataset. By executing the following code, we will model the relationship between house prices and LSTAT (percent lower status of the population) using second degree (quadratic) and third degree (cubic) polynomials and compare it to a linear fit.

The code is as follows:

>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> regr = LinearRegression()

# create polynomial features
>>> quadratic = PolynomialFeatures(degree=2)
>>> cubic = PolynomialFeatures(degree=3)
>>> X_quad = quadratic.fit_transform(X)
>>> X_cubic = cubic.fit_transform(X)

# linear fit
>>> X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
>>> regr = regr.fit(X, y)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y, regr.predict(X))

# quadratic fit
>>> regr = regr.fit(X_quad, y)
>>> y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
>>> quadratic_r2 = r2_score(y, regr.predict(X_quad))

# cubic fit
>>> regr = regr.fit(X_cubic, y)
>>> y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
>>> cubic_r2 = r2_score(y, regr.predict(X_cubic))

# plot results
>>> plt.scatter(X, y, 
...             label='training points', 
...             color='lightgray')
>>> plt.plot(X_fit, y_lin_fit, 
...          label='linear (d=1), $R^2=%.2f$' 
...            % linear_r2, 
...          color='blue', 
...          lw=2, 
...          linestyle=':')
>>> plt.plot(X_fit, y_quad_fit, 
...          label='quadratic (d=2), $R^2=%.2f$' 
...            % quadratic_r2,
...          color='red', 
...          lw=2,
...          linestyle='-')
>>> plt.plot(X_fit, y_cubic_fit, 
...          label='cubic (d=3), $R^2=%.2f$' 
...            % cubic_r2,
...          color='green', 
...          lw=2, 
...          linestyle='--')
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper right')
>>> plt.show()

As we can see in the resulting plot, the cubic fit captures the relationship between the house prices and LSTAT better than the linear and quadratic fit. However, we should be aware that adding more and more polynomial features increases the complexity of a model and therefore increases the chance of overfitting. Thus, in practice, it is always recommended that you evaluate the performance of the model on a separate test dataset to estimate the generalization performance:

Modeling nonlinear relationships in the Housing Dataset

In addition, polynomial features are not always the best choice for modeling nonlinear relationships. For example, just by looking at the MEDV-LSTAT scatterplot, we could propose that a log transformation of the LSTAT feature variable and the square root of MEDV may project the data onto a linear feature space suitable for a linear regression fit. Let's test this hypothesis by executing the following code:

# transform features
>>> X_log = np.log(X)
>>> y_sqrt = np.sqrt(y)

# fit features
>>> X_fit = np.arange(X_log.min()-1, 
...                   X_log.max()+1, 1)[:, np.newaxis]
>>> regr = regr.fit(X_log, y_sqrt)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y_sqrt, regr.predict(X_log))

# plot results
>>> plt.scatter(X_log, y_sqrt,
...             label='training points',
...             color='lightgray')
>>> plt.plot(X_fit, y_lin_fit, 
...          label='linear (d=1), $R^2=%.2f$' % linear_r2, 
...          color='blue', 
...          lw=2)
>>> plt.xlabel('log(% lower status of the population [LSTAT])')
>>> plt.ylabel('$\sqrt{Price \; in \; \$1000\'s [MEDV]}$')
>>> plt.legend(loc='lower left')
>>> plt.show()

After transforming the explanatory onto the log space and taking the square root of the target variables, we were able to capture the relationship between the two variables with a linear regression line that seems to fit the data better (Modeling nonlinear relationships in the Housing Dataset) than any of the polynomial feature transformations previously:

Modeling nonlinear relationships in the Housing Dataset

Dealing with nonlinear relationships using random forests

In this section, we are going to take a look at random forest regression, which is conceptually different from the previous regression models in this chapter. A random forest, which is an ensemble of multiple decision trees, can be understood as the sum of piecewise linear functions in contrast to the global linear and polynomial regression models that we discussed previously. In other words, via the decision tree algorithm, we are subdividing the input space into smaller regions that become more manageable.

Decision tree regression

An advantage of the decision tree algorithm is that it does not require any transformation of the features if we are dealing with nonlinear data. We remember from Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that we grow a decision tree by iteratively splitting its nodes until the leaves are pure or a stopping criterion is satisfied. When we used decision trees for classification, we defined entropy as a measure of impurity to determine which feature split maximizes the Information Gain (IG), which can be defined as follows for a binary split:

Decision tree regression

Here, Decision tree regression is the feature to perform the split, Decision tree regression is the number of samples in the parent node, Decision tree regression is the impurity function, Decision tree regression is the subset of training samples in the parent node, and Decision tree regression and Decision tree regression are the subsets of training samples in the left and right child node after the split. Remember that our goal is to find the feature split that maximizes the information gain, or in other words, we want to find the feature split that reduces the impurities in the child nodes. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we used entropy as a measure of impurity, which is a useful criterion for classification. To use a decision tree for regression, we will replace entropy as the impurity measure of a node Decision tree regression by the MSE:

Decision tree regression

Here, Decision tree regression is the number of training samples at node Decision tree regression, Decision tree regression is the training subset at node Decision tree regression, Decision tree regression is the true target value, and Decision tree regression is the predicted target value (sample mean):

Decision tree regression

In the context of decision tree regression, the MSE is often also referred to as within-node variance, which is why the splitting criterion is also better known as variance reduction. To see what the line fit of a decision tree looks like, let's use the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear relationship between the MEDV and LSTAT variables:

>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
   >>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()

As we can see from the resulting plot, the decision tree captures the general trend in the data. However, a limitation of this model is that it does not capture the continuity and differentiability of the desired prediction. In addition, we need to be careful about choosing an appropriate value for the depth of the tree to not overfit or underfit the data; here, a depth of 3 seems to be a good choice:

Decision tree regression

In the next section, we will take a look at a more robust way for fitting regression trees: random forests.

Random forest regression

As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, the random forest algorithm is an ensemble technique that combines multiple decision trees. A random forest usually has a better generalization performance than an individual decision tree due to randomness that helps to decrease the model variance. Other advantages of random forests are that they are less sensitive to outliers in the dataset and don't require much parameter tuning. The only parameter in random forests that we typically need to experiment with is the number of trees in the ensemble. The basic random forests algorithm for regression is almost identical to the random forest algorithm for classification that we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. The only difference is that we use the MSE criterion to grow the individual decision trees, and the predicted target variable is calculated as the average prediction over all decision trees.

Now, let's use all the features in the Housing Dataset to fit a random forest regression model on 60 percent of the samples and evaluate its performance on the remaining 40 percent. The code is as follows:


>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
...       train_test_split(X, y, 
...                        test_size=0.4, 
...                        random_state=1)

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(                             ...                                n_estimators=1000, 
...                                criterion='mse', 
...                                random_state=1, 
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...        mean_squared_error(y_train, y_train_pred),
...        mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...        r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
MSE train: 3.235, test: 11.635
R^2 train: 0.960, test: 0.871

Unfortunately, we see that the random forest tends to overfit the training data. However, it's still able to explain the relationship between the target and explanatory variables relatively well (Random forest regression on the test dataset).

Lastly, let's also take a look at the residuals of the prediction:

>>> plt.scatter(y_train_pred,  
...             y_train_pred - y_train, 
...             c='black', 
...             marker='o', 
...             s=35,
...             alpha=0.5,
...             label='Training data')
>>> plt.scatter(y_test_pred,  
...             y_test_pred - y_test, 
...             c='lightgreen', 
...             marker='s', 
...             s=35,
...             alpha=0.7,
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

As it was already summarized by the Random forest regression coefficient, we can see that the model fits the training data better than the test data, as indicated by the outliers in the y axis direction. Also, the distribution of the residuals does not seem to be completely random around the zero center point, indicating that the model is not able to capture all the exploratory information. However, the residual plot indicates a large improvement over the residual plot of the linear model that we plotted earlier in this chapter:

Random forest regression

Note

In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we also discussed the kernel trick that can be used in combination with support vector machine (SVM) for classification, which is useful if we are dealing with nonlinear problems. Although a discussion is beyond of the scope of this book, SVMs can also be used in nonlinear regression tasks. The interested reader can find more information about Support Vector Machines for regression in an excellent report by S. R. Gunn: S. R. Gunn et al. Support Vector Machines for Classification and Regression. (ISIS technical report, 14, 1998). An SVM regressor is also implemented in scikit-learn, and more information about its usage can be found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.

Modeling nonlinear relationships in the Housing Dataset

After we discussed how to construct polynomial features to fit nonlinear relationships in a toy problem, let's now take a look at a more concrete example and apply those concepts to the data in the Housing Dataset. By executing the following code, we will model the relationship between house prices and LSTAT (percent lower status of the population) using second degree (quadratic) and third degree (cubic) polynomials and compare it to a linear fit.

The code is as follows:

>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> regr = LinearRegression()

# create polynomial features
>>> quadratic = PolynomialFeatures(degree=2)
>>> cubic = PolynomialFeatures(degree=3)
>>> X_quad = quadratic.fit_transform(X)
>>> X_cubic = cubic.fit_transform(X)

# linear fit
>>> X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
>>> regr = regr.fit(X, y)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y, regr.predict(X))

# quadratic fit
>>> regr = regr.fit(X_quad, y)
>>> y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
>>> quadratic_r2 = r2_score(y, regr.predict(X_quad))

# cubic fit
>>> regr = regr.fit(X_cubic, y)
>>> y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
>>> cubic_r2 = r2_score(y, regr.predict(X_cubic))

# plot results
>>> plt.scatter(X, y, 
...             label='training points', 
...             color='lightgray')
>>> plt.plot(X_fit, y_lin_fit, 
...          label='linear (d=1), $R^2=%.2f$' 
...            % linear_r2, 
...          color='blue', 
...          lw=2, 
...          linestyle=':')
>>> plt.plot(X_fit, y_quad_fit, 
...          label='quadratic (d=2), $R^2=%.2f$' 
...            % quadratic_r2,
...          color='red', 
...          lw=2,
...          linestyle='-')
>>> plt.plot(X_fit, y_cubic_fit, 
...          label='cubic (d=3), $R^2=%.2f$' 
...            % cubic_r2,
...          color='green', 
...          lw=2, 
...          linestyle='--')
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper right')
>>> plt.show()

As we can see in the resulting plot, the cubic fit captures the relationship between the house prices and LSTAT better than the linear and quadratic fit. However, we should be aware that adding more and more polynomial features increases the complexity of a model and therefore increases the chance of overfitting. Thus, in practice, it is always recommended that you evaluate the performance of the model on a separate test dataset to estimate the generalization performance:

Modeling nonlinear relationships in the Housing Dataset

In addition, polynomial features are not always the best choice for modeling nonlinear relationships. For example, just by looking at the MEDV-LSTAT scatterplot, we could propose that a log transformation of the LSTAT feature variable and the square root of MEDV may project the data onto a linear feature space suitable for a linear regression fit. Let's test this hypothesis by executing the following code:

# transform features
>>> X_log = np.log(X)
>>> y_sqrt = np.sqrt(y)

# fit features
>>> X_fit = np.arange(X_log.min()-1, 
...                   X_log.max()+1, 1)[:, np.newaxis]
>>> regr = regr.fit(X_log, y_sqrt)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y_sqrt, regr.predict(X_log))

# plot results
>>> plt.scatter(X_log, y_sqrt,
...             label='training points',
...             color='lightgray')
>>> plt.plot(X_fit, y_lin_fit, 
...          label='linear (d=1), $R^2=%.2f$' % linear_r2, 
...          color='blue', 
...          lw=2)
>>> plt.xlabel('log(% lower status of the population [LSTAT])')
>>> plt.ylabel('$\sqrt{Price \; in \; \$1000\'s [MEDV]}$')
>>> plt.legend(loc='lower left')
>>> plt.show()

After transforming the explanatory onto the log space and taking the square root of the target variables, we were able to capture the relationship between the two variables with a linear regression line that seems to fit the data better (Modeling nonlinear relationships in the Housing Dataset) than any of the polynomial feature transformations previously:

Modeling nonlinear relationships in the Housing Dataset

Dealing with nonlinear relationships using random forests

In this section, we are going to take a look at random forest regression, which is conceptually different from the previous regression models in this chapter. A random forest, which is an ensemble of multiple decision trees, can be understood as the sum of piecewise linear functions in contrast to the global linear and polynomial regression models that we discussed previously. In other words, via the decision tree algorithm, we are subdividing the input space into smaller regions that become more manageable.

Decision tree regression

An advantage of the decision tree algorithm is that it does not require any transformation of the features if we are dealing with nonlinear data. We remember from Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that we grow a decision tree by iteratively splitting its nodes until the leaves are pure or a stopping criterion is satisfied. When we used decision trees for classification, we defined entropy as a measure of impurity to determine which feature split maximizes the Information Gain (IG), which can be defined as follows for a binary split:

Decision tree regression

Here, Decision tree regression is the feature to perform the split, Decision tree regression is the number of samples in the parent node, Decision tree regression is the impurity function, Decision tree regression is the subset of training samples in the parent node, and Decision tree regression and Decision tree regression are the subsets of training samples in the left and right child node after the split. Remember that our goal is to find the feature split that maximizes the information gain, or in other words, we want to find the feature split that reduces the impurities in the child nodes. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we used entropy as a measure of impurity, which is a useful criterion for classification. To use a decision tree for regression, we will replace entropy as the impurity measure of a node Decision tree regression by the MSE:

Decision tree regression

Here, Decision tree regression is the number of training samples at node Decision tree regression, Decision tree regression is the training subset at node Decision tree regression, Decision tree regression is the true target value, and Decision tree regression is the predicted target value (sample mean):

Decision tree regression

In the context of decision tree regression, the MSE is often also referred to as within-node variance, which is why the splitting criterion is also better known as variance reduction. To see what the line fit of a decision tree looks like, let's use the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear relationship between the MEDV and LSTAT variables:

>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
   >>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()

As we can see from the resulting plot, the decision tree captures the general trend in the data. However, a limitation of this model is that it does not capture the continuity and differentiability of the desired prediction. In addition, we need to be careful about choosing an appropriate value for the depth of the tree to not overfit or underfit the data; here, a depth of 3 seems to be a good choice:

Decision tree regression

In the next section, we will take a look at a more robust way for fitting regression trees: random forests.

Random forest regression

As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, the random forest algorithm is an ensemble technique that combines multiple decision trees. A random forest usually has a better generalization performance than an individual decision tree due to randomness that helps to decrease the model variance. Other advantages of random forests are that they are less sensitive to outliers in the dataset and don't require much parameter tuning. The only parameter in random forests that we typically need to experiment with is the number of trees in the ensemble. The basic random forests algorithm for regression is almost identical to the random forest algorithm for classification that we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. The only difference is that we use the MSE criterion to grow the individual decision trees, and the predicted target variable is calculated as the average prediction over all decision trees.

Now, let's use all the features in the Housing Dataset to fit a random forest regression model on 60 percent of the samples and evaluate its performance on the remaining 40 percent. The code is as follows:


>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
...       train_test_split(X, y, 
...                        test_size=0.4, 
...                        random_state=1)

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(                             ...                                n_estimators=1000, 
...                                criterion='mse', 
...                                random_state=1, 
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...        mean_squared_error(y_train, y_train_pred),
...        mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...        r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
MSE train: 3.235, test: 11.635
R^2 train: 0.960, test: 0.871

Unfortunately, we see that the random forest tends to overfit the training data. However, it's still able to explain the relationship between the target and explanatory variables relatively well (Random forest regression on the test dataset).

Lastly, let's also take a look at the residuals of the prediction:

>>> plt.scatter(y_train_pred,  
...             y_train_pred - y_train, 
...             c='black', 
...             marker='o', 
...             s=35,
...             alpha=0.5,
...             label='Training data')
>>> plt.scatter(y_test_pred,  
...             y_test_pred - y_test, 
...             c='lightgreen', 
...             marker='s', 
...             s=35,
...             alpha=0.7,
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

As it was already summarized by the Random forest regression coefficient, we can see that the model fits the training data better than the test data, as indicated by the outliers in the y axis direction. Also, the distribution of the residuals does not seem to be completely random around the zero center point, indicating that the model is not able to capture all the exploratory information. However, the residual plot indicates a large improvement over the residual plot of the linear model that we plotted earlier in this chapter:

Random forest regression

Note

In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we also discussed the kernel trick that can be used in combination with support vector machine (SVM) for classification, which is useful if we are dealing with nonlinear problems. Although a discussion is beyond of the scope of this book, SVMs can also be used in nonlinear regression tasks. The interested reader can find more information about Support Vector Machines for regression in an excellent report by S. R. Gunn: S. R. Gunn et al. Support Vector Machines for Classification and Regression. (ISIS technical report, 14, 1998). An SVM regressor is also implemented in scikit-learn, and more information about its usage can be found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.

Dealing with nonlinear relationships using random forests

In this section, we are going to take a look at random forest regression, which is conceptually different from the previous regression models in this chapter. A random forest, which is an ensemble of multiple decision trees, can be understood as the sum of piecewise linear functions in contrast to the global linear and polynomial regression models that we discussed previously. In other words, via the decision tree algorithm, we are subdividing the input space into smaller regions that become more manageable.

Decision tree regression

An advantage of the decision tree algorithm is that it does not require any transformation of the features if we are dealing with nonlinear data. We remember from Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that we grow a decision tree by iteratively splitting its nodes until the leaves are pure or a stopping criterion is satisfied. When we used decision trees for classification, we defined entropy as a measure of impurity to determine which feature split maximizes the Information Gain (IG), which can be defined as follows for a binary split:

Decision tree regression

Here, Decision tree regression is the feature to perform the split, Decision tree regression is the number of samples in the parent node, Decision tree regression is the impurity function, Decision tree regression is the subset of training samples in the parent node, and Decision tree regression and Decision tree regression are the subsets of training samples in the left and right child node after the split. Remember that our goal is to find the feature split that maximizes the information gain, or in other words, we want to find the feature split that reduces the impurities in the child nodes. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we used entropy as a measure of impurity, which is a useful criterion for classification. To use a decision tree for regression, we will replace entropy as the impurity measure of a node Decision tree regression by the MSE:

Decision tree regression

Here, Decision tree regression is the number of training samples at node Decision tree regression, Decision tree regression is the training subset at node Decision tree regression, Decision tree regression is the true target value, and Decision tree regression is the predicted target value (sample mean):

Decision tree regression

In the context of decision tree regression, the MSE is often also referred to as within-node variance, which is why the splitting criterion is also better known as variance reduction. To see what the line fit of a decision tree looks like, let's use the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear relationship between the MEDV and LSTAT variables:

>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
   >>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()

As we can see from the resulting plot, the decision tree captures the general trend in the data. However, a limitation of this model is that it does not capture the continuity and differentiability of the desired prediction. In addition, we need to be careful about choosing an appropriate value for the depth of the tree to not overfit or underfit the data; here, a depth of 3 seems to be a good choice:

Decision tree regression

In the next section, we will take a look at a more robust way for fitting regression trees: random forests.

Random forest regression

As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, the random forest algorithm is an ensemble technique that combines multiple decision trees. A random forest usually has a better generalization performance than an individual decision tree due to randomness that helps to decrease the model variance. Other advantages of random forests are that they are less sensitive to outliers in the dataset and don't require much parameter tuning. The only parameter in random forests that we typically need to experiment with is the number of trees in the ensemble. The basic random forests algorithm for regression is almost identical to the random forest algorithm for classification that we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. The only difference is that we use the MSE criterion to grow the individual decision trees, and the predicted target variable is calculated as the average prediction over all decision trees.

Now, let's use all the features in the Housing Dataset to fit a random forest regression model on 60 percent of the samples and evaluate its performance on the remaining 40 percent. The code is as follows:


>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
...       train_test_split(X, y, 
...                        test_size=0.4, 
...                        random_state=1)

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(                             ...                                n_estimators=1000, 
...                                criterion='mse', 
...                                random_state=1, 
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...        mean_squared_error(y_train, y_train_pred),
...        mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...        r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
MSE train: 3.235, test: 11.635
R^2 train: 0.960, test: 0.871

Unfortunately, we see that the random forest tends to overfit the training data. However, it's still able to explain the relationship between the target and explanatory variables relatively well (Random forest regression on the test dataset).

Lastly, let's also take a look at the residuals of the prediction:

>>> plt.scatter(y_train_pred,  
...             y_train_pred - y_train, 
...             c='black', 
...             marker='o', 
...             s=35,
...             alpha=0.5,
...             label='Training data')
>>> plt.scatter(y_test_pred,  
...             y_test_pred - y_test, 
...             c='lightgreen', 
...             marker='s', 
...             s=35,
...             alpha=0.7,
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

As it was already summarized by the Random forest regression coefficient, we can see that the model fits the training data better than the test data, as indicated by the outliers in the y axis direction. Also, the distribution of the residuals does not seem to be completely random around the zero center point, indicating that the model is not able to capture all the exploratory information. However, the residual plot indicates a large improvement over the residual plot of the linear model that we plotted earlier in this chapter:

Random forest regression

Note

In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we also discussed the kernel trick that can be used in combination with support vector machine (SVM) for classification, which is useful if we are dealing with nonlinear problems. Although a discussion is beyond of the scope of this book, SVMs can also be used in nonlinear regression tasks. The interested reader can find more information about Support Vector Machines for regression in an excellent report by S. R. Gunn: S. R. Gunn et al. Support Vector Machines for Classification and Regression. (ISIS technical report, 14, 1998). An SVM regressor is also implemented in scikit-learn, and more information about its usage can be found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.

Decision tree regression

An advantage of the decision tree algorithm is that it does not require any transformation of the features if we are dealing with nonlinear data. We remember from Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that we grow a decision tree by iteratively splitting its nodes until the leaves are pure or a stopping criterion is satisfied. When we used decision trees for classification, we defined entropy as a measure of impurity to determine which feature split maximizes the Information Gain (IG), which can be defined as follows for a binary split:

Decision tree regression

Here, Decision tree regression is the feature to perform the split, Decision tree regression is the number of samples in the parent node, Decision tree regression is the impurity function, Decision tree regression is the subset of training samples in the parent node, and Decision tree regression and Decision tree regression are the subsets of training samples in the left and right child node after the split. Remember that our goal is to find the feature split that maximizes the information gain, or in other words, we want to find the feature split that reduces the impurities in the child nodes. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we used entropy as a measure of impurity, which is a useful criterion for classification. To use a decision tree for regression, we will replace entropy as the impurity measure of a node Decision tree regression by the MSE:

Decision tree regression

Here, Decision tree regression is the number of training samples at node Decision tree regression, Decision tree regression is the training subset at node Decision tree regression, Decision tree regression is the true target value, and Decision tree regression is the predicted target value (sample mean):

Decision tree regression

In the context of decision tree regression, the MSE is often also referred to as within-node variance, which is why the splitting criterion is also better known as variance reduction. To see what the line fit of a decision tree looks like, let's use the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear relationship between the MEDV and LSTAT variables:

>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
   >>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()

As we can see from the resulting plot, the decision tree captures the general trend in the data. However, a limitation of this model is that it does not capture the continuity and differentiability of the desired prediction. In addition, we need to be careful about choosing an appropriate value for the depth of the tree to not overfit or underfit the data; here, a depth of 3 seems to be a good choice:

Decision tree regression

In the next section, we will take a look at a more robust way for fitting regression trees: random forests.

Random forest regression

As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, the random forest algorithm is an ensemble technique that combines multiple decision trees. A random forest usually has a better generalization performance than an individual decision tree due to randomness that helps to decrease the model variance. Other advantages of random forests are that they are less sensitive to outliers in the dataset and don't require much parameter tuning. The only parameter in random forests that we typically need to experiment with is the number of trees in the ensemble. The basic random forests algorithm for regression is almost identical to the random forest algorithm for classification that we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. The only difference is that we use the MSE criterion to grow the individual decision trees, and the predicted target variable is calculated as the average prediction over all decision trees.

Now, let's use all the features in the Housing Dataset to fit a random forest regression model on 60 percent of the samples and evaluate its performance on the remaining 40 percent. The code is as follows:


>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
...       train_test_split(X, y, 
...                        test_size=0.4, 
...                        random_state=1)

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(                             ...                                n_estimators=1000, 
...                                criterion='mse', 
...                                random_state=1, 
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...        mean_squared_error(y_train, y_train_pred),
...        mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...        r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
MSE train: 3.235, test: 11.635
R^2 train: 0.960, test: 0.871

Unfortunately, we see that the random forest tends to overfit the training data. However, it's still able to explain the relationship between the target and explanatory variables relatively well (Random forest regression on the test dataset).

Lastly, let's also take a look at the residuals of the prediction:

>>> plt.scatter(y_train_pred,  
...             y_train_pred - y_train, 
...             c='black', 
...             marker='o', 
...             s=35,
...             alpha=0.5,
...             label='Training data')
>>> plt.scatter(y_test_pred,  
...             y_test_pred - y_test, 
...             c='lightgreen', 
...             marker='s', 
...             s=35,
...             alpha=0.7,
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

As it was already summarized by the Random forest regression coefficient, we can see that the model fits the training data better than the test data, as indicated by the outliers in the y axis direction. Also, the distribution of the residuals does not seem to be completely random around the zero center point, indicating that the model is not able to capture all the exploratory information. However, the residual plot indicates a large improvement over the residual plot of the linear model that we plotted earlier in this chapter:

Random forest regression

Note

In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we also discussed the kernel trick that can be used in combination with support vector machine (SVM) for classification, which is useful if we are dealing with nonlinear problems. Although a discussion is beyond of the scope of this book, SVMs can also be used in nonlinear regression tasks. The interested reader can find more information about Support Vector Machines for regression in an excellent report by S. R. Gunn: S. R. Gunn et al. Support Vector Machines for Classification and Regression. (ISIS technical report, 14, 1998). An SVM regressor is also implemented in scikit-learn, and more information about its usage can be found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.

Random forest regression

As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, the random forest algorithm is an ensemble technique that combines multiple decision trees. A random forest usually has a better generalization performance than an individual decision tree due to randomness that helps to decrease the model variance. Other advantages of random forests are that they are less sensitive to outliers in the dataset and don't require much parameter tuning. The only parameter in random forests that we typically need to experiment with is the number of trees in the ensemble. The basic random forests algorithm for regression is almost identical to the random forest algorithm for classification that we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. The only difference is that we use the MSE criterion to grow the individual decision trees, and the predicted target variable is calculated as the average prediction over all decision trees.

Now, let's use all the features in the Housing Dataset to fit a random forest regression model on 60 percent of the samples and evaluate its performance on the remaining 40 percent. The code is as follows:


>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
...       train_test_split(X, y, 
...                        test_size=0.4, 
...                        random_state=1)

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(                             ...                                n_estimators=1000, 
...                                criterion='mse', 
...                                random_state=1, 
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...        mean_squared_error(y_train, y_train_pred),
...        mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...        r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
MSE train: 3.235, test: 11.635
R^2 train: 0.960, test: 0.871

Unfortunately, we see that the random forest tends to overfit the training data. However, it's still able to explain the relationship between the target and explanatory variables relatively well (Random forest regression on the test dataset).

Lastly, let's also take a look at the residuals of the prediction:

>>> plt.scatter(y_train_pred,  
...             y_train_pred - y_train, 
...             c='black', 
...             marker='o', 
...             s=35,
...             alpha=0.5,
...             label='Training data')
>>> plt.scatter(y_test_pred,  
...             y_test_pred - y_test, 
...             c='lightgreen', 
...             marker='s', 
...             s=35,
...             alpha=0.7,
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

As it was already summarized by the Random forest regression coefficient, we can see that the model fits the training data better than the test data, as indicated by the outliers in the y axis direction. Also, the distribution of the residuals does not seem to be completely random around the zero center point, indicating that the model is not able to capture all the exploratory information. However, the residual plot indicates a large improvement over the residual plot of the linear model that we plotted earlier in this chapter:

Random forest regression

Note

In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we also discussed the kernel trick that can be used in combination with support vector machine (SVM) for classification, which is useful if we are dealing with nonlinear problems. Although a discussion is beyond of the scope of this book, SVMs can also be used in nonlinear regression tasks. The interested reader can find more information about Support Vector Machines for regression in an excellent report by S. R. Gunn: S. R. Gunn et al. Support Vector Machines for Classification and Regression. (ISIS technical report, 14, 1998). An SVM regressor is also implemented in scikit-learn, and more information about its usage can be found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.

Summary

At the beginning of this chapter, you learned about using simple linear regression analysis to model the relationship between a single explanatory variable and a continuous response variable. We then discussed a useful explanatory data analysis technique to look at patterns and anomalies in data, which is an important first step in predictive modeling tasks.

We built our first model by implementing linear regression using a gradient-based optimization approach. We then saw how to utilize scikit-learn's linear models for regression and also implement a robust regression technique (RANSAC) as an approach for dealing with outliers. To assess the predictive performance of regression models, we computed the mean sum of squared errors and the related Summary metric. Furthermore, we also discussed a useful graphical approach to diagnose the problems of regression models: the residual plot.

After we discussed how regularization can be applied to regression models to reduce the model complexity and avoid overfitting, we also introduced several approaches to model nonlinear relationships, including polynomial feature transformation and random forest regressors.

We discussed supervised learning, classification, and regression analysis, in great detail throughout the previous chapters. In the next chapter, we are going to discuss another interesting subfield of machine learning: unsupervised learning. In the next chapter, you will learn how to use cluster analysis for finding hidden structures in data in the absence of target variables.

You have been reading a chapter from
Python: Deeper Insights into Machine Learning
Published in: Aug 2016
Publisher: Packt
ISBN-13: 9781787128576
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime