Linear models
We’ve already introduced, at a high level, the idea of OLS regression for a linear model. But this particular combination of squared loss for measuring the risk and a linear model for $\hat{y}$ has some very convenient and simple-to-use properties. This simplicity means that OLS regression is one of the most widely used and studied data science modeling techniques. That is why we are going to look in detail at fitting linear models to data using OLS regression.
To start with, we’ll revisit the squared-loss empirical risk function in Eq. 10 and look at what happens to it when we have a linear model $\hat{y}$. To recap, the squared-loss empirical risk is given by the following:

Eq. 13:
$$R_{\text{emp}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
Now, for a linear model with $d$ features, $x_1, x_2, \ldots, x_d$, we can write the model as follows:

Eq. 14:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_d x_d$$

The vector of model parameters is $\underline{\beta} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_d)^T$. We can write the features in vector form as well. We’ll write it as a row-vector, $\underline{x} = (1, x_1, x_2, \ldots, x_d)$. Doing so allows us to write Eq. 14 in the following form:

Eq. 15:
$$\hat{y} = \underline{x}\,\underline{\beta}$$
We can think of the extra 1 in the feature vector as being a feature value that multiplies the intercept $\beta_0$ in the linear model in Eq. 14. For the $i^{\text{th}}$ datapoint the feature values can be written in vector form, $\underline{x}_i = (x_{i0}, x_{i1}, x_{i2}, \ldots, x_{id})$, with $x_{i0} = 1$ obviously for all $i$. We can combine all the feature vectors from all the datapoints into a data matrix:

Eq. 16:
$$\underline{\underline{X}} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Nd} \end{pmatrix}$$
If we also put all the observed values, $y_1, y_2, \ldots, y_N$, into a vector $\underline{y} = (y_1, y_2, \ldots, y_N)^T$, then we can write the risk function in Eq. 13 in a very succinct form as follows:

Eq. 17:
$$R_{\text{emp}}(\underline{\beta}) = \frac{1}{N}\left(\underline{y} - \underline{\underline{X}}\,\underline{\beta}\right)^T\left(\underline{y} - \underline{\underline{X}}\,\underline{\beta}\right)$$
The data matrix $\underline{\underline{X}}$ is also called the design matrix. This terminology originates from statistics, where often the datapoints and, hence, feature values were part of a scientific experiment to quantify the various influences on the response variable $y$. Being part of a scientific experiment, the feature values $x_{ij}$ were planned in advance; that is, designed.
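To make the matrix notation concrete, here is a minimal sketch using a small made-up dataset (not from the book’s repository). It builds the design matrix of Eq. 16 by prepending a column of ones and confirms that the matrix form of the risk in Eq. 17 agrees with the datapoint-by-datapoint sum in Eq. 13:

import numpy as np

# Toy data: 4 datapoints, 2 features (made-up numbers for illustration)
x_features = np.array([[1.0, 2.0],
                       [2.0, 0.5],
                       [3.0, 1.5],
                       [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 9.0])

# Design matrix: prepend a column of ones for the intercept (Eq. 16)
X = np.c_[np.ones(x_features.shape[0]), x_features]

# An arbitrary choice of parameters (intercept, beta_1, beta_2)
beta = np.array([1.0, 1.5, 0.5])

# Vectorized empirical risk (Eq. 17)
residuals = y - X @ beta
risk_matrix_form = residuals @ residuals / len(y)

# Datapoint-by-datapoint empirical risk (Eq. 13)
risk_sum_form = np.mean([(y[i] - X[i] @ beta)**2 for i in range(len(y))])

print(risk_matrix_form, risk_sum_form)  # the two forms agree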
The optimal values of the model parameters $\underline{\beta}$ are obtained by minimizing the right-hand side of Eq. 17 with respect to $\underline{\beta}$. We’ll denote the optimal values of $\underline{\beta}$ by the symbol $\hat{\underline{\beta}}$. We can use the differential calculus we recapped in Chapter 1 to do the minimization. Differentiating the right-hand side of Eq. 17 with respect to $\underline{\beta}$ and setting the derivatives to zero gives us the following:

Eq. 18:
$$-\frac{2}{N}\,\underline{\underline{X}}^T\left(\underline{y} - \underline{\underline{X}}\,\hat{\underline{\beta}}\right) = \underline{0}$$
Re-arranging Eq. 18, we get the following:
Eq. 19:
$$\underline{\underline{X}}^T\underline{\underline{X}}\,\hat{\underline{\beta}} = \underline{\underline{X}}^T\,\underline{y}$$
We can solve Eq. 19 by applying $\left(\underline{\underline{X}}^T\underline{\underline{X}}\right)^{-1}$ to both the left- and right-hand sides of Eq. 19 to get the following:

Eq. 20:
$$\hat{\underline{\beta}} = \left(\underline{\underline{X}}^T\underline{\underline{X}}\right)^{-1}\underline{\underline{X}}^T\,\underline{y}$$
This solution is very efficient. It is in closed form, meaning we have an equation with the thing we want, $\hat{\underline{\beta}}$, on its own on the left-hand side, and a mathematical expression that doesn’t involve $\hat{\underline{\beta}}$ on the right-hand side. There is no iterative algorithm required. We just perform a couple of matrix calculations, and we have our optimal parameter estimates $\hat{\underline{\beta}}$. That we can obtain a closed-form expression for the parameter estimates is one of the most attractive aspects of OLS regression and part of the reason it is so widely used. We’ll walk through some code examples in a moment to illustrate how easy it is to perform OLS regression.
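As a quick illustration of how little computation Eq. 20 involves, here is a minimal sketch on made-up data (the walkthrough on real data follows shortly). It computes the closed-form estimate directly and cross-checks it against numpy’s built-in least-squares solver:

import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 100 datapoints, 2 features, known "true" parameters
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Closed-form OLS estimate (Eq. 20)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)  # the two estimates agree to numerical precision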
Practical issues
This doesn’t mean the closed-form expression in Eq. 20 is free of problems. Firstly, you’ll recall from Chapter 3 on linear algebra that we can have square matrices that do not have an inverse. It is very possible that the inverse matrix $\left(\underline{\underline{X}}^T\underline{\underline{X}}\right)^{-1}$ does not exist. This happens when there are linear dependencies between the columns of the design matrix $\underline{\underline{X}}$; for example, if one feature is simply a scaled version of another feature, or where combining several features together gives the same numerical values as another feature. In these circumstances, one or more of the features are redundant, since they add no new information.
Secondly, in a modern-day data science setting where we might have many thousands of features in a model, the matrix $\underline{\underline{X}}^T\underline{\underline{X}}$ can be unwieldy to work with if $d$ is of the order of several thousand.
How to deal with these computational issues is beyond the scope of the book, but they are something you should be aware of in case they crop up in a problem you are trying to solve.
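To make the first of these issues concrete, here is a minimal sketch on made-up data in which one feature is an exact copy of another. The matrix $\underline{\underline{X}}^T\underline{\underline{X}}$ is then singular, so the explicit inverse in Eq. 20 breaks down, whereas the Moore-Penrose pseudo-inverse (or a least-squares solver such as np.linalg.lstsq) still returns a usable, if non-unique, set of parameter estimates:

import numpy as np

rng = np.random.default_rng(0)

# Made-up data where the third column of the design matrix is an exact
# copy of the second, so the columns are linearly dependent
x1 = rng.normal(size=50)
X = np.c_[np.ones(50), x1, x1]          # duplicated feature column
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.1, size=50)

try:
    # The explicit inverse in Eq. 20 breaks down here: X^T X is singular
    beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
except np.linalg.LinAlgError as err:
    print("Explicit inverse failed:", err)

# The pseudo-inverse still returns a minimum-norm least-squares solution
beta_pinv = np.linalg.pinv(X) @ y
print(beta_pinv)

Even when $\underline{\underline{X}}^T\underline{\underline{X}}$ is invertible, solvers such as np.linalg.lstsq are generally preferred in practice over forming the explicit inverse, as they are more numerically stable.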
The model residuals
Once we have obtained an estimate $\hat{\underline{\beta}}$ for the model parameters using Eq. 20, we can calculate the residuals. If we denote the residual for the $i^{\text{th}}$ datapoint by $\epsilon_i$, then obviously we have the following:

Eq. 21:
$$\epsilon_i = y_i - \hat{y}_i = y_i - \underline{x}_i\,\hat{\underline{\beta}}$$
What happens if we sum up all the residuals? To answer this question, we make use of Eq. 19 and recall that the first row of $\underline{\underline{X}}^T$ is all ones if our model has an intercept. Re-arranging Eq. 19 as $\underline{\underline{X}}^T\left(\underline{y} - \underline{\underline{X}}\,\hat{\underline{\beta}}\right) = \underline{0}$, the first component of this vector equation is just the sum of the residuals. So, Eq. 19 tells us the following:

Eq. 22:
$$\sum_{i=1}^{N}\epsilon_i = \sum_{i=1}^{N}\left(y_i - \underline{x}_i\,\hat{\underline{\beta}}\right) = 0$$
So, the sum of all the residuals is zero if our model has an intercept.
Let’s see these ideas in action with a code example.
OLS regression code example
The data in the Data/power_plant_output.csv file in the GitHub repository contains measurements of the power output from electricity generation plants. The power (PE) is generated from a combination of gas turbines, steam turbines, and heat recovery steam generators, and so is affected by the environmental conditions in which the turbines operate, such as the ambient temperature (AT) and the steam turbine exhaust vacuum level (V). The dataset consists of 9,568 observations of the PE, AT, and V values. The data is a subset of the publicly available dataset held in the UCI Machine Learning Repository (https://archive.ics.uci.edu/datasets). The original data can be found at https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant.
We’ll use the data to build a linear model of the power output PE as a function of the AT and V values. We will build the linear model in two ways: i) using the Python statsmodels package, and ii) using Eq. 20 via an explicit calculation. The following code example can be found in the Code_Examples_Chap4.ipynb notebook in the GitHub repository. To begin, we need to read in the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Read in the raw data
df = pd.read_csv("../Data/power_plant_output.csv")
We’ll do a quick inspection of the data. First, we’ll compute some summary statistics:
# Use the DataFrame describe() method to get the summary statistics
# of the data
df.describe()
|       | AT          | V           | PE          |
|-------|-------------|-------------|-------------|
| Count | 9568.000000 | 9568.000000 | 9568.000000 |
| Mean  | 19.651231   | 54.305804   | 454.365009  |
| Std   | 7.452473    | 12.707893   | 17.066995   |
| Min   | 1.810000    | 25.360000   | 420.260000  |
| 25%   | 13.510000   | 41.740000   | 439.750000  |
| 50%   | 20.345000   | 52.080000   | 451.550000  |
| 75%   | 25.720000   | 66.540000   | 468.430000  |
| Max   | 37.110000   | 81.560000   | 495.760000  |
Table 4.1: Summary statistics for the power-plant dataset
Next, we’ll visualize the relationship between the response variable (the target variable) and the features. We’ll start with the relationship between power output (PE) and ambient temperature (AT):
# Scatterplot between the response variable PE and the AT feature.
# The linear relationship is clear.
plt.scatter(df.AT, df.PE)
plt.title('PE vs AT', fontsize=24)
plt.xlabel('AT', fontsize=20)
plt.ylabel('PE', fontsize=20)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()
Figure 4.5: Plot of power output (PE) versus ambient temperature (AT)
Now, let’s look at the relationship between power output and the vacuum level (V):
# Scatterplot between the response variable PE and the V feature.
# The linear relationship is clear, but not as strong as the
# relationship with the AT feature.
plt.scatter(df.V, df.PE)
plt.title('PE vs V', fontsize=24)
plt.xlabel('V', fontsize=20)
plt.ylabel('PE', fontsize=20)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()
Figure 4.6: Plot of power output (PE) versus vacuum level (V)
Now, we’ll fit a linear model using the statsmodels package. The linear model formula is specified in statistical notation as PE ~ AT + V. You can think of it as the statistical formula equivalent of the mathematical formula $PE = \beta_0 + \beta_{AT}\,x_{AT} + \beta_{V}\,x_{V}$. We do this fitting using the following code:
# First we specify the model using statsmodels.formula.api.ols
model = smf.ols(formula='PE ~ AT + V', data=df)

# Now we fit the model to the data, i.e. we minimize the sum-of-
# squared residuals with respect to the model parameters
model_result = model.fit()

# Now we'll look at a summary of the fitted OLS model
model_result.summary()
This gives the following parameter estimates for our linear model:
OLS Regression Results

|           | coef     | std err | t         | P>\|t\| | [0.025  | 0.975]  |
|-----------|----------|---------|-----------|---------|---------|---------|
| Intercept | 505.4774 | 0.240   | 2101.855  | 0.000   | 505.006 | 505.949 |
| AT        | -1.7043  | 0.013   | -134.429  | 0.000   | -1.729  | -1.679  |
| V         | -0.3245  | 0.007   | -43.644   | 0.000   | -0.339  | -0.310  |
Table 4.2: OLS regression parameter estimates for our power-plant linear model
We can see from Table 4.2 that, as expected, we get negative estimates for the parameters corresponding to the AT and V features. Now, we’ll repeat the calculation explicitly using the formula in Eq. 20. We’ll use the linear algebra functions available to us in numpy. First, we need to extract the data from the pandas DataFrame to appropriate numpy arrays:
# We extract the design matrix as a 2D numpy array.
# This initially corresponds to the feature columns of the dataframe.
# In this case it is all but the last column
X = df.iloc[:, 0:(df.shape[1]-1)].to_numpy()

# Now we'll add a column of ones to the design matrix.
# This is the feature that corresponds to the intercept parameter
# in the model
X = np.c_[np.ones(X.shape[0]), X]

# For convenience, we'll create and store the transpose of the
# design matrix
xT = np.transpose(X)

# Now we'll extract the response vector to a numpy array
y = df.iloc[:, df.shape[1]-1].to_numpy()
Now, we can calculate the OLS parameter estimates using the formula in Eq. 20, $\hat{\underline{\beta}} = \left(\underline{\underline{X}}^T\underline{\underline{X}}\right)^{-1}\underline{\underline{X}}^T\,\underline{y}$:
# Calculate the inverse of xTx using the numpy linear algebra
# functions
xTx_inv = np.linalg.inv(np.matmul(xT, X))

# Finally calculate the OLS model parameter estimates using the
# formula (xTx_inv)*(xT*y).
# Again, we use the numpy linear algebra functions to do this
ols_params = np.matmul(xTx_inv, np.matmul(xT, y))
We can compare the OLS parameter estimates obtained from statsmodels with those obtained from the explicit calculation:
# Now compare the parameter estimates from the explicit calculation
# with those obtained from the statsmodels fit
df_compare = pd.DataFrame({'statsmodels': model_result.params,
                           'explicit_ols': ols_params})
df_compare
|           | statsmodels | explicit_ols |
|-----------|-------------|--------------|
| Intercept | 505.477434  | 505.477434   |
| AT        | -1.704266   | -1.704266    |
| V         | -0.324487   | -0.324487    |
Table 4.3: A comparison of the parameter estimates from the statsmodels package and the explicit calculation using the OLS formula
The parameter estimates from the two different OLS regression calculations are identical to at least six decimal places.
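As a final check, we can verify the residual property in Eq. 22 numerically by reusing the X, y, and ols_params arrays created above. Because our model contains an intercept, the residuals should sum to zero, up to floating-point rounding:

# Residuals of the fitted model (Eq. 21)
residuals = y - np.matmul(X, ols_params)

# With an intercept in the model the residuals sum to zero (Eq. 22),
# up to floating-point rounding error
print(residuals.sum())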
This walkthrough of a real example highlights the power of the closed-form OLS regression formula in Eq. 20. This closed form arises from the linear (in $\underline{\beta}$) nature of the optimality criterion in Eq. 18, which itself arises from the quadratic nature of the risk function in Eq. 17, which ultimately is a consequence of the quadratic form of the squared-loss function in Eq. 8.
But what if we don’t want to use a linear model or a squared-loss function? Firstly, we can’t use OLS regression! Secondly, a different choice of loss function, such as the absolute loss or the pseudo-Huber robust loss function in Eq. 12, will not lead to a closed-form solution for $\hat{\underline{\beta}}$ if we minimize the empirical risk in Eq. 3. So, how do we minimize the empirical risk to obtain optimal model parameter estimates in these situations? We’ll learn how to address this question in the next section, but for now, let’s review what we have learned in this section.
What we learned
In this section, we have learned the following:
- How to write the empirical risk for OLS regression in matrix notation
- How to derive a closed-form expression for OLS model parameter estimates
- Some of the properties and practical limitations of OLS regression
- How to perform OLS regression using available Python packages such as statsmodels
- How to perform OLS regression by explicitly calculating the closed-form formula for OLS model parameter estimates
Having learned how to perform OLS regression, we’ll now learn how to perform least squares regression in more general settings by using gradient descent techniques to minimize the empirical risk.