Packt+ | Advance your knowledge in tech

You're reading from Applied Supervised Learning with Python Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning

Product type Paperback

Published in Apr 2019

Publisher

ISBN-13 9781789954920

Length 404 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Authors (2):

Ishita Mathur

Benjamin Johnston

View More author details

Table of Contents (9) Chapters

Applied Supervised Learning with Python

Preface

1. Python Machine Learning Toolkit FREE CHAPTER

2. Exploratory Data Analysis and Visualization

3. Regression Analysis

4. Classification

5. Ensemble Modeling

6. Model Evaluation

Appendix

Chapter 3: Regression Analysis

Activity 5: Plotting Data with a Moving Average

Solution

Load the dataset into a pandas DataFrame from the CSV file:
```
df = pd.read_csv('austin_weather.csv')
df.head()
```
The output will show the initial five rows of the austin_weather.csv file:
Figure 3.74: The first five rows of the Austin weather data
Since we only need the Date and TempAvgF columns, we'll remove all others from the dataset:
```
df = df[['Date', 'TempAvgF']]
df.head()
```
The output will be:
Figure 3.75: Date and TempAvgF columns of the Austin weather data
Initially, we are only interested in the first year's data, so we need to extract that information only. Create a column in the DataFrame for the year value, extract the year value as an integer from the strings in the Date column, and assign these values to the Year column. Note that temperatures are recorded daily:
```
df['Year'] = [int(dt[:4]) for dt in df.Date]
df.head()
```
The output will be:
Figure 3.76: Extracting the year
Repeat this process to extract the month values and store the values as integers in a Month column:
```
df['Month'] = [int(dt[5:7]) for dt in df.Date]
df.head()
```
The output will be:
Figure 3.77: Extracting the month
Copy the first year's worth of data to a DataFrame:
```
df_first_year = df[:365]
df_first_year.head()
```
The output will be as follows:
Figure 3.78: Copied data to new DataFrame

Compute a 20-day moving average filter:

window = 20
rolling = df_first_year.TempAvgF.rolling(window).mean();
rolling.head(n=20)

The output will be:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12      NaN
13      NaN
14      NaN
15      NaN
16      NaN
17      NaN
18      NaN
19    47.75
Name: TempAvgF, dtype: float64

Plot the raw data and the moving average signal, with the x axis as the day number in the year:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(range(1, 366), df_first_year.TempAvgF, label='Raw Data');
ax.plot(range(1, 366), rolling, c='r', label=f'{window} day moving average');

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.79: Scatter plot of temperature throughout the year

Activity 6: Linear Regression Using the Least Squares Method

Solution

Visualize the measurements:
```
df.head()
```
The output will be as follows:
Figure 3.80: First five rows of activity2_measurements.csv dataset
Visualize the rolling average values:
```
rolling.head(n=30)
```
The output will be as follows:
Figure 3.81: Rolling head average
Create a linear regression model using the default parameters; that is, calculate a y intercept for the model and do not normalize the data:
```
model = LinearRegression()
model
```
The output will be as follows:
```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
```
Now fit the model, where the input data is the day number for the year (1 to 365) and the output is the average temperature. To make later calculations easier, insert a column (DayOfYear) that corresponds with the day of the year for that measurement:
```
df_first_year.loc[:,'DayOfYear'] = [i + 1 for i in df_first_year.index]
df_first_year.head()
```
The output will be as follows:
Figure 3.82: Adding day of year column

Fit the model with the DayOfYear values as the input and df_first_year.TempAvgF as the output:

# Note the year values need to be provided as an N x 1 array
model.fit(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Print the parameters of the model:

print(f'm = {model.coef_[0]}')
print(f'c = {model.intercept_}')

print('\nModel Definition')
print(f'y = {model.coef_[0]:0.4}x + {model.intercept_:0.4f}')

The output will be:

m = 0.04909173467448788
c = 60.28196597922625

Model Definition
y = 0.04909x + 60.2820

We can calculate the trendline values by using the first, middle, and last values (days in years) in the linear equation:

trend_x = np.array([
    1,
    182.5,
    365
])

trend_y = model.predict(trend_x.reshape((-1, 1)))
trend_y

The output will be as follows:

array([60.33105771, 69.24120756, 78.20044914])

Plot these values with the trendline:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(trend_x, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.83: Scatterplot of temperature thought the year with the predicted trendline

Evaluate the performance of the model. How well does the model fit the data? Calculate the r2 score to find out:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be:

r2 score = 0.1222

Activity 7: Dummy Variables

Solution

Plot the raw data (df) and moving average (rolling):

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be:

Figure 3.84: Scatterplot of Temperature throughout the year

Looking at the preceding plot, there seems to be an inflection point around day 250. Create a dummy variable to introduce this feature into the linear model:
```
df_first_year.loc[:,'inflection'] = [1 * int(i < 250) for i in df_first_year.DayOfYear]
```
Check the first and last samples to confirm that the dummy variable is correct. Check the first five samples:
```
df_first_year.head()
```
The output will be as follows:
Figure 3.85: First five columns
Then, check the last five samples:
```
df_first_year.tail()
```
The output will be:
Figure 3.86: Last five columns

Use a least squares linear regression model and fit the model to the DayOfYear values and the dummy variable to predict TempAvgF:

# Note the year values need to be provided as an N x 1 array
model = LinearRegression()
model.fit(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Compute the r2 score:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be as follows:

r2 score = 0.3631

Using the DayOfYear values, create a set of predictions using the model to construct a trendline:

trend_y = model.predict(df_first_year[['DayOfYear', 'inflection']].values)
trend_y

The output will be:

array([51.60311133, 51.74622654, 51.88934175, 52.03245696, 52.17557217,
       52.31868739, 52.4618026 , 52.60491781, 52.74803302, 52.89114823,
       53.03426345, 53.17737866, 53.32049387, 53.46360908, 53.60672429,
       53.7498395 , 53.89295472, 54.03606993, 54.17918514, 54.32230035,
       54.46541556, 54.60853078, 54.75164599, 54.8947612 , 55.03787641,
…
…
       73.88056649, 74.0236817 , 74.16679692, 74.30991213, 74.45302734,
       74.59614255, 74.73925776, 74.88237297, 75.02548819, 75.1686034 ,
       75.31171861, 75.45483382, 75.59794903, 75.74106425, 75.88417946,
       76.02729467, 76.17040988, 76.31352509, 76.4566403 , 76.59975552,
       76.74287073, 76.88598594, 77.02910115, 77.17221636, 77.31533157])

Plot the trendline against the data and the moving average:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.87: Predicted trendline

Activity 8: Other Model Types with Linear Regression

Solution

Use a sine curve function as the basis of the model:

# Using a sine curve
df_first_year['DayOfYear2'] = np.sin(df_first_year['DayOfYear'] / df_first_year['DayOfYear'].max())
df_first_year.head()

The output will be as follows:

Figure 3.88: First five rows

Fit the model:

# Note the year values need to be provided as an N x 1 array
model = LinearRegression()
model.fit(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Print the parameters of the model:

print(f'a = {model.coef_[0]}')
print(f'm = {model.coef_[1]}')
print(f'c = {model.intercept_}')

print('\nModel Definition')
print(f'y = {model.coef_[0]:0.4}x^2 + {model.coef_[1]:0.4}x + {model.intercept_:0.4f}')

The output will be as follows:

a = 634.322313570282
m = -1.4371290614190075
c = 39.93286585807408

Model Definition
y = 634.3x^2 + -1.437x + 39.9329

Compute the r2 value to measure the performance:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be:

r2 score = 0.7047

Construct the trendline values:

trend_y = model.predict(df_first_year[['DayOfYear2', 'DayOfYear']].values)
trend_y

The output will be:

array([40.23360397, 40.53432905, 40.83502803, 41.13568788, 41.43629555,
       41.736838  , 42.03730219, 42.33767507, 42.6379436 , 42.93809474,
       43.23811546, 43.5379927 , 43.83771344, 44.13726463, 44.43663324,
       44.73580624, 45.03477059, 45.33351327, 45.63202123, 45.93028146,
       46.22828093, 46.52600661, 46.82344549, 47.12058453, 47.41741073,
…
…
       59.96306563, 59.55705293, 59.14720371, 58.73351024, 58.31596484,
       57.89455987, 57.46928769, 57.04014072, 56.60711138, 56.17019215,
       55.7293755 , 55.28465397, 54.83602011, 54.38346649, 53.92698572,
       53.46657045, 53.00221334, 52.53390709, 52.06164442, 51.58541811,
       51.10522093, 50.62104569, 50.13288526, 49.6407325 , 49.14458033])

Plot the trendline with the raw data and the moving average:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.89: Predicted trendline

Activity 9: Gradient Descent

Solution

Create a generic gradient descent model and normalize the day of year values as between 0 and 1:

grad_model = SGDRegressor(max_iter=None, tol=1e-3)
_x = df_first_year.DayOfYear / df_first_year.DayOfYear.max()

Fit the model:

grad_model.fit(_x.values.reshape((-1, 1)), df_first_year.TempAvgF)

The output will be as follows:

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='invscaling', loss='squared_loss', max_iter=None,
       n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
       verbose=0, warm_start=False)

Print the details of the model:

print(f'm = {grad_model.coef_[0]}')
print(f'c = {grad_model.intercept_[0]}')

print('\nModel Definition')
print(f'y = {grad_model.coef_[0]:0.4}x + {grad_model.intercept_[0]:0.4f}')

The output will be as follows:

m = 26.406162532140563
c = 55.07470859678077

Model Definition
y = 26.41x + 55.0747

Prepare the x (_trend_x) trendline values by dividing them by the maximum. Predict y_trend_values using the gradient descent model:
```
_trend_x = trend_x / trend_x.max()
trend_y = grad_model.predict(_trend_x.reshape((-1, 1)))
trend_y
```
The output will be as follows:
```
array([55.14705425, 68.27778986, 81.48087113])
```

Plot the data and the moving average with the trendline:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(trend_x, trend_y, c='k', linestyle='--', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.90: Gradient descent predicted trendline

Activity 10: Autoregressors

Solution

Plot the complete set of average temperature values (df.TempAvgF) with years on the x axis:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values);
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');

The output will be:

Figure 3.91: Plot of temperature through the year

Create a 20-day lag and plot the lagged data on the original dataset:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(df.TempAvgF.shift(20), c='r', linestyle='--',
    label='Lag 20');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.legend();

The output will be:

Figure 3.92: Plot of temperature through the years with a 20-day lag

Construct an autocorrelation plot to see whether the average temperature can be used with an autoregressor. Where is the lag acceptable and where is it not for an autoregressor?
```
plt.figure(figsize=(10, 7))
pd.plotting.autocorrelation_plot(df.TempAvgF);
```
We'll get the following output:
Figure 3.93: Plot of autocorrelation versus lag
The lag is acceptable only when the autocorrelation line lies outside the 99% confidence bounds, as represented by the dashed lines.
Chose an acceptable lag and an unacceptable lag and construct lag plots using these values for acceptable lag:
```
plt.figure(figsize=(10,7))
ax = pd.plotting.lag_plot(df.TempAvgF, lag=5);
```
We'll get the following output:
Figure 3.94: Plot of acceptable lag
Use these values for unacceptable lag:
```
plt.figure(figsize=(10,7))
ax = pd.plotting.lag_plot(df.TempAvgF, lag=1000);
```
We'll get the following output:
Figure 3.95: Plot of unacceptable lag
Create an autoregressor model, note the selected lag, calculate the R2 value, and plot the autoregressor model with the original plot. The model is to project past the available data by 1,000 samples:
```
from statsmodels.tsa.ar_model import AR
model = AR(df.TempAvgF)
```

Fit the model to the data:

model_fit = model.fit()
print('Lag: %s' % model_fit.k_ar)
print('Coefficients: %s' % model_fit.params)

The output will be:

Lag: 23
Coefficients: const           1.909395
L1.TempAvgF     0.912076
L2.TempAvgF    -0.334043
L3.TempAvgF     0.157353
L4.TempAvgF     0.025721
L5.TempAvgF     0.041342
L6.TempAvgF     0.030831
L7.TempAvgF    -0.021230
L8.TempAvgF     0.020324
L9.TempAvgF     0.025147
L10.TempAvgF    0.059739
L11.TempAvgF   -0.017337
L12.TempAvgF    0.043553
L13.TempAvgF   -0.027795
L14.TempAvgF    0.053547
L15.TempAvgF    0.013070
L16.TempAvgF   -0.033157
L17.TempAvgF   -0.000072
L18.TempAvgF   -0.026307
L19.TempAvgF    0.025258
L20.TempAvgF    0.038341
L21.TempAvgF    0.007885
L22.TempAvgF   -0.008889
L23.TempAvgF   -0.011080
dtype: float64

Create a set of predictions for 1,000 days after the last sample:

predictions = model_fit.predict(start=model_fit.k_ar, end=len(df) + 1000)
predictions[:10].values

The output will be:

array([54.81171857, 56.89097085, 56.41891585, 50.98627626, 56.11843512,
       53.20665111, 55.13941554, 58.4679288 , 61.92497136, 49.46049801])

Plot the predictions, as well as the original dataset:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(predictions, c='g', linestyle=':', label='Predictions');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.legend();

The output will be:

Figure 3.96: Plot of temperature through the years

Enhance the view to look for differences by showing the 100th to 200th samples:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(predictions, c='g', linestyle=':', label='Predictions');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.xlim([100, 200])
plt.legend();

We'll get the following output: