Chapter 3: Regression Analysis
Activity 5: Plotting Data with a Moving Average
Solution
Load the dataset into a pandas DataFrame from the CSV file:
df = pd.read_csv('austin_weather.csv') df.head()
The output will show the initial five rows of the austin_weather.csv file:
Since we only need the Date and TempAvgF columns, we'll remove all others from the dataset:
df = df[['Date', 'TempAvgF']] df.head()
The output will be:
Initially, we are only interested in the first year's data, so we need to extract that information only. Create a column in the DataFrame for the year value, extract the year value as an integer from the strings in the Date column, and assign these values to the Year column. Note that temperatures are recorded daily:
df['Year'] = [int(dt[:4]) for dt in df.Date] df.head()
The output will be:
Repeat this process to extract the month values and store the values as integers in a Month column:
df['Month'] = [int(dt[5:7]) for dt in df.Date] df.head()
The output will be:
Copy the first year's worth of data to a DataFrame:
df_first_year = df[:365] df_first_year.head()
The output will be as follows:
Compute a 20-day moving average filter:
window = 20 rolling = df_first_year.TempAvgF.rolling(window).mean(); rolling.head(n=20)
The output will be:
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN 5 NaN 6 NaN 7 NaN 8 NaN 9 NaN 10 NaN 11 NaN 12 NaN 13 NaN 14 NaN 15 NaN 16 NaN 17 NaN 18 NaN 19 47.75 Name: TempAvgF, dtype: float64
Plot the raw data and the moving average signal, with the x axis as the day number in the year:
fig = plt.figure(figsize=(10, 7)) ax = fig.add_axes([1, 1, 1, 1]); # Temp measurements ax.scatter(range(1, 366), df_first_year.TempAvgF, label='Raw Data'); ax.plot(range(1, 366), rolling, c='r', label=f'{window} day moving average'); ax.set_title('Daily Mean Temperature Measurements') ax.set_xlabel('Day') ax.set_ylabel('Temperature (degF)') ax.set_xticks(range(1, 366), 10) ax.legend();
The output will be as follows:
Activity 6: Linear Regression Using the Least Squares Method
Solution
Visualize the measurements:
df.head()
The output will be as follows:
Visualize the rolling average values:
rolling.head(n=30)
The output will be as follows:
Create a linear regression model using the default parameters; that is, calculate a y intercept for the model and do not normalize the data:
model = LinearRegression() model
The output will be as follows:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Now fit the model, where the input data is the day number for the year (1 to 365) and the output is the average temperature. To make later calculations easier, insert a column (DayOfYear) that corresponds with the day of the year for that measurement:
df_first_year.loc[:,'DayOfYear'] = [i + 1 for i in df_first_year.index] df_first_year.head()
The output will be as follows:
Fit the model with the DayOfYear values as the input and df_first_year.TempAvgF as the output:
# Note the year values need to be provided as an N x 1 array model.fit(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)
The output will be as follows:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Print the parameters of the model:
print(f'm = {model.coef_[0]}') print(f'c = {model.intercept_}') print('\nModel Definition') print(f'y = {model.coef_[0]:0.4}x + {model.intercept_:0.4f}')
The output will be:
m = 0.04909173467448788 c = 60.28196597922625 Model Definition y = 0.04909x + 60.2820
We can calculate the trendline values by using the first, middle, and last values (days in years) in the linear equation:
trend_x = np.array([ 1, 182.5, 365 ]) trend_y = model.predict(trend_x.reshape((-1, 1))) trend_y
The output will be as follows:
array([60.33105771, 69.24120756, 78.20044914])
Plot these values with the trendline:
fig = plt.figure(figsize=(10, 7)) ax = fig.add_axes([1, 1, 1, 1]); # Temp measurements ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data'); ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average'); ax.plot(trend_x, trend_y, c='k', label='Model: Predicted trendline') ax.set_title('Daily Mean Temperature Measurements') ax.set_xlabel('Day') ax.set_ylabel('Temperature (degF)') ax.set_xticks(range(1, 366), 10) ax.legend();
The output will be as follows:
Evaluate the performance of the model. How well does the model fit the data? Calculate the r2 score to find out:
# Note the year values need to be provided as an N x 1 array r2 = model.score(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF) print(f'r2 score = {r2:0.4f}')
The output will be:
r2 score = 0.1222
Activity 7: Dummy Variables
Solution
Plot the raw data (df) and moving average (rolling):
fig = plt.figure(figsize=(10, 7)) ax = fig.add_axes([1, 1, 1, 1]); # Temp measurements ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data'); ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average'); ax.set_title('Daily Mean Temperature Measurements') ax.set_xlabel('Day') ax.set_ylabel('Temperature (degF)') ax.set_xticks(range(1, 366), 10) ax.legend();
The output will be:
Looking at the preceding plot, there seems to be an inflection point around day 250. Create a dummy variable to introduce this feature into the linear model:
df_first_year.loc[:,'inflection'] = [1 * int(i < 250) for i in df_first_year.DayOfYear]
Check the first and last samples to confirm that the dummy variable is correct. Check the first five samples:
df_first_year.head()
The output will be as follows:
Then, check the last five samples:
df_first_year.tail()
The output will be:
Use a least squares linear regression model and fit the model to the DayOfYear values and the dummy variable to predict TempAvgF:
# Note the year values need to be provided as an N x 1 array model = LinearRegression() model.fit(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)
The output will be as follows:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Compute the r2 score:
# Note the year values need to be provided as an N x 1 array r2 = model.score(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF) print(f'r2 score = {r2:0.4f}')
The output will be as follows:
r2 score = 0.3631
Using the DayOfYear values, create a set of predictions using the model to construct a trendline:
trend_y = model.predict(df_first_year[['DayOfYear', 'inflection']].values) trend_y
The output will be:
array([51.60311133, 51.74622654, 51.88934175, 52.03245696, 52.17557217, 52.31868739, 52.4618026 , 52.60491781, 52.74803302, 52.89114823, 53.03426345, 53.17737866, 53.32049387, 53.46360908, 53.60672429, 53.7498395 , 53.89295472, 54.03606993, 54.17918514, 54.32230035, 54.46541556, 54.60853078, 54.75164599, 54.8947612 , 55.03787641, … … 73.88056649, 74.0236817 , 74.16679692, 74.30991213, 74.45302734, 74.59614255, 74.73925776, 74.88237297, 75.02548819, 75.1686034 , 75.31171861, 75.45483382, 75.59794903, 75.74106425, 75.88417946, 76.02729467, 76.17040988, 76.31352509, 76.4566403 , 76.59975552, 76.74287073, 76.88598594, 77.02910115, 77.17221636, 77.31533157])
Plot the trendline against the data and the moving average:
fig = plt.figure(figsize=(10, 7)) ax = fig.add_axes([1, 1, 1, 1]); # Temp measurements ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data'); ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average'); ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline') ax.set_title('Daily Mean Temperature Measurements') ax.set_xlabel('Day') ax.set_ylabel('Temperature (degF)') ax.set_xticks(range(1, 366), 10) ax.legend();
The output will be as follows:
Activity 8: Other Model Types with Linear Regression
Solution
Use a sine curve function as the basis of the model:
# Using a sine curve df_first_year['DayOfYear2'] = np.sin(df_first_year['DayOfYear'] / df_first_year['DayOfYear'].max()) df_first_year.head()
The output will be as follows:
Fit the model:
# Note the year values need to be provided as an N x 1 array model = LinearRegression() model.fit(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)
The output will be as follows:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Print the parameters of the model:
print(f'a = {model.coef_[0]}') print(f'm = {model.coef_[1]}') print(f'c = {model.intercept_}') print('\nModel Definition') print(f'y = {model.coef_[0]:0.4}x^2 + {model.coef_[1]:0.4}x + {model.intercept_:0.4f}')
The output will be as follows:
a = 634.322313570282 m = -1.4371290614190075 c = 39.93286585807408 Model Definition y = 634.3x^2 + -1.437x + 39.9329
Compute the r2 value to measure the performance:
# Note the year values need to be provided as an N x 1 array r2 = model.score(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF) print(f'r2 score = {r2:0.4f}')
The output will be:
r2 score = 0.7047
Construct the trendline values:
trend_y = model.predict(df_first_year[['DayOfYear2', 'DayOfYear']].values) trend_y
The output will be:
array([40.23360397, 40.53432905, 40.83502803, 41.13568788, 41.43629555, 41.736838 , 42.03730219, 42.33767507, 42.6379436 , 42.93809474, 43.23811546, 43.5379927 , 43.83771344, 44.13726463, 44.43663324, 44.73580624, 45.03477059, 45.33351327, 45.63202123, 45.93028146, 46.22828093, 46.52600661, 46.82344549, 47.12058453, 47.41741073, … … 59.96306563, 59.55705293, 59.14720371, 58.73351024, 58.31596484, 57.89455987, 57.46928769, 57.04014072, 56.60711138, 56.17019215, 55.7293755 , 55.28465397, 54.83602011, 54.38346649, 53.92698572, 53.46657045, 53.00221334, 52.53390709, 52.06164442, 51.58541811, 51.10522093, 50.62104569, 50.13288526, 49.6407325 , 49.14458033])
Plot the trendline with the raw data and the moving average:
fig = plt.figure(figsize=(10, 7)) ax = fig.add_axes([1, 1, 1, 1]); # Temp measurements ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data'); ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average'); ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline') ax.set_title('Daily Mean Temperature Measurements') ax.set_xlabel('Day') ax.set_ylabel('Temperature (degF)') ax.set_xticks(range(1, 366), 10) ax.legend();
The output will be as follows:
Activity 9: Gradient Descent
Solution
Create a generic gradient descent model and normalize the day of year values as between 0 and 1:
grad_model = SGDRegressor(max_iter=None, tol=1e-3) _x = df_first_year.DayOfYear / df_first_year.DayOfYear.max()
Fit the model:
grad_model.fit(_x.values.reshape((-1, 1)), df_first_year.TempAvgF)
The output will be as follows:
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1, eta0=0.01, fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling', loss='squared_loss', max_iter=None, n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None, shuffle=True, tol=None, validation_fraction=0.1, verbose=0, warm_start=False)
Print the details of the model:
print(f'm = {grad_model.coef_[0]}') print(f'c = {grad_model.intercept_[0]}') print('\nModel Definition') print(f'y = {grad_model.coef_[0]:0.4}x + {grad_model.intercept_[0]:0.4f}')
The output will be as follows:
m = 26.406162532140563 c = 55.07470859678077 Model Definition y = 26.41x + 55.0747
Prepare the x (_trend_x) trendline values by dividing them by the maximum. Predict y_trend_values using the gradient descent model:
_trend_x = trend_x / trend_x.max() trend_y = grad_model.predict(_trend_x.reshape((-1, 1))) trend_y
The output will be as follows:
array([55.14705425, 68.27778986, 81.48087113])
Plot the data and the moving average with the trendline:
fig = plt.figure(figsize=(10, 7)) ax = fig.add_axes([1, 1, 1, 1]); # Temp measurements ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data'); ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average'); ax.plot(trend_x, trend_y, c='k', linestyle='--', label='Model: Predicted trendline') ax.set_title('Daily Mean Temperature Measurements') ax.set_xlabel('Day') ax.set_ylabel('Temperature (degF)') ax.set_xticks(range(1, 366), 10) ax.legend();
The output will be as follows:
Activity 10: Autoregressors
Solution
Plot the complete set of average temperature values (df.TempAvgF) with years on the x axis:
plt.figure(figsize=(10, 7)) plt.plot(df.TempAvgF.values); yrs = [yr for yr in df.Year.unique()] plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs); plt.title('Austin Texas Average Daily Temperature'); plt.xlabel('Year'); plt.ylabel('Temperature (F)');
The output will be:
Create a 20-day lag and plot the lagged data on the original dataset:
plt.figure(figsize=(10, 7)) plt.plot(df.TempAvgF.values, label='Original Dataset'); plt.plot(df.TempAvgF.shift(20), c='r', linestyle='--', label='Lag 20'); yrs = [yr for yr in df.Year.unique()] plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs); plt.title('Austin Texas Average Daily Temperature'); plt.xlabel('Year'); plt.ylabel('Temperature (F)'); plt.legend();
The output will be:
Construct an autocorrelation plot to see whether the average temperature can be used with an autoregressor. Where is the lag acceptable and where is it not for an autoregressor?
plt.figure(figsize=(10, 7)) pd.plotting.autocorrelation_plot(df.TempAvgF);
We'll get the following output:
The lag is acceptable only when the autocorrelation line lies outside the 99% confidence bounds, as represented by the dashed lines.
Chose an acceptable lag and an unacceptable lag and construct lag plots using these values for acceptable lag:
plt.figure(figsize=(10,7)) ax = pd.plotting.lag_plot(df.TempAvgF, lag=5);
We'll get the following output:
Use these values for unacceptable lag:
plt.figure(figsize=(10,7)) ax = pd.plotting.lag_plot(df.TempAvgF, lag=1000);
We'll get the following output:
Create an autoregressor model, note the selected lag, calculate the R2 value, and plot the autoregressor model with the original plot. The model is to project past the available data by 1,000 samples:
from statsmodels.tsa.ar_model import AR model = AR(df.TempAvgF)
Fit the model to the data:
model_fit = model.fit() print('Lag: %s' % model_fit.k_ar) print('Coefficients: %s' % model_fit.params)
The output will be:
Lag: 23 Coefficients: const 1.909395 L1.TempAvgF 0.912076 L2.TempAvgF -0.334043 L3.TempAvgF 0.157353 L4.TempAvgF 0.025721 L5.TempAvgF 0.041342 L6.TempAvgF 0.030831 L7.TempAvgF -0.021230 L8.TempAvgF 0.020324 L9.TempAvgF 0.025147 L10.TempAvgF 0.059739 L11.TempAvgF -0.017337 L12.TempAvgF 0.043553 L13.TempAvgF -0.027795 L14.TempAvgF 0.053547 L15.TempAvgF 0.013070 L16.TempAvgF -0.033157 L17.TempAvgF -0.000072 L18.TempAvgF -0.026307 L19.TempAvgF 0.025258 L20.TempAvgF 0.038341 L21.TempAvgF 0.007885 L22.TempAvgF -0.008889 L23.TempAvgF -0.011080 dtype: float64
Create a set of predictions for 1,000 days after the last sample:
predictions = model_fit.predict(start=model_fit.k_ar, end=len(df) + 1000) predictions[:10].values
The output will be:
array([54.81171857, 56.89097085, 56.41891585, 50.98627626, 56.11843512, 53.20665111, 55.13941554, 58.4679288 , 61.92497136, 49.46049801])
Plot the predictions, as well as the original dataset:
plt.figure(figsize=(10, 7)) plt.plot(df.TempAvgF.values, label='Original Dataset'); plt.plot(predictions, c='g', linestyle=':', label='Predictions'); yrs = [yr for yr in df.Year.unique()] plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs); plt.title('Austin Texas Average Daily Temperature'); plt.xlabel('Year'); plt.ylabel('Temperature (F)'); plt.legend();
The output will be:
Enhance the view to look for differences by showing the 100th to 200th samples:
plt.figure(figsize=(10, 7)) plt.plot(df.TempAvgF.values, label='Original Dataset'); plt.plot(predictions, c='g', linestyle=':', label='Predictions'); yrs = [yr for yr in df.Year.unique()] plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs); plt.title('Austin Texas Average Daily Temperature'); plt.xlabel('Year'); plt.ylabel('Temperature (F)'); plt.xlim([100, 200]) plt.legend();
We'll get the following output: