Applied Supervised Learning with Python

You're reading from Applied Supervised Learning with Python: Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning.

Product type: Paperback
Published: Apr 2019
ISBN-13: 9781789954920
Length: 404 pages
Edition: 1st Edition
Authors (2): Ishita Mathur, Benjamin Johnston

Chapter 5: Ensemble Modeling


Activity 14: Stacking with Standalone and Ensemble Algorithms

Solution

  1. Import the relevant libraries:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    
    %matplotlib inline
    import matplotlib.pyplot as plt
    
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import KFold
    
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
  2. Read the data and print the first five rows:

    data = pd.read_csv('house_prices.csv')
    data.head()

    The output will be as follows:

    Figure 5.19: The first 5 rows

  3. Preprocess the dataset to remove null values and one-hot encode categorical variables to prepare the data for modeling.

    First, we remove all columns where more than 10% of the values are null. To do this, calculate the fraction of missing values by using the .isnull() method to get a mask DataFrame and apply the .mean() method to get the fraction of null values in each column. Multiply the result by 100 to get the series as percentage values.

    Then, find the subset of the series having a percentage value lower than 10 and save the index (which will give us the column names) as a list. Print the list to see the columns we get:

    perc_missing = data.isnull().mean()*100
    cols = perc_missing[perc_missing < 10].index.tolist() 
    cols

    The output will be:

    Figure 5.20: Output of preprocessing the dataset

    As the first column is id, we will exclude this column as well, since it will not add any value to the model.

    We will subset the data to include all columns in the cols list except the first element, which is id:

    data = data.loc[:, cols[1:]]

    For the categorical variables, we replace null values with a string, NA, and one-hot encode the columns using pandas' .get_dummies() method, while for the numerical variables we will replace the null values with -1. Then, we combine the numerical and categorical columns to get the final DataFrame:

    data_obj = pd.get_dummies(data.select_dtypes(include=['object']).fillna('NA'))
    data_num = data.select_dtypes(include=[np.number]).fillna(-1)
    
    data_final = pd.concat([data_obj, data_num], axis=1)
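
    As an optional sanity check (not part of the original activity), we can confirm that the combined DataFrame no longer contains any null values before moving on:

    # Quick check: every null was filled with either 'NA' or -1 above
    print(data_final.shape)
    print(data_final.isnull().sum().sum())   # expected to print 0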
  4. Divide the dataset into train and validation DataFrames.

    We use scikit-learn's train_test_split() method to divide the final DataFrame into training and validation sets in the ratio 4:1. We further split each of the two sets into their respective x and y values to represent the features and target variable respectively:

    train, val = train_test_split(data_final, test_size=0.2, random_state=11)
    
    x_train = train.drop(columns=['SalePrice'])
    y_train = train['SalePrice'].values
    
    x_val = val.drop(columns=['SalePrice'])
    y_val = val['SalePrice'].values
  5. Initialize dictionaries in which to store train and validation MAE values. We will create two dictionaries, one for the MAE values on the train dataset and one for those on the validation dataset:

    train_mae_values, val_mae_values = {}, {}
  6. Train a decision tree model and save the scores. We will use scikit-learn's DecisionTreeRegressor class to train a regression model using a single decision tree:

    # Decision Tree
    
    dt_params = {
        'criterion': 'mae',
        'min_samples_leaf': 10,
        'random_state': 11
    }
    
    dt = DecisionTreeRegressor(**dt_params)
    
    dt.fit(x_train, y_train)
    dt_preds_train = dt.predict(x_train)
    dt_preds_val = dt.predict(x_val)
    
    train_mae_values['dt'] = mean_absolute_error(y_true=y_train, y_pred=dt_preds_train)
    val_mae_values['dt'] = mean_absolute_error(y_true=y_val, y_pred=dt_preds_val)
  7. Train a k-nearest neighbors model and save the scores. We will use scikit-learn's KNeighborsRegressor class to train a regression model with k=5:

    # k-Nearest Neighbors
    
    knn_params = {
        'n_neighbors': 5
    }
    
    knn = KNeighborsRegressor(**knn_params)
    
    knn.fit(x_train, y_train)
    knn_preds_train = knn.predict(x_train)
    knn_preds_val = knn.predict(x_val)
    
    train_mae_values['knn'] = mean_absolute_error(y_true=y_train, y_pred=knn_preds_train)
    val_mae_values['knn'] = mean_absolute_error(y_true=y_val, y_pred=knn_preds_val)
  8. Train a Random Forest model and save the scores. We will use scikit-learn's RandomForestRegressor class to train a regression model using bagging:

    # Random Forest
    
    rf_params = {
        'n_estimators': 50,
        'criterion': 'mae',
        'max_features': 'sqrt',
        'min_samples_leaf': 10,
        'random_state': 11,
        'n_jobs': -1
    }
    
    rf = RandomForestRegressor(**rf_params)
    
    rf.fit(x_train, y_train)
    rf_preds_train = rf.predict(x_train)
    rf_preds_val = rf.predict(x_val)
    
    train_mae_values['rf'] = mean_absolute_error(y_true=y_train, y_pred=rf_preds_train)
    val_mae_values['rf'] = mean_absolute_error(y_true=y_val, y_pred=rf_preds_val)
  9. Train a gradient boosting model and save the scores. We will use scikit-learn's GradientBoostingRegressor class to train a boosted regression model:

    # Gradient Boosting
    
    gbr_params = {
        'n_estimators': 50,
        'criterion': 'mae',
        'max_features': 'sqrt',
        'max_depth': 3,
        'min_samples_leaf': 5,
        'random_state': 11
    }
    
    gbr = GradientBoostingRegressor(**gbr_params)
    
    gbr.fit(x_train, y_train)
    gbr_preds_train = gbr.predict(x_train)
    gbr_preds_val = gbr.predict(x_val)
    
    train_mae_values['gbr'] = mean_absolute_error(y_true=y_train, y_pred=gbr_preds_train)
    val_mae_values['gbr'] = mean_absolute_error(y_true=y_val, y_pred=gbr_preds_val)
  10. Prepare the training and validation datasets, with the four base estimators using the same hyperparameters that were used in the previous steps. We will create a num_base_predictors variable that represents the number of base estimators in the stacked model, to help calculate the shape of the training and validation datasets. This step can be coded almost identically to the exercise in the chapter, with a different number (and type) of base estimators.

  11. First, we create a new training set with additional columns for predictions from base predictors, in the same way as was done previously:

    num_base_predictors = len(train_mae_values) # 4
    
    x_train_with_metapreds = np.zeros((x_train.shape[0], x_train.shape[1]+num_base_predictors))
    x_train_with_metapreds[:, :-num_base_predictors] = x_train
    x_train_with_metapreds[:, -num_base_predictors:] = -1

    Then, we train the base models using the k-fold strategy. We save the predictions in each iteration in a list, and iterate over the list to assign the predictions to the columns in that fold:

    kf = KFold(n_splits=5, shuffle=True, random_state=11)
    
    for train_indices, val_indices in kf.split(x_train):
        kfold_x_train, kfold_x_val = x_train.iloc[train_indices], x_train.iloc[val_indices]
        kfold_y_train, kfold_y_val = y_train[train_indices], y_train[val_indices]
        
        predictions = []
        
        dt = DecisionTreeRegressor(**dt_params)
        dt.fit(kfold_x_train, kfold_y_train)
        predictions.append(dt.predict(kfold_x_val))
    
        knn = KNeighborsRegressor(**knn_params)
        knn.fit(kfold_x_train, kfold_y_train)
        predictions.append(knn.predict(kfold_x_val))
    
        rf = RandomForestRegressor(**rf_params)
        rf.fit(kfold_x_train, kfold_y_train)
        predictions.append(rf.predict(kfold_x_val))
    
        gbr = GradientBoostingRegressor(**gbr_params)
        gbr.fit(kfold_x_train, kfold_y_train)
        predictions.append(gbr.predict(kfold_x_val))
        
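        # Note: with the -(i+1) index, predictions[0] (the decision tree) fills
        # the last meta-feature column, predictions[1] the second-to-last, and
        # so on; the validation set below is filled in the same order, so the
        # meta-feature columns stay aligned between train and validation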
        for i, preds in enumerate(predictions):
            x_train_with_metapreds[val_indices, -(i+1)] = preds

    After that, we create a new validation set with additional columns for predictions from base predictors:

    x_val_with_metapreds = np.zeros((x_val.shape[0], x_val.shape[1]+num_base_predictors))
    x_val_with_metapreds[:, :-num_base_predictors] = x_val
    x_val_with_metapreds[:, -num_base_predictors:] = -1
  12. Lastly, we fit the base models on the complete training set to get meta features for the validation set:

    predictions = []
        
    dt = DecisionTreeRegressor(**dt_params)
    dt.fit(x_train, y_train)
    predictions.append(dt.predict(x_val))
    
    knn = KNeighborsRegressor(**knn_params)
    knn.fit(x_train, y_train)
    predictions.append(knn.predict(x_val))
    
    rf = RandomForestRegressor(**rf_params)
    rf.fit(x_train, y_train)
    predictions.append(rf.predict(x_val))
    
    gbr = GradientBoostingRegressor(**gbr_params)
    gbr.fit(x_train, y_train)
    predictions.append(gbr.predict(x_val))
    
    for i, preds in enumerate(predictions):
        x_val_with_metapreds[:, -(i+1)] = preds
  13. Train a linear regression model as the stacked model. To train the stacked model, we train the linear regression model on all the columns of the training dataset, plus the meta predictions from the base estimators. We then use the final predictions to calculate the MAE values, which we store in the same train_mae_values and val_mae_values dictionaries:

    lr = LinearRegression()
    lr.fit(x_train_with_metapreds, y_train)
    lr_preds_train = lr.predict(x_train_with_metapreds)
    lr_preds_val = lr.predict(x_val_with_metapreds)
    
    train_mae_values['lr'] = mean_absolute_error(y_true=y_train, y_pred=lr_preds_train)
    val_mae_values['lr'] = mean_absolute_error(y_true=y_val, y_pred=lr_preds_val)
  14. Visualize the train and validation errors for each individual model and the stacked model. First, we convert the dictionaries into two series and combine them to form two columns of a pandas DataFrame:

    mae_scores = pd.concat([pd.Series(train_mae_values, name='train'), 
                            pd.Series(val_mae_values, name='val')], 
                           axis=1)
    mae_scores

    The output will be as follows:

    Figure 5.21: The train and validation errors for each individual model and the stacked model

  15. We then plot a bar chart from this DataFrame to visualize the MAE values for the train and validation sets using each model:

    mae_scores.plot(kind='bar', figsize=(10,7))
    plt.ylabel('MAE')
    plt.xlabel('Model')
    plt.show()

    The output will be as follows:

    Figure 5.22: Bar chart visualizing the MAE values

As we can see in the plot, the linear regression stacked model has the lowest value of mean absolute error on both training and validation datasets, even compared to the other ensemble models (Random Forest and gradient boosted regressor).
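
As a closing aside (not part of the original activity), newer versions of scikit-learn (0.22 and later) include a StackingRegressor class that expresses the same scheme far more compactly. The following is a minimal sketch, assuming the same base estimators and hyperparameters as above; the exact MAE will differ slightly from the manual approach because of how the folds are drawn, and on recent scikit-learn versions the 'mae' criterion must be spelled 'absolute_error':

    from sklearn.ensemble import StackingRegressor

    # The four base estimators feed a linear regression meta-model;
    # passthrough=True also hands the original features to the meta-model,
    # mirroring the manual stacking above
    stack = StackingRegressor(
        estimators=[('dt', DecisionTreeRegressor(**dt_params)),
                    ('knn', KNeighborsRegressor(**knn_params)),
                    ('rf', RandomForestRegressor(**rf_params)),
                    ('gbr', GradientBoostingRegressor(**gbr_params))],
        final_estimator=LinearRegression(),
        cv=5,
        passthrough=True)

    stack.fit(x_train, y_train)
    print(mean_absolute_error(y_val, stack.predict(x_val)))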
