Chapter 5: Ensemble Modeling
Activity 14: Stacking with Standalone and Ensemble Algorithms
Solution
Import the relevant libraries:
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
Read the data and print the first five rows:
data = pd.read_csv('house_prices.csv')
data.head()
The output will be as follows:
Preprocess the dataset to remove null values and one-hot encode categorical variables to prepare the data for modeling.
First, we remove all columns where more than 10% of the values are null. To do this, calculate the fraction of missing values by using the .isnull() method to get a mask DataFrame and apply the .mean() method to get the fraction of null values in each column. Multiply the result by 100 to get the series as percentage values.
Then, find the subset of the series having a percentage value lower than 10 and save the index (which will give us the column names) as a list. Print the list to see the columns we get:
perc_missing = data.isnull().mean()*100
cols = perc_missing[perc_missing < 10].index.tolist()
cols
The output will be:
As the first column is id, we will exclude this column as well, since it will not add any value to the model.
We will subset the data to include all columns in the cols list except the first element, which is id:
data = data.loc[:, cols[1:]]
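The missing-value filter above can be tried on a small toy DataFrame to see which columns survive. This is only a sketch: the column names a and b and the data are hypothetical and are not taken from the house prices dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' has 1 null in 20 rows (5%), 'b' has 4 nulls (20%)
toy = pd.DataFrame({
    'id': range(20),
    'a': [np.nan] + [1.0] * 19,
    'b': [np.nan] * 4 + [2.0] * 16,
})

# Percentage of null values in each column
perc_missing = toy.isnull().mean() * 100

# Keep only the columns with less than 10% nulls
cols = perc_missing[perc_missing < 10].index.tolist()

# Drop the first kept column ('id'), as in the step above
toy_subset = toy.loc[:, cols[1:]]
print(cols)
print(toy_subset.columns.tolist())
```

Here b is dropped because a fifth of its values are missing, while a is kept even though it still contains one null, which is why the next step fills the remaining nulls.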
For the categorical variables, we replace null values with a string, NA, and one-hot encode the columns using pandas' .get_dummies() method, while for the numerical variables we will replace the null values with -1. Then, we combine the numerical and categorical columns to get the final DataFrame:
# 'object' is used instead of np.object, which has been removed from recent NumPy releases
data_obj = pd.get_dummies(data.select_dtypes(include=['object']).fillna('NA'))
data_num = data.select_dtypes(include=[np.number]).fillna(-1)
data_final = pd.concat([data_obj, data_num], axis=1)
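To see what this preprocessing produces, the same three calls can be run on a tiny DataFrame. This is a minimal sketch: the color and area columns are hypothetical, and the explicit dtype=object is an assumption made so that the string column is selected in the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one categorical and one numerical column
toy = pd.DataFrame({
    'color': pd.Series(['red', None, 'blue'], dtype=object),
    'area': [10.0, np.nan, 30.0],
})

# Categorical columns: fill nulls with the string 'NA', then one-hot encode
toy_obj = pd.get_dummies(toy.select_dtypes(include=['object']).fillna('NA'))

# Numerical columns: fill nulls with -1
toy_num = toy.select_dtypes(include=[np.number]).fillna(-1)

# Combine the encoded and numerical columns
toy_final = pd.concat([toy_obj, toy_num], axis=1)
print(toy_final)
```

The missing category becomes its own indicator column (color_NA), so the model can treat "missing" as a level of the variable rather than losing the row.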
Divide the dataset into train and validation DataFrames.
We use scikit-learn's train_test_split() method to divide the final DataFrame into training and validation sets in the ratio 4:1. We further split each of the two sets into their respective x and y values to represent the features and target variable respectively:
train, val = train_test_split(data_final, test_size=0.2, random_state=11)

x_train = train.drop(columns=['SalePrice'])
y_train = train['SalePrice'].values

x_val = val.drop(columns=['SalePrice'])
y_val = val['SalePrice'].values
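The 4:1 split can be checked on a toy DataFrame before it is applied to the real data. In this sketch, the feature and target column names are hypothetical stand-ins for the dataset's features and SalePrice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows
toy = pd.DataFrame({'feature': np.arange(10), 'target': np.arange(10) * 2.0})

# test_size=0.2 keeps 80% of the rows for training and 20% for validation
toy_train, toy_val = train_test_split(toy, test_size=0.2, random_state=11)

# Separate the features from the target in each subset
toy_x_train, toy_y_train = toy_train.drop(columns=['target']), toy_train['target'].values
toy_x_val, toy_y_val = toy_val.drop(columns=['target']), toy_val['target'].values

print(toy_x_train.shape, toy_x_val.shape)  # (8, 1) (2, 1)
```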
Initialize dictionaries in which to store train and validation MAE values. We create two dictionaries, one for the MAE values on the training set and one for the MAE values on the validation set:
train_mae_values, val_mae_values = {}, {}
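As a reminder of what these dictionaries will hold, the MAE is the mean of the absolute differences between the true and predicted values. The sketch below, with made-up numbers, computes it by hand and compares the result with scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted values
y_true = np.array([200.0, 150.0, 300.0])
y_pred = np.array([210.0, 140.0, 330.0])

# Manual computation: (10 + 10 + 30) / 3
manual_mae = np.mean(np.abs(y_true - y_pred))

# The same quantity using scikit-learn
sklearn_mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print(manual_mae, sklearn_mae)
```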
Train a decision tree model and save the scores. We will use scikit-learn's DecisionTreeRegressor class to train a regression model using a single decision tree:
# Decision Tree
dt_params = {
    'criterion': 'mae',  # renamed to 'absolute_error' in scikit-learn 1.0 and later
    'min_samples_leaf': 10,
    'random_state': 11
}

dt = DecisionTreeRegressor(**dt_params)
dt.fit(x_train, y_train)

dt_preds_train = dt.predict(x_train)
dt_preds_val = dt.predict(x_val)

train_mae_values['dt'] = mean_absolute_error(y_true=y_train, y_pred=dt_preds_train)
val_mae_values['dt'] = mean_absolute_error(y_true=y_val, y_pred=dt_preds_val)
Train a k-nearest neighbors model and save the scores. We will use scikit-learn's KNeighborsRegressor class to train a regression model with k=5:
# k-Nearest Neighbors
knn_params = {
    'n_neighbors': 5
}

knn = KNeighborsRegressor(**knn_params)
knn.fit(x_train, y_train)

knn_preds_train = knn.predict(x_train)
knn_preds_val = knn.predict(x_val)

train_mae_values['knn'] = mean_absolute_error(y_true=y_train, y_pred=knn_preds_train)
val_mae_values['knn'] = mean_absolute_error(y_true=y_val, y_pred=knn_preds_val)
Train a Random Forest model and save the scores. We will use scikit-learn's RandomForestRegressor class to train a regression model using bagging:
# Random Forest
rf_params = {
    'n_estimators': 50,
    'criterion': 'mae',  # renamed to 'absolute_error' in scikit-learn 1.0 and later
    'max_features': 'sqrt',
    'min_samples_leaf': 10,
    'random_state': 11,
    'n_jobs': -1
}

rf = RandomForestRegressor(**rf_params)
rf.fit(x_train, y_train)

rf_preds_train = rf.predict(x_train)
rf_preds_val = rf.predict(x_val)

train_mae_values['rf'] = mean_absolute_error(y_true=y_train, y_pred=rf_preds_train)
val_mae_values['rf'] = mean_absolute_error(y_true=y_val, y_pred=rf_preds_val)
Train a gradient boosting model and save the scores. We will use scikit-learn's GradientBoostingRegressor class to train a boosted regression model:
# Gradient Boosting
gbr_params = {
    'n_estimators': 50,
    # Newer scikit-learn releases no longer accept 'mae' here;
    # 'friedman_mse' (the default) and 'squared_error' are the supported options
    'criterion': 'mae',
    'max_features': 'sqrt',
    'max_depth': 3,
    'min_samples_leaf': 5,
    'random_state': 11
}

gbr = GradientBoostingRegressor(**gbr_params)
gbr.fit(x_train, y_train)

gbr_preds_train = gbr.predict(x_train)
gbr_preds_val = gbr.predict(x_val)

train_mae_values['gbr'] = mean_absolute_error(y_true=y_train, y_pred=gbr_preds_train)
val_mae_values['gbr'] = mean_absolute_error(y_true=y_val, y_pred=gbr_preds_val)
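The four training steps above repeat the same fit, predict, and score pattern. One way to reduce the repetition is a small helper that is called in a loop. The sketch below is not part of the original solution: fit_and_score is a hypothetical helper, the data is synthetic, and only two of the four models are shown to keep it short.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def fit_and_score(model, x_tr, y_tr, x_va, y_va):
    """Hypothetical helper: fit a model and return (train MAE, validation MAE)."""
    model.fit(x_tr, y_tr)
    train_mae = mean_absolute_error(y_true=y_tr, y_pred=model.predict(x_tr))
    val_mae = mean_absolute_error(y_true=y_va, y_pred=model.predict(x_va))
    return train_mae, val_mae


# Synthetic regression data (an assumption, standing in for the house prices data)
rng = np.random.RandomState(11)
x = rng.rand(200, 3)
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=200)
x_tr, x_va, y_tr, y_va = x[:160], x[160:], y[:160], y[160:]

models = {
    'dt': DecisionTreeRegressor(min_samples_leaf=10, random_state=11),
    'knn': KNeighborsRegressor(n_neighbors=5),
}

# Collect the scores in the same two-dictionary layout used in this activity
train_mae, val_mae = {}, {}
for name, model in models.items():
    train_mae[name], val_mae[name] = fit_and_score(model, x_tr, y_tr, x_va, y_va)

print(train_mae, val_mae)
```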
Prepare the training and validation datasets with meta features from the four base estimators, which keep the same hyperparameters that were used in the previous steps. We will create a num_base_predictors variable that represents the number of base estimators we have in the stacked model to help calculate the shape of the datasets for training and validation. This step can be coded almost identically to the exercise in the chapter, with a different number (and type) of base estimators.
First, we create a new training set with additional columns for predictions from base predictors, in the same way as was done previously:
num_base_predictors = len(train_mae_values) # 4

x_train_with_metapreds = np.zeros((x_train.shape[0], x_train.shape[1]+num_base_predictors))
x_train_with_metapreds[:, :-num_base_predictors] = x_train
x_train_with_metapreds[:, -num_base_predictors:] = -1
Then, we train the base models using the k-fold strategy. We save the predictions in each iteration in a list, and iterate over the list to assign the predictions to the columns in that fold:
# Without shuffle=True the folds are deterministic, so random_state is not passed
kf = KFold(n_splits=5)

for train_indices, val_indices in kf.split(x_train):
    kfold_x_train, kfold_x_val = x_train.iloc[train_indices], x_train.iloc[val_indices]
    kfold_y_train, kfold_y_val = y_train[train_indices], y_train[val_indices]

    predictions = []

    dt = DecisionTreeRegressor(**dt_params)
    dt.fit(kfold_x_train, kfold_y_train)
    predictions.append(dt.predict(kfold_x_val))

    knn = KNeighborsRegressor(**knn_params)
    knn.fit(kfold_x_train, kfold_y_train)
    predictions.append(knn.predict(kfold_x_val))

    rf = RandomForestRegressor(**rf_params)
    rf.fit(kfold_x_train, kfold_y_train)
    predictions.append(rf.predict(kfold_x_val))

    gbr = GradientBoostingRegressor(**gbr_params)
    gbr.fit(kfold_x_train, kfold_y_train)
    predictions.append(gbr.predict(kfold_x_val))

    # Write each model's out-of-fold predictions into its meta-feature column
    for i, preds in enumerate(predictions):
        x_train_with_metapreds[val_indices, -(i+1)] = preds
After that, we create a new validation set with additional columns for predictions from base predictors:
x_val_with_metapreds = np.zeros((x_val.shape[0], x_val.shape[1]+num_base_predictors))
x_val_with_metapreds[:, :-num_base_predictors] = x_val
x_val_with_metapreds[:, -num_base_predictors:] = -1
Lastly, we fit the base models on the complete training set to get meta features for the validation set:
predictions = []

dt = DecisionTreeRegressor(**dt_params)
dt.fit(x_train, y_train)
predictions.append(dt.predict(x_val))

knn = KNeighborsRegressor(**knn_params)
knn.fit(x_train, y_train)
predictions.append(knn.predict(x_val))

rf = RandomForestRegressor(**rf_params)
rf.fit(x_train, y_train)
predictions.append(rf.predict(x_val))

gbr = GradientBoostingRegressor(**gbr_params)
gbr.fit(x_train, y_train)
predictions.append(gbr.predict(x_val))

# Use the same column order as for the training meta features
for i, preds in enumerate(predictions):
    x_val_with_metapreds[:, -(i+1)] = preds
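The out-of-fold loop used for the training meta features can also be expressed with scikit-learn's cross_val_predict, which fits a fresh copy of the model on each training fold and returns the predictions for the held-out rows. The following is a sketch on synthetic data with only two base models, offered as an alternative to the manual loop rather than as the method used in this solution.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, standing in for x_train and y_train)
rng = np.random.RandomState(11)
x = rng.rand(100, 3)
y = x.sum(axis=1) + rng.normal(scale=0.1, size=100)

base_models = [
    DecisionTreeRegressor(min_samples_leaf=10, random_state=11),
    KNeighborsRegressor(n_neighbors=5),
]
kf = KFold(n_splits=5)

# One column of out-of-fold predictions per base model
oof = np.column_stack([cross_val_predict(m, x, y, cv=kf) for m in base_models])

# Append the meta features to the original features
x_with_metapreds = np.hstack([x, oof])
print(x_with_metapreds.shape)  # (100, 5)
```

Note that this sketch appends the meta-feature columns in model order, whereas the loop above fills them from the last column backwards. Either order works as long as the training and validation sets use the same one.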
Train a linear regression model as the stacked model. To train the stacked model, we train the linear regression model on all the columns of the training dataset, plus the meta predictions from the base estimators. We then use the final predictions to calculate the MAE values, which we store in the same train_mae_values and val_mae_values dictionaries:
# normalize=False was the default and the parameter no longer exists in recent scikit-learn releases
lr = LinearRegression()
lr.fit(x_train_with_metapreds, y_train)

lr_preds_train = lr.predict(x_train_with_metapreds)
lr_preds_val = lr.predict(x_val_with_metapreds)

train_mae_values['lr'] = mean_absolute_error(y_true=y_train, y_pred=lr_preds_train)
val_mae_values['lr'] = mean_absolute_error(y_true=y_val, y_pred=lr_preds_val)
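Recent scikit-learn releases (0.22 and later) also provide a StackingRegressor class that wraps the same idea: it builds cross-validated meta features from the base estimators and, with passthrough=True, trains the final estimator on those predictions together with the original features. The sketch below uses synthetic data and two base models. It illustrates the class and is not a reproduction of the manual procedure above, so its scores will not necessarily match.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, standing in for the house prices data)
rng = np.random.RandomState(11)
x = rng.rand(200, 3)
y = x.sum(axis=1) + rng.normal(scale=0.1, size=200)
x_tr, x_va, y_tr, y_va = x[:160], x[160:], y[:160], y[160:]

stack = StackingRegressor(
    estimators=[
        ('dt', DecisionTreeRegressor(min_samples_leaf=10, random_state=11)),
        ('knn', KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=LinearRegression(),
    cv=5,              # folds used to build the out-of-fold meta features
    passthrough=True,  # also give the original features to the final estimator
)
stack.fit(x_tr, y_tr)

stack_val_mae = mean_absolute_error(y_true=y_va, y_pred=stack.predict(x_va))
print(stack_val_mae)
```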
Visualize the train and validation errors for each individual model and the stacked model. First, we convert the dictionaries into two series and combine them to form two columns of a pandas DataFrame:
mae_scores = pd.concat([pd.Series(train_mae_values, name='train'),
                        pd.Series(val_mae_values, name='val')],
                       axis=1)
mae_scores
The output will be as follows:
We then plot a bar chart from this DataFrame to visualize the MAE values for the train and validation sets using each model:
mae_scores.plot(kind='bar', figsize=(10,7))
plt.ylabel('MAE')
plt.xlabel('Model')
plt.show()
The output will be as follows:
As we can see in the plot, the linear regression stacked model has the lowest value of mean absolute error on both training and validation datasets, even compared to the other ensemble models (Random Forest and gradient boosted regressor).