Applied Supervised Learning with Python

Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning

Product type: Paperback
Published: Apr 2019
ISBN-13: 9781789954920
Length: 404 pages
Edition: 1st Edition
Authors (2): Ishita Mathur, Benjamin Johnston

Chapter 5: Ensemble Modeling


Activity 14: Stacking with Standalone and Ensemble Algorithms

Solution

  1. Import the relevant libraries:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    
    %matplotlib inline
    import matplotlib.pyplot as plt
    
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import KFold
    
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
  2. Read the data and print the first five rows:

    data = pd.read_csv('house_prices.csv')
    data.head()

    The output will be as follows:

    Figure 5.19: The first 5 rows

  3. Preprocess the dataset to remove null values and one-hot encode categorical variables to prepare the data for modeling.

    First, we remove all columns where more than 10% of the values are null. To do this, calculate the fraction of missing values by using the .isnull() method to get a mask DataFrame and apply the .mean() method to get the fraction of null values in each column. Multiply the result by 100 to get the series as percentage values.

    Then, find the subset of the series having a percentage value lower than 10 and save the index (which will give us the column names) as a list. Print the list to see the columns we get:

    perc_missing = data.isnull().mean()*100
    cols = perc_missing[perc_missing < 10].index.tolist() 
    cols

    The output will be:

    Figure 5.20: Output of preprocessing the dataset

    As the first column is id, we will exclude this column as well, since it will not add any value to the model.

    We will subset the data to include all columns in the cols list except the first element, which is id:

    data = data.loc[:, cols[1:]]

    For the categorical variables, we replace null values with the string NA and one-hot encode the columns using pandas' get_dummies() function. For the numerical variables, we replace the null values with -1. Then, we combine the numerical and categorical columns to get the final DataFrame:

    data_obj = pd.get_dummies(data.select_dtypes(include=['object']).fillna('NA'))
    data_num = data.select_dtypes(include=[np.number]).fillna(-1)
    
    data_final = pd.concat([data_obj, data_num], axis=1)
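To see what this preprocessing does, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are invented for illustration and are not taken from house_prices.csv. It applies the same moves as the step above: keep only columns with less than 10% missing values, drop the leading Id column, one-hot encode the object columns, and fill numeric nulls with -1.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Alley': [np.nan, np.nan, 'Pave', np.nan],  # 75% missing, so it is dropped
    # Explicit object dtype so the dtype selection below behaves the same
    # across pandas versions
    'Street': pd.Series(['Pave', 'Grvl', 'Pave', 'Pave'], dtype=object),
    'LotArea': [8450, 9600, 11250, 9550],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# Percentage of nulls per column, then keep columns below the 10% threshold
perc_missing = toy.isnull().mean() * 100
cols = perc_missing[perc_missing < 10].index.tolist()

# Drop the leading Id column, as in the step above
toy = toy.loc[:, cols[1:]]

# One-hot encode object columns; fill numeric nulls with -1
toy_obj = pd.get_dummies(toy.select_dtypes(include=['object']).fillna('NA'))
toy_num = toy.select_dtypes(include=[np.number]).fillna(-1)
toy_final = pd.concat([toy_obj, toy_num], axis=1)

print(toy_final.columns.tolist())
```

The Alley column disappears because it exceeds the missing-value threshold, and Street is replaced by one indicator column per category.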
  4. Divide the dataset into train and validation DataFrames.

    We use scikit-learn's train_test_split() function to divide the final DataFrame into training and validation sets in the ratio 4:1. We then split each of the two sets into x and y values, which hold the features and the target variable respectively:

    train, val = train_test_split(data_final, test_size=0.2, random_state=11)
    
    x_train = train.drop(columns=['SalePrice'])
    y_train = train['SalePrice'].values
    
    x_val = val.drop(columns=['SalePrice'])
    y_val = val['SalePrice'].values
  5. Initialize two dictionaries in which to store the MAE values on the train and validation datasets:

    train_mae_values, val_mae_values = {}, {}
  6. Train a decision tree model and save the scores. We will use scikit-learn's DecisionTreeRegressor class to train a regression model using a single decision tree:

    # Decision Tree
    
    dt_params = {
        'criterion': 'mae',  # named 'absolute_error' in scikit-learn 1.0 and later
        'min_samples_leaf': 10,
        'random_state': 11
    }
    
    dt = DecisionTreeRegressor(**dt_params)
    
    dt.fit(x_train, y_train)
    dt_preds_train = dt.predict(x_train)
    dt_preds_val = dt.predict(x_val)
    
    train_mae_values['dt'] = mean_absolute_error(y_true=y_train, y_pred=dt_preds_train)
    val_mae_values['dt'] = mean_absolute_error(y_true=y_val, y_pred=dt_preds_val)
  7. Train a k-nearest neighbors model and save the scores. We will use scikit-learn's KNeighborsRegressor class to train a regression model with k=5:

    # k-Nearest Neighbors
    
    knn_params = {
        'n_neighbors': 5
    }
    
    knn = KNeighborsRegressor(**knn_params)
    
    knn.fit(x_train, y_train)
    knn_preds_train = knn.predict(x_train)
    knn_preds_val = knn.predict(x_val)
    
    train_mae_values['knn'] = mean_absolute_error(y_true=y_train, y_pred=knn_preds_train)
    val_mae_values['knn'] = mean_absolute_error(y_true=y_val, y_pred=knn_preds_val)
  8. Train a Random Forest model and save the scores. We will use scikit-learn's RandomForestRegressor class to train a regression model using bagging:

    # Random Forest
    
    rf_params = {
        'n_estimators': 50,
        'criterion': 'mae',  # named 'absolute_error' in scikit-learn 1.0 and later
        'max_features': 'sqrt',
        'min_samples_leaf': 10,
        'random_state': 11,
        'n_jobs': -1
    }
    
    rf = RandomForestRegressor(**rf_params)
    
    rf.fit(x_train, y_train)
    rf_preds_train = rf.predict(x_train)
    rf_preds_val = rf.predict(x_val)
    
    train_mae_values['rf'] = mean_absolute_error(y_true=y_train, y_pred=rf_preds_train)
    val_mae_values['rf'] = mean_absolute_error(y_true=y_val, y_pred=rf_preds_val)
  9. Train a gradient boosting model and save the scores. We will use scikit-learn's GradientBoostingRegressor class to train a boosted regression model:

    # Gradient Boosting
    
    gbr_params = {
        'n_estimators': 50,
        'criterion': 'mae',  # not accepted as a criterion by scikit-learn 1.1 and later
        'max_features': 'sqrt',
        'max_depth': 3,
        'min_samples_leaf': 5,
        'random_state': 11
    }
    
    gbr = GradientBoostingRegressor(**gbr_params)
    
    gbr.fit(x_train, y_train)
    gbr_preds_train = gbr.predict(x_train)
    gbr_preds_val = gbr.predict(x_val)
    
    train_mae_values['gbr'] = mean_absolute_error(y_true=y_train, y_pred=gbr_preds_train)
    val_mae_values['gbr'] = mean_absolute_error(y_true=y_val, y_pred=gbr_preds_val)
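Steps 6 to 9 repeat the same fit, predict, and score pattern for each model. One way to cut the repetition is a small helper. The sketch below is an illustration rather than the book's code: the fit_and_score name is hypothetical, only two of the four models are shown, and the data is synthetic (generated with scikit-learn's make_regression) so that it runs without house_prices.csv.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def fit_and_score(model, x_train, y_train, x_val, y_val):
    """Fit the model and return (train MAE, validation MAE). Hypothetical helper."""
    model.fit(x_train, y_train)
    train_mae = mean_absolute_error(y_true=y_train, y_pred=model.predict(x_train))
    val_mae = mean_absolute_error(y_true=y_val, y_pred=model.predict(x_val))
    return train_mae, val_mae


# Synthetic stand-in for the house price data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=11)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=11)

models = {
    'dt': DecisionTreeRegressor(min_samples_leaf=10, random_state=11),
    'knn': KNeighborsRegressor(n_neighbors=5),
}

# Same two dictionaries as in step 5, filled in a loop
train_mae_values, val_mae_values = {}, {}
for name, model in models.items():
    train_mae_values[name], val_mae_values[name] = fit_and_score(
        model, x_train, y_train, x_val, y_val)

print(train_mae_values)
print(val_mae_values)
```

Keeping the models in a dictionary also means the key used for the scores always matches the model that produced them.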
  10. Prepare the training and validation datasets for the stacked model, using the four base estimators with the same hyperparameters that were used in the previous steps. We will create a num_base_predictors variable that represents the number of base estimators in the stacked model, which helps calculate the shape of the datasets for training and validation. This step can be coded almost identically to the exercise in the chapter, with a different number (and type) of base estimators.

  11. First, we create a new training set with additional columns for predictions from base predictors, in the same way as was done previously:

    num_base_predictors = len(train_mae_values) # 4
    
    x_train_with_metapreds = np.zeros((x_train.shape[0], x_train.shape[1]+num_base_predictors))
    x_train_with_metapreds[:, :-num_base_predictors] = x_train
    x_train_with_metapreds[:, -num_base_predictors:] = -1

    Then, we train the base models using the k-fold strategy. We save the predictions in each iteration in a list, and iterate over the list to assign the predictions to the columns in that fold:

    # Folds are taken in order (no shuffling), so no random_state is needed
    kf = KFold(n_splits=5)
    
    for train_indices, val_indices in kf.split(x_train):
        kfold_x_train, kfold_x_val = x_train.iloc[train_indices], x_train.iloc[val_indices]
        kfold_y_train, kfold_y_val = y_train[train_indices], y_train[val_indices]
        
        predictions = []
        
        dt = DecisionTreeRegressor(**dt_params)
        dt.fit(kfold_x_train, kfold_y_train)
        predictions.append(dt.predict(kfold_x_val))
    
        knn = KNeighborsRegressor(**knn_params)
        knn.fit(kfold_x_train, kfold_y_train)
        predictions.append(knn.predict(kfold_x_val))
    
        rf = RandomForestRegressor(**rf_params)
        rf.fit(kfold_x_train, kfold_y_train)
        predictions.append(rf.predict(kfold_x_val))
    
        gbr = GradientBoostingRegressor(**gbr_params)
        gbr.fit(kfold_x_train, kfold_y_train)
        predictions.append(gbr.predict(kfold_x_val))
        
        for i, preds in enumerate(predictions):
            x_train_with_metapreds[val_indices, -(i+1)] = preds

    After that, we create a new validation set with additional columns for predictions from base predictors:

    x_val_with_metapreds = np.zeros((x_val.shape[0], x_val.shape[1]+num_base_predictors))
    x_val_with_metapreds[:, :-num_base_predictors] = x_val
    x_val_with_metapreds[:, -num_base_predictors:] = -1
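The k-fold loop above gives each training row a base-model prediction from a model that never saw that row during fitting. The sketch below shows the same idea in compact form, looping over a list of models instead of repeating the fit and predict calls for each one. It is an illustration under stated assumptions: out_of_fold_predictions is a hypothetical helper name, the data is synthetic, and only two base models are used to keep it short.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def out_of_fold_predictions(models, X, y, n_splits=5):
    """Return an (n_samples, n_models) array of out-of-fold predictions."""
    meta = np.full((X.shape[0], len(models)), np.nan)
    kf = KFold(n_splits=n_splits)
    for train_idx, val_idx in kf.split(X):
        for j, model in enumerate(models):
            # clone() gives a fresh, unfitted copy for every fold
            fold_model = clone(model)
            fold_model.fit(X[train_idx], y[train_idx])
            # Only the held-out rows of this fold are filled in
            meta[val_idx, j] = fold_model.predict(X[val_idx])
    return meta


# Synthetic stand-in for the training features and target
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=11)
base_models = [
    DecisionTreeRegressor(min_samples_leaf=10, random_state=11),
    KNeighborsRegressor(n_neighbors=5),
]

meta_features = out_of_fold_predictions(base_models, X, y)

# Original features followed by one meta-prediction column per base model
X_with_metapreds = np.hstack([X, meta_features])
print(X_with_metapreds.shape)
```

Unlike the loop above, which fills the trailing columns in reverse order, this sketch stores the predictions of model j in column j of the meta-feature block. Either ordering works as long as the training and validation sets use the same one.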
  12. Lastly, we fit the base models on the complete training set to get meta features for the validation set:

    predictions = []
        
    dt = DecisionTreeRegressor(**dt_params)
    dt.fit(x_train, y_train)
    predictions.append(dt.predict(x_val))
    
    knn = KNeighborsRegressor(**knn_params)
    knn.fit(x_train, y_train)
    predictions.append(knn.predict(x_val))
    
    rf = RandomForestRegressor(**rf_params)
    rf.fit(x_train, y_train)
    predictions.append(rf.predict(x_val))
    
    gbr = GradientBoostingRegressor(**gbr_params)
    gbr.fit(x_train, y_train)
    predictions.append(gbr.predict(x_val))
    
    for i, preds in enumerate(predictions):
        x_val_with_metapreds[:, -(i+1)] = preds
  13. Train a linear regression model as the stacked model. To train the stacked model, we fit the linear regression model on all the columns of the training dataset, plus the meta predictions from the base estimators. We then use the final predictions to calculate the MAE values, which we store in the same train_mae_values and val_mae_values dictionaries:

    lr = LinearRegression()
    lr.fit(x_train_with_metapreds, y_train)
    lr_preds_train = lr.predict(x_train_with_metapreds)
    lr_preds_val = lr.predict(x_val_with_metapreds)
    
    train_mae_values['lr'] = mean_absolute_error(y_true=y_train, y_pred=lr_preds_train)
    val_mae_values['lr'] = mean_absolute_error(y_true=y_val, y_pred=lr_preds_val)
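For comparison, scikit-learn 0.22 and later include a StackingRegressor class that performs a similar procedure: it builds cross-validated predictions from the base estimators and trains a final estimator on them. The sketch below is not the book's approach and is not expected to reproduce the numbers from this activity. It uses synthetic data and only two base estimators. Setting passthrough=True lets the final estimator see the original features as well as the base predictions, which mirrors the layout of x_train_with_metapreds.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the house price data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=11)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=11)

stack = StackingRegressor(
    estimators=[
        ('dt', DecisionTreeRegressor(min_samples_leaf=10, random_state=11)),
        ('knn', KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=LinearRegression(),
    cv=5,              # out-of-fold predictions come from 5-fold cross-validation
    passthrough=True,  # final estimator also receives the original features
)
stack.fit(x_train, y_train)

val_preds = stack.predict(x_val)
val_mae = mean_absolute_error(y_true=y_val, y_pred=val_preds)
print(val_mae)
```

Writing the procedure by hand, as in this activity, makes each stage visible. The built-in class trades that visibility for less code and fewer chances of a copy-paste slip.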
  14. Visualize the train and validation errors for each individual model and the stacked model. First, we convert the dictionaries into two series and combine them to form two columns of a pandas DataFrame:

    mae_scores = pd.concat([pd.Series(train_mae_values, name='train'), 
                            pd.Series(val_mae_values, name='val')], 
                           axis=1)
    mae_scores

    The output will be as follows:

    Figure 5.21: The train and validation errors for each individual model and the stacked model

  15. We then plot a bar chart from this DataFrame to visualize the MAE values for the train and validation sets using each model:

    mae_scores.plot(kind='bar', figsize=(10,7))
    plt.ylabel('MAE')
    plt.xlabel('Model')
    plt.show()

    The output will be as follows:

    Figure 5.22: Bar chart visualizing the MAE values

As we can see in the plot, the linear regression stacked model has the lowest value of mean absolute error on both training and validation datasets, even compared to the other ensemble models (Random Forest and gradient boosted regressor).
