Applied Supervised Learning with Python: Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning

Product type Paperback

Published in Apr 2019

Publisher

ISBN-13 9781789954920

Length 404 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Authors (2):

Ishita Mathur

Benjamin Johnston

Chapter 6: Model Evaluation

Activity 15: Final Test Project

Solution

Import the relevant libraries:

```
import pandas as pd
import numpy as np
import json

%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, confusion_matrix, precision_recall_curve)
```

Read the attrition_train.csv dataset into a DataFrame and print a summary of it with .info():
```
data = pd.read_csv('attrition_train.csv')
data.info()
```
The output will be as follows:
Figure 6.33: Output of info()
Read the JSON file with the details of the categorical variables. The JSON file contains a dictionary, where the keys are the column names of the categorical features and the corresponding values are the list of categories in the feature. This file will help us one-hot encode the categorical features into numerical features. Use the json library to load the file object into a dictionary, and print the dictionary:
```
with open('categorical_variable_values.json', 'r') as f:
    cat_values_dict = json.load(f)
cat_values_dict
```
The output will be as follows:
Figure 6.34: The JSON file
Process the dataset to convert all features to numerical values. First, find the number of columns that will stay in their original form (that is, the numerical features) and the number of columns that one-hot encoding will produce from the categorical features. data.shape[1] gives us the number of columns in data, and subtracting len(cat_values_dict) from it leaves the number of non-categorical columns (this count still includes the Attrition target). To find the number of one-hot encoded columns, we count the total number of categories across all categorical variables in the cat_values_dict dictionary:
```
num_orig_cols = data.shape[1] - len(cat_values_dict)
num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
print(num_orig_cols, num_enc_cols)
```
The output will be:
```
26 24
```
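The arithmetic can be checked on a toy example. The sketch below uses a made-up DataFrame and category dictionary (none of these column names or values are taken from the attrition dataset) to show how the two counts arise.

```python
import pandas as pd

# Hypothetical data: two numerical features, two categorical features, one target
toy = pd.DataFrame({
    'Age': [31, 45, 28],
    'Income': [4200, 6100, 3900],
    'Department': ['Sales', 'HR', 'Sales'],
    'OverTime': ['Yes', 'No', 'No'],
    'Attrition': [1, 0, 0],
})
toy_cat_values = {
    'Department': ['Sales', 'R&D', 'HR'],
    'OverTime': ['No', 'Yes'],
}

# Columns kept as they are (this count still includes the target column)
toy_num_orig = toy.shape[1] - len(toy_cat_values)
# Columns produced by one-hot encoding: one per category
toy_num_enc = sum(len(cats) for cats in toy_cat_values.values())

print(toy_num_orig, toy_num_enc)  # 3 5
```

For this toy data, the feature matrix would therefore have 3 + 5 - 1 = 7 columns once the target is set aside.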
Create a NumPy array of zeros as a placeholder, with one row per sample and as many columns as the total determined previously, minus one (because that total includes the Attrition target variable). We then create a Boolean mask that selects the numerical columns from the DataFrame and assign those columns to the first num_orig_cols-1 columns of the array, X:
```
X = np.zeros(shape=(data.shape[0], num_orig_cols+num_enc_cols-1))

mask = [(each not in cat_values_dict and each != 'Attrition') for each in data.columns]
X[:, :num_orig_cols-1] = data.loc[:, data.columns[mask]]
```
Next, we initialize the OneHotEncoder class from scikit-learn with a list that contains, for each categorical column, the list of its category values. Then, we transform the categorical columns into one-hot encoded columns, assign them to the remaining columns in X, and save the values of the target variable in the y variable:
```
cat_cols = list(cat_values_dict.keys())
cat_values = [cat_values_dict[col] for col in data[cat_cols].columns]

# Dense output; scikit-learn versions before 1.2 spell this argument sparse=False
ohe = OneHotEncoder(categories=cat_values, sparse_output=False)

X[:, num_orig_cols-1:] = ohe.fit_transform(X=data[cat_cols])
y = data.Attrition.values

print(X.shape)
print(y.shape)
```
The output will be:
```
(1176, 49)
(1176,)
```
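Passing the categories explicitly fixes both the number and the order of the encoded columns. The following sketch illustrates this on made-up values. It calls .toarray() on the encoder's default sparse output, so it does not depend on which scikit-learn version is installed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column and its full list of categories
toy_df = pd.DataFrame({'Department': ['Sales', 'HR', 'Sales']})
toy_categories = [['Sales', 'R&D', 'HR']]

toy_ohe = OneHotEncoder(categories=toy_categories)
# The default output is a sparse matrix; convert it to a dense array
toy_encoded = toy_ohe.fit_transform(toy_df[['Department']]).toarray()

# One column per listed category, in the listed order,
# even though 'R&D' never occurs in the data
print(toy_encoded)
```

The middle column stays all zeros here, which is why the width of X can be computed from the category dictionary alone.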
Choose a base model and define the range of hyperparameter values corresponding to the model to be searched over for hyperparameter tuning. Let's use a gradient boosted classifier as our model. We then define ranges of values for all hyperparameters we want to tune in the form of a dictionary:
```
meta_gbc = GradientBoostingClassifier()

param_dist = {
    'n_estimators': list(range(10, 210, 10)),
    # 'mae' and 'mse' are no longer accepted by recent scikit-learn releases
    'criterion': ['friedman_mse', 'squared_error'],
    'max_features': ['sqrt', 'log2', 0.25, 0.3, 0.5, 0.8, None],
    'max_depth': list(range(1, 10)),
    'min_samples_leaf': list(range(1, 10))
}
```
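It helps to know how large this search space is before sampling from it. The quick count sketched below mirrors the value lists above. The criterion entry is represented only by its number of options, so the snippet does not depend on which criterion names are valid.

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring param_dist
n_values = {
    'n_estimators': len(range(10, 210, 10)),   # 20 values
    'criterion': 2,
    'max_features': 7,
    'max_depth': len(range(1, 10)),            # 9 values
    'min_samples_leaf': len(range(1, 10)),     # 9 values
}

n_combinations = prod(n_values.values())
print(n_combinations)  # 22680
```

An exhaustive grid search over 22,680 combinations would be expensive under cross-validation, which is why sampling a fixed number of combinations at random is a reasonable choice here.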
Define the parameters with which to initialize the RandomizedSearchCV object and use K-fold cross-validation to find the best model hyperparameters. These include cv set to 5, so that each candidate set of hyperparameters is evaluated using 5-fold cross-validation, and n_iter set to 100, the number of candidates to sample. Then, initialize the RandomizedSearchCV object and use the .fit() method to begin the optimization:
```
rand_search_params = {
    'param_distributions': param_dist,
    'scoring': 'accuracy',
    'n_iter': 100,
    'cv': 5,
    'return_train_score': True,
    'n_jobs': -1,
    'random_state': 11
}
random_search = RandomizedSearchCV(meta_gbc, **rand_search_params)
random_search.fit(X, y)
```
The output will be as follows:
Figure 6.35: Output of the optimization process
Once the tuning is complete, find the position (iteration number) at which the highest mean test score was obtained. Find the corresponding hyperparameters and save them to a dictionary:
```
idx = np.argmax(random_search.cv_results_['mean_test_score'])
final_params = random_search.cv_results_['params'][idx]
final_params
```
The output will be:
Figure 6.36: The hyperparameters dictionary
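RandomizedSearchCV also exposes the winning configuration directly through its best_index_, best_params_, and best_score_ attributes. The sketch below runs a deliberately tiny search on synthetic data (generated with make_classification, not the attrition data) to show that the manual lookup and the built-in attributes arrive at the same best score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=11)

demo_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=11),
    param_distributions={
        'n_estimators': [10, 20, 30],
        'max_depth': [1, 2, 3],
    },
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=11,
)
demo_search.fit(X_demo, y_demo)

# Manual lookup, as in the step above
demo_scores = demo_search.cv_results_['mean_test_score']
demo_idx = np.argmax(demo_scores)
manual_params = demo_search.cv_results_['params'][demo_idx]

# Built-in shortcuts
print(manual_params, demo_scores[demo_idx])
print(demo_search.best_params_, demo_search.best_score_)
```

The manual lookup remains useful when you want to inspect other entries of cv_results_, such as the mean training score at the same index.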
Split the dataset into training and validation sets and train a new model using the final hyperparameters on the training dataset. Use scikit-learn's train_test_split() function to split X and y into training and validation components, with the validation set comprising 15% of the dataset:
```
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.15, random_state=11)
print(train_X.shape, train_y.shape, val_X.shape, val_y.shape)
```
The output will be:
```
(999, 49) (999,) (177, 49) (177,)
```
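The split sizes follow from the 15% fraction: scikit-learn rounds the held-out count up, so 0.15 × 1176 = 176.4 becomes 177 rows, leaving 999 for training. The sketch below reproduces this on placeholder arrays of the same shape; the zeros merely stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y (the values are irrelevant here)
X_shape_demo = np.zeros((1176, 49))
y_shape_demo = np.zeros(1176)

tr_X, va_X, tr_y, va_y = train_test_split(
    X_shape_demo, y_shape_demo, test_size=0.15, random_state=11
)
print(tr_X.shape, va_X.shape)  # (999, 49) (177, 49)
```

If the classes are imbalanced, passing stratify=y is one option for keeping the class ratio similar in both parts; the solution here does not do this.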
Train the gradient boosted classification model using the final hyperparameters and make predictions on the training and validation sets. Also calculate the probability on the validation set:
```
gbc = GradientBoostingClassifier(**final_params)
gbc.fit(train_X, train_y)

preds_train = gbc.predict(train_X)
preds_val = gbc.predict(val_X)
# Column 1 of predict_proba holds the probability of the positive class
pred_probs_val = gbc.predict_proba(val_X)[:, 1]
```

Calculate the accuracy, precision, and recall for predictions on the validation set, and print the confusion matrix:

```
print('train accuracy_score = {}'.format(accuracy_score(y_true=train_y, y_pred=preds_train)))
print('validation accuracy_score = {}'.format(accuracy_score(y_true=val_y, y_pred=preds_val)))

print('confusion_matrix: \n{}'.format(confusion_matrix(y_true=val_y, y_pred=preds_val)))
print('precision_score = {}'.format(precision_score(y_true=val_y, y_pred=preds_val)))
print('recall_score = {}'.format(recall_score(y_true=val_y, y_pred=preds_val)))
```

The output will be as follows:

Figure 6.37: Accuracy, precision, recall, and the confusion matrix
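For reference, precision and recall can be read straight off the confusion matrix. The sketch below does this for a small set of made-up labels and compares the result with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration (1 = positive class)
demo_true = [1, 1, 1, 0, 0, 0, 0, 0]
demo_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred).ravel()

manual_precision = tp / (tp + fp)  # share of predicted positives that are correct
manual_recall = tp / (tp + fn)     # share of actual positives that are found

print(tn, fp, fn, tp)
print(manual_precision, precision_score(demo_true, demo_pred))
print(manual_recall, recall_score(demo_true, demo_pred))
```

Rows of the matrix correspond to the true classes and columns to the predicted classes, so the bottom row holds the actual positives.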

Experiment with varying thresholds to find the optimal point with high recall.

Plot the precision-recall curve:

```
plt.figure(figsize=(10,7))

precision, recall, thresholds = precision_recall_curve(val_y, pred_probs_val)
plt.plot(recall, precision)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
```

The output will be as follows:

Figure 6.38: The precision-recall curve

Plot the variation in precision and recall with increasing threshold values:

```
PR_variation_df = pd.DataFrame({'precision': precision, 'recall': recall}, index=list(thresholds)+[1])

PR_variation_df.plot(figsize=(10,7))
plt.xlabel('Threshold')
plt.ylabel('P/R values')
plt.show()
```

The output will be as follows:

Figure 6.39: Variation in precision and recall with increasing threshold values

Finalize a threshold that will be used for predictions on the test dataset. Let's pick a value of, say, 0.3. The right value depends on the trade-off between precision and recall that you judge to be best, based on your exploration in the previous step:
```
final_threshold = 0.3
```
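If you would rather derive the threshold from a target than read it off the plot, one possible approach is to take the highest threshold whose recall still meets a chosen minimum. The sketch below does this on made-up labels and scores. The 0.75 recall target is an arbitrary assumption for the example, not a value from the activity.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

# Made-up validation labels and predicted probabilities
demo_y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
demo_probs = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

prec, rec, thr = precision_recall_curve(demo_y, demo_probs)

target_recall = 0.75  # assumed requirement, chosen only for this example

# prec and rec have one more entry than thr; drop the final (recall = 0) point
meets_target = rec[:-1] >= target_recall
chosen_threshold = thr[meets_target].max()

# precision_recall_curve treats scores >= threshold as positive
demo_preds = (demo_probs >= chosen_threshold).astype(int)
print(chosen_threshold, recall_score(demo_y, demo_preds))
```

A threshold selected this way is tuned to the validation set, so it is worth checking that small changes to it do not alter precision and recall drastically.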

Read and process the test dataset to convert all features to numerical values. This is done in the same way as for the training dataset earlier, with the only difference being that we don't need to account for the target variable column, as the test dataset does not contain it:

```
test = pd.read_csv('attrition_test.csv')
test.info()

num_orig_cols = test.shape[1] - len(cat_values_dict)
num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
print(num_orig_cols, num_enc_cols)

test_X = np.zeros(shape=(test.shape[0], num_orig_cols+num_enc_cols))

mask = [(each not in cat_values_dict) for each in test.columns]
test_X[:, :num_orig_cols] = test.loc[:, test.columns[mask]]

cat_cols = list(cat_values_dict.keys())

# Reuse the encoder fitted on the training data instead of fitting a new one
test_X[:, num_orig_cols:] = ohe.transform(test[cat_cols])
print(test_X.shape)
```
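An encoder with a fixed category list produces the same columns for the test data as for the training data, even if some category never appears in the test file. A minimal sketch with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category list and data; the test frame never contains 'HR' or 'R&D'
demo_categories = [['Sales', 'R&D', 'HR']]
demo_train = pd.DataFrame({'Department': ['Sales', 'HR', 'R&D']})
demo_test = pd.DataFrame({'Department': ['Sales', 'Sales']})

demo_ohe = OneHotEncoder(categories=demo_categories)
train_enc = demo_ohe.fit_transform(demo_train).toarray()
# transform() reuses the fitted encoder, so the column layout is identical
test_enc = demo_ohe.transform(demo_test).toarray()

print(train_enc.shape, test_enc.shape)  # (3, 3) (2, 3)
```

Matching column layouts matter because the trained model expects its input features in exactly the order it saw during fitting.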

Predict the final values on the test dataset and save them to a file. Use the final threshold value determined earlier to assign a class to each predicted probability for the test set. Then, write the final predictions to the final_predictions.csv file:
```
pred_probs_test = np.array([each[1] for each in gbc.predict_proba(test_X)])
preds_test = (pred_probs_test > final_threshold).astype(int)

with open('final_predictions.csv', 'w') as f:
    f.writelines([str(val)+'\n' for val in preds_test])
```
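The thresholding and file-writing logic can be tried in isolation. The sketch below applies a 0.3 threshold to made-up probabilities, writes one prediction per line to a temporary file (a stand-in for final_predictions.csv), and reads the file back.

```python
import os
import tempfile
import numpy as np

# Made-up probabilities standing in for pred_probs_test
demo_test_probs = np.array([0.05, 0.30, 0.31, 0.90])
demo_threshold = 0.3

# Strictly greater than the threshold, as in the step above
demo_test_preds = (demo_test_probs > demo_threshold).astype(int)

out_path = os.path.join(tempfile.mkdtemp(), 'demo_predictions.csv')
with open(out_path, 'w') as f:
    f.writelines([str(val) + '\n' for val in demo_test_preds])

# Read the file back: one label per line, no header row
with open(out_path) as f:
    written = f.read().split()

print(written)  # ['0', '0', '1', '1']
```

Note that a probability exactly equal to the threshold is assigned to class 0, because the comparison is strict.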
The output will be a CSV file, as follows:
Figure 6.40: The CSV file