Chapter 6: Model Evaluation
Activity 15: Final Test Project
Solution
Import the relevant libraries:
import pandas as pd
import numpy as np
import json
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, precision_recall_curve)
Read the attrition_train.csv dataset into a DataFrame and print the .info() of the DataFrame:
data = pd.read_csv('attrition_train.csv')
data.info()
The output will be as follows:
Read the JSON file with the details of the categorical variables. The JSON file contains a dictionary, where the keys are the column names of the categorical features and the corresponding values are the list of categories in the feature. This file will help us one-hot encode the categorical features into numerical features. Use the json library to load the file object into a dictionary, and print the dictionary:
with open('categorical_variable_values.json', 'r') as f:
    cat_values_dict = json.load(f)
cat_values_dict
The output will be as follows:
Process the dataset to convert all features to numerical values. First, find the number of columns that will stay in their original form (that is, the numerical features) and the number of columns that one-hot encoding will produce from the categorical features. data.shape[1] gives us the number of columns in data, and we subtract len(cat_values_dict) from it to get the number of numerical columns (this count still includes the Attrition target). To find the number of one-hot encoded columns, we sum the number of categories across all categorical variables in the cat_values_dict dictionary, because each category becomes one column:
num_orig_cols = data.shape[1] - len(cat_values_dict)
num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
print(num_orig_cols, num_enc_cols)
The output will be:
26 24
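As a quick check of the arithmetic above, the following sketch runs the same two counts on a toy DataFrame. The column names and category lists are hypothetical stand-ins, not the contents of the real attrition files.

```python
import pandas as pd

# Hypothetical stand-ins for attrition_train.csv and the JSON dictionary.
toy = pd.DataFrame({
    'Age': [30, 41],
    'MonthlyIncome': [5000, 7200],
    'Department': ['Sales', 'HR'],
    'BusinessTravel': ['Rarely', 'Often'],
    'Attrition': [1, 0],
})
toy_cat_values = {
    'Department': ['HR', 'R&D', 'Sales'],
    'BusinessTravel': ['Non-Travel', 'Rarely', 'Often'],
}

# Columns kept as they are (this count still includes the Attrition target).
toy_num_orig_cols = toy.shape[1] - len(toy_cat_values)
# Columns produced by one-hot encoding: one per listed category.
toy_num_enc_cols = sum(len(cats) for cats in toy_cat_values.values())
print(toy_num_orig_cols, toy_num_enc_cols)  # 3 6
```

In this toy case, the feature matrix would need 3 + 6 - 1 = 8 columns once the target is set aside.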
Create a NumPy array of zeros as a placeholder, with a shape equal to the total number of columns, as determined previously, minus one (because the Attrition target variable is also included here). For the numerical columns, we then create a mask that selects the numerical columns from the DataFrame and assigns them to the first num_orig_cols-1 columns in the array, X:
X = np.zeros(shape=(data.shape[0], num_orig_cols+num_enc_cols-1))
mask = [(each not in cat_values_dict and each != 'Attrition') for each in data.columns]
X[:, :num_orig_cols-1] = data.loc[:, data.columns[mask]]
Next, we initialize the OneHotEncoder class from scikit-learn with a list containing the list of values in each categorical column. Then, we transform the categorical columns to one-hot encoded columns and assign them to the remaining columns in X, and save the values of the target variable in the y variable:
cat_cols = list(cat_values_dict.keys())
cat_values = [cat_values_dict[col] for col in data[cat_cols].columns]
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`;
# on recent releases, pass sparse_output=False instead.
ohe = OneHotEncoder(categories=cat_values, sparse=False)
X[:, num_orig_cols-1:] = ohe.fit_transform(X=data[cat_cols])
y = data.Attrition.values
print(X.shape)
print(y.shape)
The output will be:
(1176, 49)
(1176,)
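The reason for passing an explicit category list to the encoder is that the column layout then no longer depends on which values happen to appear in the data. The sketch below illustrates this on a hypothetical single feature whose 'R&D' category never occurs in the rows; it relies on the encoder's default sparse output and converts it with .toarray() so that it does not depend on the name of the dense-output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category list; 'R&D' never occurs in the toy rows below.
toy_categories = [['HR', 'R&D', 'Sales']]
toy_rows = np.array([['Sales'], ['HR']], dtype=object)

toy_ohe = OneHotEncoder(categories=toy_categories)
# The default output is a sparse matrix, so convert it to a dense array.
toy_encoded = toy_ohe.fit_transform(toy_rows).toarray()
print(toy_encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]]
```

The middle column is all zeros but still present, so a training set and a test set encoded with the same category lists end up with matching columns.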
Choose a base model and define the range of hyperparameter values corresponding to the model to be searched over for hyperparameter tuning. Let's use a gradient boosted classifier as our model. We then define ranges of values for all hyperparameters we want to tune in the form of a dictionary:
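meta_gbc = GradientBoostingClassifier()
param_dist = {
    'n_estimators': list(range(10, 210, 10)),
    # 'mae' and 'mse' are accepted only by older scikit-learn releases;
    # recent releases accept 'friedman_mse' and 'squared_error' instead.
    'criterion': ['mae', 'mse'],
    'max_features': ['sqrt', 'log2', 0.25, 0.3, 0.5, 0.8, None],
    'max_depth': list(range(1, 10)),
    'min_samples_leaf': list(range(1, 10))
}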
Define the parameters with which to initialize the RandomizedSearchCV object and use K-fold cross-validation to find the best model hyperparameters. Define the parameters required for random search, including cv as 5, indicating that the hyperparameters should be chosen by evaluating the performance using 5-fold cross-validation. Then, initialize the RandomizedSearchCV object and use the .fit() method to begin the optimization:
rand_search_params = {
    'param_distributions': param_dist,
    'scoring': 'accuracy',
    'n_iter': 100,
    'cv': 5,
    'return_train_score': True,
    'n_jobs': -1,
    'random_state': 11
}
random_search = RandomizedSearchCV(meta_gbc, **rand_search_params)
random_search.fit(X, y)
The output will be as follows:
Once the tuning is complete, find the position (iteration number) at which the highest mean test score was obtained. Find the corresponding hyperparameters and save them to a dictionary:
idx = np.argmax(random_search.cv_results_['mean_test_score'])
final_params = random_search.cv_results_['params'][idx]
final_params
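The fitted search object also exposes the winning configuration directly through its best_params_ attribute, which is expected to agree with the manual argmax lookup. The following is a small sketch on synthetic data; the decision tree, the tiny parameter grid, and the toy_ names are assumptions chosen only to keep the example fast, and are not the model used in this activity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; a decision tree keeps the search quick.
toy_X, toy_y = make_classification(n_samples=120, n_features=6, random_state=0)
toy_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        'max_depth': list(range(1, 6)),
        'min_samples_leaf': list(range(1, 6)),
    },
    n_iter=5, cv=3, scoring='accuracy', random_state=11,
)
toy_search.fit(toy_X, toy_y)

# Manual lookup, as in the step above.
toy_idx = np.argmax(toy_search.cv_results_['mean_test_score'])
manual_params = toy_search.cv_results_['params'][toy_idx]

# Built-in shortcut for the same information.
print(manual_params == toy_search.best_params_)
```

Either route gives the same dictionary here; the manual route is useful when you also want to inspect the other columns of cv_results_ for that iteration.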
The output will be:
Split the dataset into training and validation sets and train a new model using the final hyperparameters on the training dataset. Use scikit-learn's train_test_split() function to split X and y into training and validation components, with the validation set comprising 15% of the dataset:
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.15, random_state=11)
print(train_X.shape, train_y.shape, val_X.shape, val_y.shape)
The output will be:
(999, 49) (999,) (177, 49) (177,)
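The split sizes follow from the 1,176 rows and the 15% held-out fraction. As a rough check, assuming the held-out count is rounded up (which matches the sizes printed above):

```python
import math

n_samples = 1176
test_size = 0.15

# 0.15 * 1176 = 176.4, which is rounded up for the held-out set.
n_val = math.ceil(n_samples * test_size)
n_train = n_samples - n_val
print(n_train, n_val)  # 999 177
```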
Train the gradient boosted classification model using the final hyperparameters and make predictions on the training and validation sets. Also calculate the predicted probability of the positive class on the validation set:
gbc = GradientBoostingClassifier(**final_params)
gbc.fit(train_X, train_y)
preds_train = gbc.predict(train_X)
preds_val = gbc.predict(val_X)
# Column 1 of predict_proba holds the probability of the positive class.
pred_probs_val = gbc.predict_proba(val_X)[:, 1]
Calculate the accuracy, precision, and recall for predictions on the validation set, and print the confusion matrix:
print('train accuracy_score = {}'.format(accuracy_score(y_true=train_y, y_pred=preds_train)))
print('validation accuracy_score = {}'.format(accuracy_score(y_true=val_y, y_pred=preds_val)))
print('confusion_matrix: \n{}'.format(confusion_matrix(y_true=val_y, y_pred=preds_val)))
print('precision_score = {}'.format(precision_score(y_true=val_y, y_pred=preds_val)))
print('recall_score = {}'.format(recall_score(y_true=val_y, y_pred=preds_val)))
The output will be as follows:
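To see how these metrics relate to the confusion matrix, the sketch below computes them by hand on a small made-up set of labels and predictions (not the validation data from this activity). For binary labels, scikit-learn's confusion_matrix lays the counts out as [[TN, FP], [FN, TP]].

```python
import numpy as np

# Made-up labels and predictions for illustration only.
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

tp = int(np.sum((toy_true == 1) & (toy_pred == 1)))  # correctly flagged positives
fp = int(np.sum((toy_true == 0) & (toy_pred == 1)))  # negatives flagged as positive
fn = int(np.sum((toy_true == 1) & (toy_pred == 0)))  # positives that were missed
tn = int(np.sum((toy_true == 0) & (toy_pred == 0)))  # correctly ignored negatives

toy_accuracy = (tp + tn) / len(toy_true)
toy_precision = tp / (tp + fp)  # of everything flagged, how much was right
toy_recall = tp / (tp + fn)     # of all true positives, how many were found
print(tp, fp, fn, tn)                            # 3 2 1 2
print(toy_accuracy, toy_precision, toy_recall)   # 0.625 0.6 0.75
```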
Experiment with varying thresholds to find the optimal point with high recall.
Plot the precision-recall curve:
plt.figure(figsize=(10,7))
precision, recall, thresholds = precision_recall_curve(val_y, pred_probs_val)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
The output will be as follows:
Plot the variation in precision and recall with increasing threshold values:
PR_variation_df = pd.DataFrame({'precision': precision, 'recall': recall}, index=list(thresholds)+[1])
PR_variation_df.plot(figsize=(10,7))
plt.xlabel('Threshold')
plt.ylabel('P/R values')
plt.show()
The output will be as follows:
Finalize a threshold that will be used for predictions on the test dataset. Let's finalize a value, say, 0.3. This value is entirely dependent on what you feel would be optimal based on your exploration in the previous step:
final_threshold = 0.3
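If you would rather derive the threshold from a target than read it off the plot, one possible approach is to take the highest threshold that still reaches a minimum recall. The helper name highest_threshold_with_recall, the toy labels, and the toy probabilities below are all hypothetical; this is a sketch of one selection rule, not the method used in this activity.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def highest_threshold_with_recall(y_true, probs, min_recall):
    """Return the largest threshold whose recall is at least min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, probs)
    # recall[i] corresponds to predicting positive when prob >= thresholds[i];
    # the final recall entry (0.0) has no matching threshold, so drop it.
    meets_target = recall[:-1] >= min_recall
    return float(thresholds[meets_target].max())

# Made-up validation labels and positive-class probabilities.
toy_y = np.array([0, 1, 0, 1, 0, 1])
toy_probs = np.array([0.1, 0.35, 0.45, 0.8, 0.2, 0.6])

toy_threshold = highest_threshold_with_recall(toy_y, toy_probs, min_recall=1.0)
print(toy_threshold)  # 0.35
```

Note that precision_recall_curve treats a score equal to the threshold as positive (>=), so if you later apply the threshold with a strict > comparison, a sample sitting exactly on the threshold is classified differently.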
Read and process the test dataset to convert all features to numerical values. This will be done in a manner similar to that in step 4, with the only difference that we don't need to account for the target variable column, as the dataset does not contain it:
test = pd.read_csv('attrition_test.csv')
test.info()
num_orig_cols = test.shape[1] - len(cat_values_dict)
num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
print(num_orig_cols, num_enc_cols)
test_X = np.zeros(shape=(test.shape[0], num_orig_cols+num_enc_cols))
mask = [(each not in cat_values_dict) for each in test.columns]
test_X[:, :num_orig_cols] = test.loc[:, test.columns[mask]]
cat_cols = list(cat_values_dict.keys())
cat_values = [cat_values_dict[col] for col in test[cat_cols].columns]
# Note: on scikit-learn 1.2 and later, pass sparse_output=False instead of sparse=False.
ohe = OneHotEncoder(categories=cat_values, sparse=False)
test_X[:, num_orig_cols:] = ohe.fit_transform(X=test[cat_cols])
print(test_X.shape)
Predict the final values on the test dataset and save them to a file. Use the final threshold value determined in step 10 to find the class for each row in the test set. Then, write the final predictions to the final_predictions.csv file:
# Column 1 of predict_proba holds the probability of the positive class.
pred_probs_test = gbc.predict_proba(test_X)[:, 1]
preds_test = (pred_probs_test > final_threshold).astype(int)
with open('final_predictions.csv', 'w') as f:
    f.writelines([str(val)+'\n' for val in preds_test])
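Because the training and test processing differ only in whether a target column has to be skipped, the two could be folded into one helper. The function encode_features and the toy data below are hypothetical and only sketch the idea; the sketch uses the encoder's default sparse output and converts it with .toarray().

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def encode_features(df, cat_values_dict, target=None):
    """Return numeric columns followed by one-hot columns, skipping `target`."""
    cat_cols = list(cat_values_dict.keys())
    num_cols = [c for c in df.columns
                if c not in cat_values_dict and c != target]
    ohe = OneHotEncoder(categories=[cat_values_dict[c] for c in cat_cols])
    # Fixed category lists make the layout identical for any DataFrame.
    encoded = ohe.fit_transform(df[cat_cols].to_numpy(dtype=object)).toarray()
    return np.hstack([df[num_cols].to_numpy(dtype=float), encoded])

# Made-up stand-ins for the category dictionary and the two CSV files.
toy_cats = {'Dept': ['HR', 'Sales'],
            'Travel': ['Non-Travel', 'Rarely', 'Often']}
toy_train = pd.DataFrame({'Age': [30, 41], 'Dept': ['Sales', 'HR'],
                          'Travel': ['Often', 'Non-Travel'], 'Attrition': [1, 0]})
toy_test = pd.DataFrame({'Age': [25], 'Dept': ['HR'], 'Travel': ['Rarely']})

toy_train_X = encode_features(toy_train, toy_cats, target='Attrition')
toy_test_X = encode_features(toy_test, toy_cats)
print(toy_train_X.shape, toy_test_X.shape)  # (2, 6) (1, 6)
```

Both matrices end up with the same six columns, which is what the model needs in order to score the test set.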
The output will be a CSV file, as follows: