Applied Supervised Learning with Python

Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning

Product type: Paperback
Published: Apr 2019
ISBN-13: 9781789954920
Length: 404 pages
Edition: 1st Edition
Authors: Ishita Mathur, Benjamin Johnston

Chapter 6: Model Evaluation


Activity 15: Final Test Project

Solution

  1. Import the relevant libraries:

    import pandas as pd
    import numpy as np
    import json
    
    %matplotlib inline
    import matplotlib.pyplot as plt
    
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.model_selection import RandomizedSearchCV, train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 confusion_matrix, precision_recall_curve)

  2. Read the attrition_train.csv dataset into a DataFrame and print a summary of it with .info():

    data = pd.read_csv('attrition_train.csv')
    data.info()

    The output will be as follows:

    Figure 6.33: Output of info()

  3. Read the JSON file with the details of the categorical variables. The JSON file contains a dictionary, where the keys are the column names of the categorical features and the corresponding values are the list of categories in the feature. This file will help us one-hot encode the categorical features into numerical features. Use the json library to load the file object into a dictionary, and print the dictionary:

    with open('categorical_variable_values.json', 'r') as f:
        cat_values_dict = json.load(f)
    cat_values_dict

    The output will be as follows:

    Figure 6.34: The JSON file
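As a rough illustration of the structure described in this step, the sketch below parses a small JSON string of the same shape. The column names and categories are hypothetical stand-ins, not the actual contents of categorical_variable_values.json.

```python
import json

# Hypothetical stand-in for the file contents: keys are categorical column
# names, values are the allowed categories for that column.
raw = """
{
    "Gender": ["Female", "Male"],
    "OverTime": ["No", "Yes"],
    "BusinessTravel": ["Non-Travel", "Travel_Frequently", "Travel_Rarely"]
}
"""

example_cat_values = json.loads(raw)

# Each categorical feature expands into one indicator column per category.
for column, categories in example_cat_values.items():
    print(column, "->", len(categories), "encoded columns")

# Total number of indicator columns this dictionary would produce.
total_encoded = sum(len(cats) for cats in example_cat_values.values())
print(total_encoded)
```

Listing the categories explicitly, rather than inferring them from the data, keeps the encoded column layout the same for any dataset that is encoded with this dictionary.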

  4. Process the dataset to convert all features to numerical values. First, count the columns that will stay in their original form (the numerical features) and the columns that one-hot encoding will produce (from the categorical features). data.shape[1] gives the number of columns in data; subtracting len(cat_values_dict) leaves the number of non-categorical columns, which still includes the Attrition target. The number of encoded columns is the total number of categories across all categorical variables in the cat_values_dict dictionary:

    num_orig_cols = data.shape[1] - len(cat_values_dict)
    num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
    print(num_orig_cols, num_enc_cols)

    The output will be:

    26 24

    Create a NumPy array of zeros as a placeholder, with one row per record and a column count equal to the total determined previously, minus one (because the Attrition target variable is included in num_orig_cols). Then build a Boolean mask that selects the numerical feature columns from the DataFrame, and assign those columns to the first num_orig_cols-1 columns of the array, X:

    X = np.zeros(shape=(data.shape[0], num_orig_cols+num_enc_cols-1))
    
    mask = [(each not in cat_values_dict and each != 'Attrition') for each in data.columns]
    X[:, :num_orig_cols-1] = data.loc[:, data.columns[mask]]

    Next, we initialize the OneHotEncoder class from scikit-learn with a list containing the list of values in each categorical column. Then, we transform the categorical columns to one-hot encoded columns and assign them to the remaining columns in X, and save the values of the target variable in the y variable:

    cat_cols = list(cat_values_dict.keys())
    cat_values = [cat_values_dict[col] for col in data[cat_cols].columns]
    
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer releases
    ohe = OneHotEncoder(categories=cat_values, sparse=False)
    
    X[:, num_orig_cols-1:] = ohe.fit_transform(X=data[cat_cols])
    y = data.Attrition.values
    
    print(X.shape)
    print(y.shape)

    The output will be:

    (1176, 49)
    (1176,)
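The column layout built in this step can be sketched on toy data. The example below is a minimal illustration that performs the one-hot encoding in plain NumPy rather than with OneHotEncoder, so that the arithmetic is visible; the data, the category lists, and the one_hot helper are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: two numerical columns and two categorical columns.
numeric = np.array([[41.0, 5993.0],
                    [49.0, 5130.0],
                    [37.0, 2090.0]])
categorical = [["Male", "Yes"],
               ["Female", "No"],
               ["Male", "No"]]
categories = [["Female", "Male"], ["No", "Yes"]]


def one_hot(rows, categories):
    """Return one indicator column per category, feature by feature."""
    width = sum(len(cats) for cats in categories)
    out = np.zeros((len(rows), width))
    offset = 0
    for j, cats in enumerate(categories):
        for i, row in enumerate(rows):
            out[i, offset + cats.index(row[j])] = 1.0
        offset += len(cats)
    return out


num_numeric = numeric.shape[1]
num_encoded = sum(len(cats) for cats in categories)

# Numerical features first, then the one-hot blocks, as in the step above.
X_toy = np.zeros((numeric.shape[0], num_numeric + num_encoded))
X_toy[:, :num_numeric] = numeric
X_toy[:, num_numeric:] = one_hot(categorical, categories)
print(X_toy)
```

Each row has exactly one 1 per categorical feature in its encoded block, which is a quick sanity check worth running on the real X as well.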

  5. Choose a base model and define the hyperparameter values to search over during tuning. Let's use a gradient boosted classifier as our model. We then define, in a dictionary, the candidate values for each hyperparameter we want to tune:

    meta_gbc = GradientBoostingClassifier()
    
    param_dist = {
        'n_estimators': list(range(10, 210, 10)),
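        # 'mae' and 'mse' match the scikit-learn release the book targets;
        # recent releases accept 'friedman_mse' and 'squared_error' instead
        'criterion': ['mae', 'mse'],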
        'max_features': ['sqrt', 'log2', 0.25, 0.3, 0.5, 0.8, None],
        'max_depth': list(range(1, 10)),
        'min_samples_leaf': list(range(1, 10))
    }
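To get a feel for why a randomized search is a reasonable choice here, the sketch below counts how many distinct combinations the dictionary above describes. The budget of 100 sampled combinations is an assumption made for this illustration.

```python
import math

param_dist = {
    'n_estimators': list(range(10, 210, 10)),
    'criterion': ['mae', 'mse'],
    'max_features': ['sqrt', 'log2', 0.25, 0.3, 0.5, 0.8, None],
    'max_depth': list(range(1, 10)),
    'min_samples_leaf': list(range(1, 10)),
}

# An exhaustive grid would need one model per combination of values.
grid_size = math.prod(len(values) for values in param_dist.values())

# Assumed sampling budget for a randomized search (illustrative only).
n_iter = 100
fraction = n_iter / grid_size

print(grid_size, round(fraction, 4))
```

With cross-validation, every combination is fitted once per fold, so sampling a small fraction of the grid saves a proportional amount of training time.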

  6. Define the parameters with which to initialize the RandomizedSearchCV object and use K-fold cross-validation to find the best model hyperparameters. Collect the random search settings in a dictionary, including cv set to 5, so that each sampled combination is scored using 5-fold cross-validation, and n_iter set to 100, so that 100 combinations are sampled. Then, initialize the RandomizedSearchCV object and call its .fit() method to begin the optimization:

    rand_search_params = {
        'param_distributions': param_dist,
        'scoring': 'accuracy',
        'n_iter': 100,
        'cv': 5,
        'return_train_score': True,
        'n_jobs': -1,
        'random_state': 11
    }
    random_search = RandomizedSearchCV(meta_gbc, **rand_search_params)
    random_search.fit(X, y)

    The output will be as follows:

    Figure 6.35: Output of the optimization process

    Once the tuning is complete, find the position (iteration number) at which the highest mean test score was obtained. Find the corresponding hyperparameters and save them to a dictionary:

    idx = np.argmax(random_search.cv_results_['mean_test_score'])
    final_params = random_search.cv_results_['params'][idx]
    final_params

    The output will be:

    Figure 6.36: The hyperparameters dictionary
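The manual lookup through cv_results_ can be cross-checked against the best_index_ and best_params_ attributes that a fitted RandomizedSearchCV exposes. The sketch below does this on a small synthetic dataset; the data and the reduced search space are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the attrition data (illustrative only).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={'n_estimators': [10, 20, 30], 'max_depth': [1, 2, 3]},
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=11,
)
search.fit(X_toy, y_toy)

# Manual lookup, as in the step above.
idx = int(np.argmax(search.cv_results_['mean_test_score']))
manual_params = search.cv_results_['params'][idx]

# Built-in equivalents on the fitted search object.
print(idx, search.best_index_)
print(manual_params, search.best_params_)
```

Both routes should point at the same candidate; the manual version is useful when you want to inspect runners-up in cv_results_ as well.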

  7. Split the dataset into training and validation sets and train a new model using the final hyperparameters on the training dataset. Use scikit-learn's train_test_split() method to split X and y into training and validation components, with the validation set comprising 15% of the dataset:

    train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.15, random_state=11)
    print(train_X.shape, train_y.shape, val_X.shape, val_y.shape)

    The output will be:

    (999, 49) (999,) (177, 49) (177,)

    Train the gradient boosted classification model using the final hyperparameters and make predictions on the training and validation sets. Also compute the predicted probability of the positive class for each record in the validation set:

    gbc = GradientBoostingClassifier(**final_params)
    gbc.fit(train_X, train_y)
    
    preds_train = gbc.predict(train_X)
    preds_val = gbc.predict(val_X)
    # Column 1 holds the probability of the second class in gbc.classes_
    pred_probs_val = gbc.predict_proba(val_X)[:, 1]

  8. Calculate the accuracy, precision, and recall for predictions on the validation set, and print the confusion matrix:

    print('train accuracy_score = {}'.format(accuracy_score(y_true=train_y, y_pred=preds_train)))
    print('validation accuracy_score = {}'.format(accuracy_score(y_true=val_y, y_pred=preds_val)))
    
    print('confusion_matrix: \n{}'.format(confusion_matrix(y_true=val_y, y_pred=preds_val)))
    print('precision_score = {}'.format(precision_score(y_true=val_y, y_pred=preds_val)))
    print('recall_score = {}'.format(recall_score(y_true=val_y, y_pred=preds_val)))

    The output will be as follows:

    Figure 6.37: Accuracy, precision, recall, and the confusion matrix
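To see how these metrics relate to the confusion matrix, the sketch below derives them by hand with NumPy on a small set of hypothetical labels and predictions. The matrix uses the same layout as scikit-learn's confusion_matrix for binary 0/1 labels: rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = positive class).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# Rows: true class, columns: predicted class.
matrix = np.array([[tn, fp],
                   [fn, tp]])

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(matrix)
print(accuracy, precision, recall)
```

When the positive class is rare, accuracy can stay high even if recall is poor, which is why the later steps look at precision and recall separately.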

  9. Experiment with varying thresholds to find the optimal point with high recall.

    Plot the precision-recall curve:

    plt.figure(figsize=(10,7))
    
    precision, recall, thresholds = precision_recall_curve(val_y, pred_probs_val)
    plt.plot(recall, precision)
    
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.show()

    The output will be as follows:

    Figure 6.38: The precision-recall curve

    Plot the variation in precision and recall with increasing threshold values:

    PR_variation_df = pd.DataFrame({'precision': precision, 'recall': recall}, index=list(thresholds)+[1])
    
    PR_variation_df.plot(figsize=(10,7))
    plt.xlabel('Threshold')
    plt.ylabel('P/R values')
    plt.show()

    The output will be as follows:

    Figure 6.39: Variation in precision and recall with increasing threshold values

  10. Finalize a threshold that will be used for predictions on the test dataset. Let's settle on a value of, say, 0.3. The right value depends on the trade-off between precision and recall that you judged acceptable in the previous step:

    final_threshold = 0.3
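If you would rather derive the threshold from a rule than pick it by eye, one possible approach is to take the highest threshold that still meets a minimum recall. The sketch below shows this on hypothetical validation labels and probabilities; the helper names, the candidate grid, and the recall target are all assumptions for illustration.

```python
import numpy as np


def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when predicting 1 for probs above the threshold."""
    preds = (probs > threshold).astype(int)
    tp = np.sum((preds == 1) & (y_true == 1))
    fp = np.sum((preds == 1) & (y_true == 0))
    fn = np.sum((preds == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)


def highest_threshold_with_recall(y_true, probs, min_recall, candidates):
    """Largest candidate threshold whose recall is at least min_recall."""
    ok = [t for t in candidates
          if precision_recall_at(y_true, probs, t)[1] >= min_recall]
    return max(ok) if ok else None


# Hypothetical validation labels and positive-class probabilities.
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
probs_val = np.array([0.10, 0.32, 0.40, 0.20, 0.80, 0.28, 0.60, 0.90])
candidates = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

chosen = highest_threshold_with_recall(y_val, probs_val, 0.75, candidates)
print(chosen, precision_recall_at(y_val, probs_val, chosen))
```

Because recall can only fall as the threshold rises, the highest qualifying threshold is also the one that gives up the least precision for the required recall.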

  11. Read and process the test dataset to convert all features to numerical values. This mirrors step 4; the only difference is that there is no target column to account for, because the test dataset does not contain one:

    test = pd.read_csv('attrition_test.csv')
    test.info()
    
    
    num_orig_cols = test.shape[1] - len(cat_values_dict)
    num_enc_cols = sum([len(cats) for cats in cat_values_dict.values()])
    print(num_orig_cols, num_enc_cols)
    
    
    test_X = np.zeros(shape=(test.shape[0], num_orig_cols+num_enc_cols))
    
    mask = [(each not in cat_values_dict) for each in test.columns]
    test_X[:, :num_orig_cols] = test.loc[:, test.columns[mask]]
    
    cat_cols = list(cat_values_dict.keys())
    
    # Reuse the encoder fitted in step 4 instead of fitting a new one on the
    # test data, so the encoded columns match the layout used for training
    test_X[:, num_orig_cols:] = ohe.transform(test[cat_cols])
    print(test_X.shape)

  12. Predict the final values on the test dataset and save them to a file. Use the final threshold value determined in step 10 to assign a class to each record in the test set. Then, write the final predictions to the final_predictions.csv file:

    pred_probs_test = gbc.predict_proba(test_X)[:, 1]
    preds_test = (pred_probs_test > final_threshold).astype(int)
    
    with open('final_predictions.csv', 'w') as f:
        f.writelines([str(val)+'\n' for val in preds_test])

    The output will be a CSV file, as follows:

    Figure 6.40: The CSV file
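The thresholding and file-writing logic can be tried in isolation. The sketch below applies a threshold to a few hypothetical probabilities, writes one prediction per line to a temporary file, and reads the file back to confirm its contents; the probabilities and the file name are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical positive-class probabilities for five test records.
pred_probs = np.array([0.05, 0.31, 0.30, 0.72, 0.18])
threshold = 0.3

# Strict comparison: a probability exactly equal to the threshold maps to 0.
preds = (pred_probs > threshold).astype(int)

# Write one prediction per line to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example_predictions.csv')
with open(path, 'w') as f:
    f.writelines(f'{val}\n' for val in preds)

# Read the file back to check what was written.
with open(path) as f:
    read_back = [int(line) for line in f]

print(read_back)
```

The file has no header row and no record identifier, so the row order must match the order of the test dataset for the predictions to be usable.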
