Data Science Projects with Python

Chapter 2: Introduction to Scikit-Learn and Model Evaluation


Activity 2: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

  1. Use scikit-learn's train_test_split to make a new set of training and testing data. This time, instead of EDUCATION, use LIMIT_BAL: the account's credit limit.

    Execute the following code to do this:

    from sklearn.model_selection import train_test_split  # if not already imported

    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
        df['LIMIT_BAL'].values.reshape(-1,1), df['default payment next month'].values,
        test_size=0.2, random_state=24)

    Notice here we create new training and testing splits, with new variable names.
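
    If you want to confirm that the split behaved as expected, an optional quick check (not part of the original activity) is to inspect the array shapes:

    # The feature array should be a 2D column vector and the labels 1D,
    # with an 80/20 train/test split of the rows
    print(X_train_2.shape, X_test_2.shape)
    print(y_train_2.shape, y_test_2.shape)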

  2. Train a logistic regression model using the training data from your split.

    The following code does this:

    example_lr.fit(X_train_2, y_train_2)

    We reuse the same model object, example_lr. We can re-train this object to learn the relationship between this new feature and the response. We could even try a different train/test split, if we wanted to, without creating a new model object. The existing model object has been updated in-place.
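
    If you are working in a fresh session and example_lr does not already exist, here is a minimal sketch of how such a model object could be created before the fit call above (the constructor options are illustrative; the earlier exercise may have used different settings):

    from sklearn.linear_model import LogisticRegression

    # Illustrative only: create a logistic regression model object to reuse,
    # then call example_lr.fit(X_train_2, y_train_2) as shown above
    example_lr = LogisticRegression(solver='liblinear')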

  3. Create the array of predicted probabilities for the testing data:

    Here is the code for this step:

    y_test_2_pred_proba = example_lr.predict_proba(X_test_2)
  4. Calculate the ROC AUC using the predicted probabilities and the true labels of the testing data. Compare this to the ROC AUC from using the EDUCATION feature:

    Run this code for this step:

    metrics.roc_auc_score(y_test_2, y_test_2_pred_proba[:,1])

    The output is as follows:

    Figure 6.47: Calculating the ROC AUC

    Notice that we index the predicted probabilities array to get the probabilities of the positive class from the second column. How does this compare to the ROC AUC from the EDUCATION logistic regression? The AUC is higher. This may be because we are now using a feature that is directly related to an account's financial status (its credit limit) to predict another aspect of that financial status (whether or not the account will default), rather than a feature that is less directly related to finances.
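
    If you want to double-check which column of the predicted probabilities corresponds to the positive class, the fitted model stores the class order in its classes_ attribute; this quick check is optional and not part of the original activity:

    # Columns of predict_proba follow the order of example_lr.classes_,
    # so with labels 0 and 1 the positive class is the second column
    print(example_lr.classes_)
    print(y_test_2_pred_proba[:5, :])   # each row sums to 1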

  5. Plot the ROC curve.

    Here is the code to do this; it's similar to the code we used in the previous exercise:

    fpr_2, tpr_2, thresholds_2 = metrics.roc_curve(y_test_2, y_test_2_pred_proba[:,1])
    plt.plot(fpr_2, tpr_2, '*-')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.legend(['Logistic regression', 'Random chance'])
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('ROC curve for logistic regression with LIMIT_BAL feature')

    The plot should appear as follows:

    Figure 6.48: ROC curve for the LIMIT_BAL logistic regression

    This looks a little more like the kind of ROC curve we'd like to see: it sits a bit further from the random chance line than the curve for the model that used only EDUCATION. Also notice that the variation in the pairs of true and false positive rates is a little smoother over the range of thresholds, reflecting the larger number of distinct values of the LIMIT_BAL feature.
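
    To see where the extra smoothness comes from, you can count the distinct values of each feature; this quick comparison is an optional addition, assuming df is still in scope:

    # More distinct feature values lead to more distinct predicted probabilities,
    # and therefore more threshold points along the ROC curve
    print(df['EDUCATION'].nunique())
    print(df['LIMIT_BAL'].nunique())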

  6. Calculate the data for the precision-recall curve on the testing data using scikit-learn functionality.

    Precision is often considered in tandem with recall, which you are already familiar with: recall is just another name for the true positive rate. We can use precision_recall_curve in sklearn.metrics to automatically vary the threshold and calculate pairs of precision and recall values at each one. Here is the code to retrieve these values, which is similar to the code for roc_curve:

    precision, recall, thresh_3 = \
        metrics.precision_recall_curve(y_test_2, y_test_2_pred_proba[:,1])
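
    To connect these curves back to the definitions, here is an optional sketch (not part of the original activity) that computes precision and recall by hand at a single illustrative threshold; precision_recall_curve simply repeats this kind of calculation across many thresholds:

    import numpy as np

    # Use the median predicted probability as an illustrative threshold,
    # which guarantees some positive predictions for this sketch
    thresh = np.median(y_test_2_pred_proba[:, 1])
    y_pred_at_thresh = (y_test_2_pred_proba[:, 1] >= thresh).astype(int)

    tp = np.sum((y_pred_at_thresh == 1) & (y_test_2 == 1))   # true positives
    fp = np.sum((y_pred_at_thresh == 1) & (y_test_2 == 0))   # false positives
    fn = np.sum((y_pred_at_thresh == 0) & (y_test_2 == 1))   # false negatives

    print('Precision:', tp / (tp + fp))   # fraction of predicted positives that are correct
    print('Recall:', tp / (tp + fn))      # true positive rate
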
  7. Plot the precision-recall curve using matplotlib. We can do this with the following code.

    Note that we put recall on the x-axis, precision on the y-axis, and set the axes limits to the range [0, 1]:

    plt.plot(recall, precision, '-x')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Another logistic regression with just one feature: LIMIT_BAL')
    plt.xlim([0, 1])
    plt.ylim([0, 1])

    Figure 6.49: Plot of the precision-recall curve

  8. Use scikit-learn to calculate the area under the precision-recall curve.

    Here is the code for this:

    metrics.auc(recall, precision)

    You will obtain the following output:

    Figure 6.50: Area under the precision-recall curve

    The precision-recall curve shows that precision is generally fairly low for this model: for nearly the entire range of thresholds, the precision, or the proportion of positive classifications that are correct, is less than half. We can calculate the area under the precision-recall curve as a way to compare this classifier with other models or feature sets we may consider.

    Scikit-learn offers functionality for calculating the area under the curve for any set of x-y data using the trapezoid rule, which you may recall from calculus: metrics.auc. We used this function here to get the area under the precision-recall curve.
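
    As a quick illustration of the trapezoid rule (an optional aside, not in the original activity), you can check metrics.auc by hand on a tiny set of points:

    # Two trapezoids under the points (0, 0), (0.5, 0.75), and (1, 1):
    # 0.5*(0 + 0.75)/2 + 0.5*(0.75 + 1)/2 = 0.1875 + 0.4375 = 0.625
    print(metrics.auc([0, 0.5, 1], [0, 0.75, 1]))   # 0.625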

  9. Now recalculate the ROC AUC, except this time do it for the training data. How is this different, conceptually and quantitatively, from your earlier calculation?

    First, we need to calculate predicted probabilities using the training data, as opposed to the testing data. Then we can calculate the ROC AUC using the training data labels. Here is the code:

    y_train_2_pred_proba = example_lr.predict_proba(X_train_2)
    metrics.roc_auc_score(y_train_2, y_train_2_pred_proba[:,1])

    You should obtain the following output:

    Figure 6.51: Training data ROC AUC

Quantitatively, we can see that this AUC is not all that different from the testing data ROC AUC we calculated earlier. Both are about 0.62. Conceptually, what is the difference? When we calculate this metric on the training data, we are measuring the model's skill in predicting the same data that "taught" it how to make predictions; in other words, we are seeing how well the model fits the data. When we compare this to the testing data metric, we are comparing how well the model fits the data it was trained on with how well it performs on data it has never seen. If there were much of a difference between these scores, which would usually take the form of a higher training score than testing score, it would indicate that although the model fits the training data well, it does not generalize well to new, unseen data.

In this case, the training and testing scores are similar, meaning the model does about as well on out-of-sample data as it does on the same data used in model training. We will learn more about the insights we can gain by comparing training and testing scores in Chapter 4, The Bias-Variance Trade-off.
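
If you would like to see the two scores side by side, a small optional snippet (reusing the arrays computed above) prints them together:

    # A large gap, usually a higher training score than testing score, would
    # suggest overfitting; similar values suggest the model generalizes about
    # as well as it fits
    train_auc = metrics.roc_auc_score(y_train_2, y_train_2_pred_proba[:, 1])
    test_auc = metrics.roc_auc_score(y_test_2, y_test_2_pred_proba[:, 1])
    print('Training ROC AUC: {:.3f}'.format(train_auc))
    print('Testing ROC AUC: {:.3f}'.format(test_auc))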
