Data Science Projects with Python

You're reading from Data Science Projects with Python: A case study approach to successful data science projects using Python, pandas, and scikit-learn

Product type Paperback
Published in Apr 2019
Publisher Packt
ISBN-13 9781838551025
Length 374 pages
Edition 1st Edition
Author Stephen Klosterman

Chapter 4: The Bias-Variance Trade-off


Activity 4: Cross-Validation and Feature Engineering with the Case Study Data

  1. Select the features from the DataFrame of the case study data.

    You can use the list of feature names that we've already created in this chapter. But be sure not to include the response variable, which would be a very good (but entirely inappropriate) feature:

    features = features_response[:-1]
    X = df[features].values
  2. Make a train/test split using a random seed of 24:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, df['default payment next month'].values,
        test_size=0.2, random_state=24)

    We'll use this going forward and reserve this testing data as the unseen test set. This way, we can easily create separate notebooks with other modeling approaches, using the same training data.

  3. Instantiate the MinMaxScaler to scale the data, as shown in the following code:

    from sklearn.preprocessing import MinMaxScaler
    min_max_sc = MinMaxScaler()
  4. Instantiate a logistic regression model with the saga solver and L1 penalty, and set max_iter to 1,000 to allow the solver enough iterations to find a good solution:

    from sklearn.linear_model import LogisticRegression
    lr = LogisticRegression(solver='saga', penalty='l1', max_iter=1000)
  5. Import the Pipeline class and create a Pipeline with the scaler and the logistic regression model, using the names 'scaler' and 'model' for the steps, respectively:

    from sklearn.pipeline import Pipeline
    scale_lr_pipeline = Pipeline(steps=[('scaler', min_max_sc), ('model', lr)])
  6. Use the get_params and set_params methods to see how to view the parameters from each stage of the pipeline and change them:

    scale_lr_pipeline.get_params()
    scale_lr_pipeline.get_params()['model__C']
    scale_lr_pipeline.set_params(model__C = 2)
  7. Create a smaller range of C values to test with cross-validation, as these models will take longer to train and test with more data than in our previous exercises; we recommend C = [10^2, 10, 1, 10^-1, 10^-2, 10^-3]:

    C_val_exponents = np.linspace(2,-3,6)
    # np.float was removed in NumPy 1.24; the built-in float behaves the same here
    C_vals = float(10)**C_val_exponents
  8. Make a new version of the cross_val_C_search function, called cross_val_C_search_pipe. Instead of the model argument, this function will take a pipeline argument. The changes inside the function will be to set the C value using set_params(model__C = <value you want to test>) on the pipeline, replacing model with the pipeline for the fit and predict_proba methods, and accessing the C value using pipeline.get_params()['model__C'] for the printed status update.

    The changes are as follows:

    def cross_val_C_search_pipe(k_folds, C_vals, pipeline, X, Y):
        ##[…]
        pipeline.set_params(model__C = C_vals[c_val_counter])
        ##[…]
        pipeline.fit(X_cv_train, y_cv_train)
        ##[…]
        y_cv_train_predict_proba = pipeline.predict_proba(X_cv_train)
        ##[…]
        y_cv_test_predict_proba = pipeline.predict_proba(X_cv_test)
        ##[…]
        print('Done with C = {}'.format(pipeline.get_params()['model__C']))

    Note

    For the complete code, refer to http://bit.ly/2ZAy2Pr.

  9. Run this function as in the previous exercise, but using the new range of C values, the pipeline you created, and the features and response variable from the training split of the case study data. You may see warnings here, or in later steps, about the non-convergence of the solver. You could experiment with the tol or max_iter options to try to achieve convergence, although the results you obtain with max_iter = 1000 are likely to be sufficient. Here is the code to do this:

    cv_train_roc_auc, cv_test_roc_auc, cv_test_roc = \
        cross_val_C_search_pipe(k_folds, C_vals, scale_lr_pipeline, X_train, y_train)

    You will obtain the following output:

    Done with C = 100.0
    Done with C = 10.0
    Done with C = 1.0
    Done with C = 0.1
    Done with C = 0.01
    Done with C = 0.001
  10. Plot the average training and testing ROC AUC across folds, for each C value, using the following code:

    plt.plot(C_val_exponents, np.mean(cv_train_roc_auc, axis=0), '-o',
            label='Average training score')
    plt.plot(C_val_exponents, np.mean(cv_test_roc_auc, axis=0), '-x',
            label='Average testing score')
    plt.ylabel('ROC AUC')
    plt.xlabel('log$_{10}$(C)')
    plt.legend()
    plt.title('Cross validation on Case Study problem')
    np.mean(cv_test_roc_auc, axis=0)

    You will obtain the following output:

    Figure 6.54: Cross-validation testing performance

    You should notice that regularization does not impart much benefit here, as may be expected. While using all the available features increases model performance over our previous efforts, there appears to be no overfitting: the training and testing scores are about the same. It's possible that we are underfitting instead. Let's try engineering some interaction features to see if they can improve performance.

  11. Create interaction features for the case study data and confirm that the number of new features makes sense using the following code:

    from sklearn.preprocessing import PolynomialFeatures
    make_interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_interact = make_interactions.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_interact, df['default payment next month'].values,
        test_size=0.2, random_state=24)
    print(X_train.shape)
    print(X_test.shape)

    You will obtain the following output:

    (21331, 153)
    (5333, 153)

    From this, you should see that the new number of features is 153, which is 17 + "17 choose 2" = 17 + 136 = 153. The "17 choose 2" part comes from choosing all possible combinations of 2 features to interact from the 17 available.

  12. Repeat the cross-validation procedure and observe the model performance now; that is, repeat Steps 9 and 10. Note that this will take substantially more time, due to the larger number of features, but it will probably take only a few minutes.

    You will obtain the following output:

    Figure 6.55: Improved cross-validation testing performance from adding interaction features
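
The pipeline-based search in Steps 3 to 9 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the book's cross_val_C_search_pipe: the function name cv_c_search_sketch, the synthetic data from make_classification, and the choice of a 4-fold StratifiedKFold are all hypothetical stand-ins, and only the ROC AUC scores are recorded. The point it tries to show is that setting model__C on the pipeline and fitting the whole pipeline inside each fold means the scaler only ever sees the training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def cv_c_search_sketch(k_folds, C_vals, pipeline, X, y):
    """Hypothetical stand-in for cross_val_C_search_pipe (ROC AUC only)."""
    n_folds = k_folds.get_n_splits()
    train_auc = np.empty((n_folds, len(C_vals)))
    test_auc = np.empty((n_folds, len(C_vals)))
    for c_idx, C in enumerate(C_vals):
        # Set C on the 'model' step through the pipeline
        pipeline.set_params(model__C=C)
        for fold_idx, (train_idx, test_idx) in enumerate(k_folds.split(X, y)):
            X_cv_train, X_cv_test = X[train_idx], X[test_idx]
            y_cv_train, y_cv_test = y[train_idx], y[test_idx]
            # Fitting the pipeline fits the scaler on the training fold only
            pipeline.fit(X_cv_train, y_cv_train)
            train_auc[fold_idx, c_idx] = roc_auc_score(
                y_cv_train, pipeline.predict_proba(X_cv_train)[:, 1])
            test_auc[fold_idx, c_idx] = roc_auc_score(
                y_cv_test, pipeline.predict_proba(X_cv_test)[:, 1])
        print('Done with C = {}'.format(pipeline.get_params()['model__C']))
    return train_auc, test_auc


# Synthetic stand-in for the case study data (an assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=24)

demo_pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(solver='saga', penalty='l1', max_iter=1000)),
])

C_val_exponents = np.linspace(2, -3, 6)
C_vals = 10.0 ** C_val_exponents

# Fold count and shuffling seed are arbitrary choices for this sketch
demo_folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

demo_train_auc, demo_test_auc = cv_c_search_sketch(
    demo_folds, C_vals, demo_pipeline, X_demo, y_demo)

# Average testing ROC AUC across folds, one value per C
print(np.mean(demo_test_auc, axis=0))
```

The returned arrays have one row per fold and one column per C value, so np.mean(..., axis=0) gives the per-C averages that are plotted in Step 10.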

So, does the average cross-validation testing performance improve with the interaction features? Is regularization useful?

Engineering the interaction features increases the best model testing score to about ROC AUC = 0.74 on average across the folds, from about 0.72 without interactions. These scores occur at C = 100, that is, with negligible regularization. On the plot of training versus testing scores for the model with interactions, the training score is a bit higher than the testing score, so some amount of overfitting is going on. However, we cannot increase the testing score through regularization here, so this may not be a problematic instance of overfitting. In most cases, whatever strategy yields the highest testing score is the best strategy.

We will reserve the step of fitting on all the training data for later, when we've tried other models in cross-validation to find the best model.
