Data Science Projects with Python

A case study approach to successful data science projects using Python, pandas, and scikit-learn

Product type Paperback
Published in Apr 2019
Publisher Packt
ISBN-13 9781838551025
Length 374 pages
Edition 1st Edition
Author: Stephen Klosterman
Table of Contents

Preface
1. Data Exploration and Cleaning
2. Introduction to Scikit-Learn and Model Evaluation
3. Details of Logistic Regression and Feature Exploration
4. The Bias-Variance Trade-off
5. Decision Trees and Random Forests
6. Imputation of Missing Data, Financial Analysis, and Delivery to Client
Appendix

Chapter 3: Details of Logistic Regression and Feature Exploration


Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients

The first few steps are similar to things we've done in previous activities:

  1. Create a train/test split (80/20) with PAY_1 and LIMIT_BAL as features:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        df[['PAY_1', 'LIMIT_BAL']].values,
        df['default payment next month'].values,
        test_size=0.2, random_state=24)
  2. Import LogisticRegression and instantiate it with the default options, except for the solver, which should be set to 'liblinear':

    from sklearn.linear_model import LogisticRegression
    lr_model = LogisticRegression(solver='liblinear')
  3. Train on the training data and obtain predicted classes, as well as class probabilities, using the testing data:

    lr_model.fit(X_train, y_train)
    y_pred = lr_model.predict(X_test)
    y_pred_proba = lr_model.predict_proba(X_test)
  4. Pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.

    First, let's create the array of features, with a column of 1s added, using horizontal stacking:

    import numpy as np
    ones_and_features = np.hstack([np.ones((X_test.shape[0], 1)), X_test])

    Now we need the intercept and coefficients, which we reshape and concatenate from scikit-learn output:

    intercept_and_coefs = np.concatenate([lr_model.intercept_.reshape(1,1), lr_model.coef_], axis=1)

    To multiply the intercept and coefficients by every row of ones_and_features and take the sum of each row (that is, find the linear combination), you could write this all out using multiplication and addition. However, it's much faster to use the dot product:

    X_lin_comb = np.dot(intercept_and_coefs, np.transpose(ones_and_features))

    Now X_lin_comb has the argument we need to pass to the sigmoid function we defined, in order to calculate predicted probabilities:

    y_pred_proba_manual = sigmoid(X_lin_comb)
  5. Using a threshold of 0.5, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.

    The manually predicted probabilities, y_pred_proba_manual, should be the same as y_pred_proba; we'll check that momentarily. First, manually predict the classes with the threshold:

    y_pred_manual = y_pred_proba_manual >= 0.5

    This array will have a different shape than y_pred, but it should contain the same values. We can check whether all the elements of two arrays are equal like this:

    Figure 6.52: Equality of NumPy arrays

  6. Calculate the ROC AUC using both scikit-learn's predicted probabilities and your manually predicted probabilities, and compare.

    First, import the following:

    from sklearn.metrics import roc_auc_score

    Then, calculate this metric on both versions, taking care to access the correct column, or reshape as necessary:

    Figure 6.53: Calculating the ROC AUCs from predicted probabilities
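
The checks in steps 4 to 6 can be sketched end to end as follows. This is a minimal, self-contained illustration rather than the code from the figures: it assumes synthetic stand-in data instead of the case study's df, assumes the usual form of the sigmoid function, and uses illustrative names (classes_match, auc_sklearn, auc_manual) that do not appear in the original.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def sigmoid(X):
    """Logistic (sigmoid) function, applied element-wise (assumed form)."""
    return 1 / (1 + np.exp(-X))


# Hypothetical synthetic data standing in for the two features and the response
rng = np.random.RandomState(24)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=24)

lr_model = LogisticRegression(solver='liblinear')
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
y_pred_proba = lr_model.predict_proba(X_test)

# Manual probabilities from the intercept and the two coefficients
ones_and_features = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
intercept_and_coefs = np.concatenate(
    [lr_model.intercept_.reshape(1, 1), lr_model.coef_], axis=1)
X_lin_comb = np.dot(intercept_and_coefs, np.transpose(ones_and_features))
y_pred_proba_manual = sigmoid(X_lin_comb)  # shape (1, n_test)
y_pred_manual = y_pred_proba_manual >= 0.5

# y_pred has shape (n_test,), so reshape it to (1, n_test) before comparing
classes_match = np.array_equal(y_pred.reshape(1, -1), y_pred_manual)

# Column 1 of predict_proba holds the positive-class probability;
# the manual array is flattened to one dimension for roc_auc_score
auc_sklearn = roc_auc_score(y_test, y_pred_proba[:, 1])
auc_manual = roc_auc_score(y_test, y_pred_proba_manual.reshape(-1))
```

With the case study data, X and y would be replaced by the PAY_1 and LIMIT_BAL columns and the response variable from df, as in step 1.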

The AUCs are, in fact, the same. What have we done here? We've confirmed that all we really need from this fitted scikit-learn model are three numbers: the intercept and the two coefficients. Once we have these, we can create model predictions with a few lines of code and standard mathematical functions, and these predictions are equivalent to the ones made directly by scikit-learn.

This is good to confirm your understanding, but otherwise, why would you ever want to do this? We'll talk about model deployment in the final chapter. However, depending on your circumstances, you may be in a situation where you don't have access to Python in the environment where new features will need to be input to the model for prediction. For example, you may need to make predictions entirely in SQL. While this is a limitation in general, with logistic regression you can use mathematical functions that are available in SQL to re-create the logistic regression prediction, only needing to copy and paste the intercept and coefficients somewhere in your SQL code. The dot product may not be available, but you can use multiplication and addition to accomplish the same purpose.
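
To illustrate the point about multiplication and addition, here is a hedged sketch of the same prediction written without a dot product, using only arithmetic and an exponential, which most SQL dialects typically expose as an EXP function. The function name, its parameters, and the numeric values are hypothetical placeholders, not the fitted values from this activity.

```python
import math


def predict_default_probability(pay_1, limit_bal,
                                intercept, coef_pay_1, coef_limit_bal):
    """Predict a probability from three model numbers using plain arithmetic."""
    # Linear combination written out term by term, as it would be in SQL
    lin_comb = intercept + coef_pay_1 * pay_1 + coef_limit_bal * limit_bal
    # Sigmoid: 1 / (1 + exp(-x))
    return 1 / (1 + math.exp(-lin_comb))


# Placeholder intercept and coefficients (assumptions for illustration only)
example_probability = predict_default_probability(
    pay_1=2, limit_bal=50000,
    intercept=0.0, coef_pay_1=0.5, coef_limit_bal=-0.00001)
```

In practice, the three numbers would be copied from lr_model.intercept_ and lr_model.coef_ after fitting.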

Now, what about the results themselves? We've seen that we can slightly boost model performance above our previous efforts: with LIMIT_BAL as the only feature in the previous chapter's activity, the ROC AUC was 0.62, slightly below the 0.63 obtained here. In the next chapter, we'll learn advanced techniques with logistic regression that we can use to boost performance higher than this.
