You're reading from Data Science Projects with Python A case study approach to successful data science projects using Python, pandas, and scikit-learn

Product type Paperback

Published in Apr 2019

Publisher Packt

ISBN-13 9781838551025

Length 374 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Data Science

Author (1):

Stephen Klosterman

Table of Contents (9) Chapters

Data Science Projects with Python

Preface

1. Data Exploration and Cleaning

2. Introduction to Scikit-Learn and Model Evaluation

3. Details of Logistic Regression and Feature Exploration

4. The Bias-Variance Trade-off

5. Decision Trees and Random Forests

6. Imputation of Missing Data, Financial Analysis, and Delivery to Client

Appendix

Chapter 3: Details of Logistic Regression and Feature Exploration

Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients

The first few steps are similar to things we've done in previous activities:

Create a train/test split (80/20) with PAY_1 and LIMIT_BAL as features:
```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['PAY_1', 'LIMIT_BAL']].values, df['default payment next month'].values,
    test_size=0.2, random_state=24)
```

Import LogisticRegression, with the default options, but set the solver to 'liblinear':
```
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver='liblinear')
```

Train on the training data and obtain predicted classes, as well as class probabilities, using the testing data:
```
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
y_pred_proba = lr_model.predict_proba(X_test)
```
Pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.
First, let's create the array of features, with a column of 1s added, using horizontal stacking:
```
import numpy as np
ones_and_features = np.hstack([np.ones((X_test.shape[0],1)), X_test])
```
Now we need the intercept and coefficients, which we reshape and concatenate from scikit-learn output:
```
intercept_and_coefs = np.concatenate([lr_model.intercept_.reshape(1,1), lr_model.coef_], axis=1)
```
To repeatedly multiply the intercept and coefficients by all the rows of ones_and_features, and take the sum of each row (that is, find the linear combination), you could write this all out using multiplication and addition. However, it's much faster to use the dot product:
```
X_lin_comb = np.dot(intercept_and_coefs, np.transpose(ones_and_features))
```
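As a small check of this equivalence, the sketch below uses made-up numbers (not values from the case study data) to show that the dot product gives the same result as writing out the multiplication and addition for each row. The names prefixed with toy_ are hypothetical stand-ins for intercept_and_coefs and ones_and_features.

```python
import numpy as np

# Hypothetical intercept and two coefficients, shaped (1, 3) like intercept_and_coefs
toy_intercept_and_coefs = np.array([[-1.0, 0.5, 2.0]])

# Hypothetical features for three samples, with a leading column of 1s
toy_ones_and_features = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, -2.0, 0.5],
])

# Linear combination via the dot product: shape (1, 3), one value per sample
lin_comb_dot = np.dot(toy_intercept_and_coefs,
                      np.transpose(toy_ones_and_features))

# The same calculation written out with multiplication and addition
lin_comb_loop = np.array([[
    sum(w * x for w, x in zip(toy_intercept_and_coefs[0], row))
    for row in toy_ones_and_features
]])

print(lin_comb_dot)
print(np.allclose(lin_comb_dot, lin_comb_loop))
```

The result has one row and as many columns as there are samples, which is why the manually calculated arrays later in this activity have a different shape from the scikit-learn output.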
Now X_lin_comb has the argument we need to pass to the sigmoid function we defined, in order to calculate predicted probabilities:
```
y_pred_proba_manual = sigmoid(X_lin_comb)
```
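The sigmoid function referred to here was defined earlier in the chapter and is not shown in this solution. A minimal sketch, assuming the standard logistic form 1 / (1 + e^(-x)), might look like this:

```python
import numpy as np

def sigmoid(X):
    """Standard logistic function, applied element-wise to an array."""
    return 1 / (1 + np.exp(-X))

# Quick check on a (1, 3) array, the same layout as X_lin_comb
print(sigmoid(np.array([[0.0, 2.0, -2.0]])))
```

An input of 0 maps to a probability of 0.5, which is why thresholding the probability at 0.5 is the same as asking whether the linear combination is non-negative.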
Using a threshold of 0.5, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.
The manually predicted probabilities, y_pred_proba_manual, should be the same as y_pred_proba; we'll check that momentarily. First, manually predict the classes with the threshold:
```
y_pred_manual = y_pred_proba_manual >= 0.5
```
This array will have a different shape than y_pred, but it should contain the same values. We can check whether all the elements of two arrays are equal like this:
Figure 6.52: Equality of NumPy arrays
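The figure is not reproduced here. One way to perform such a check, shown with small made-up arrays standing in for y_pred and y_pred_manual, is to bring both arrays to a common shape and use np.array_equal; this is a sketch and may differ from the code in the original figure.

```python
import numpy as np

# Stand-in for y_pred: scikit-learn returns a 1-D integer array, shape (n_samples,)
toy_y_pred = np.array([0, 1, 0, 0])

# Stand-in for y_pred_manual: a 2-D Boolean array, shape (1, n_samples)
toy_y_pred_manual = np.array([[0.2, 0.7, 0.4, 0.1]]) >= 0.5

print(toy_y_pred.shape, toy_y_pred_manual.shape)

# Flatten the manual predictions so both arrays have the same shape,
# then test whether every element matches (True compares equal to 1)
arrays_match = np.array_equal(toy_y_pred, toy_y_pred_manual.reshape(-1))
print(arrays_match)
```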
Calculate ROC AUC using both scikit-learn's predicted probabilities, and your manually predicted probabilities, and compare.
First, import the following:
```
from sklearn.metrics import roc_auc_score
```
Then, calculate this metric on both versions, taking care to access the correct column, or reshape as necessary:
Figure 6.53: Calculating the ROC AUCs from predicted probabilities
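The figure is not reproduced here. The sketch below walks through the same comparison on synthetic data rather than the case study data, so the AUC it prints is not the one reported in this activity. The names X_toy, y_toy, and toy_model are hypothetical; the point is only to show column selection for the scikit-learn probabilities and reshaping for the manual ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: two features and a noisy binary label
rng = np.random.RandomState(24)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1]
         + rng.normal(size=200) > 0).astype(int)

toy_model = LogisticRegression(solver='liblinear')
toy_model.fit(X_toy, y_toy)

# scikit-learn probabilities: column 1 holds the positive-class probability
proba_sklearn = toy_model.predict_proba(X_toy)[:, 1]

# Manual probabilities from the intercept and coefficients, shape (1, n_samples)
ones_and_features = np.hstack([np.ones((X_toy.shape[0], 1)), X_toy])
intercept_and_coefs = np.concatenate(
    [toy_model.intercept_.reshape(1, 1), toy_model.coef_], axis=1)
lin_comb = np.dot(intercept_and_coefs, np.transpose(ones_and_features))
proba_manual = 1 / (1 + np.exp(-lin_comb))

auc_sklearn = roc_auc_score(y_toy, proba_sklearn)
# Flatten the manual probabilities to 1-D before scoring
auc_manual = roc_auc_score(y_toy, proba_manual.reshape(-1))
print(auc_sklearn, auc_manual)
```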

The AUCs are, in fact, the same. What have we done here? We've confirmed that all we really need from this fitted scikit-learn model are three numbers: the intercept and the two coefficients. Once we have these, we can create model predictions with a few lines of code and basic mathematical functions, and they are equivalent to the predictions made directly by scikit-learn.

This is good to confirm your understanding, but otherwise, why would you ever want to do this? We'll talk about model deployment in the final chapter. However, depending on your circumstances, you may be in a situation where you don't have access to Python in the environment where new features will need to be input to the model for prediction. For example, you may need to make predictions entirely in SQL. While this is a limitation in general, with logistic regression you can use mathematical functions that are available in SQL to re-create the logistic regression prediction, only needing to copy and paste the intercept and coefficients somewhere in your SQL code. The dot product may not be available, but you can use multiplication and addition to accomplish the same purpose.
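To illustrate predicting without NumPy or scikit-learn, here is a sketch in plain Python that uses only multiplication, addition, and the exponential function, which is the kind of arithmetic the paragraph above describes porting to SQL. The intercept and coefficient values are hypothetical placeholders, not the values fitted in this activity.

```python
import math

# Hypothetical values standing in for a fitted model's parameters; in practice
# these would be copied from lr_model.intercept_ and lr_model.coef_
INTERCEPT = -1.0
COEF_1 = 0.5
COEF_2 = 2.0

def predict_proba_manual(feature_1, feature_2):
    """Logistic regression prediction using only arithmetic and exp()."""
    lin_comb = INTERCEPT + COEF_1 * feature_1 + COEF_2 * feature_2
    return 1 / (1 + math.exp(-lin_comb))

# Linear combination is -1.0 + 0.5 * 2.0 + 2.0 * 0.0 = 0.0
print(predict_proba_manual(2.0, 0.0))
```

Because the function body is a single arithmetic expression, it can be translated term by term into any environment that offers an exponential function.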

Now, what about the results themselves? We've seen that we can slightly boost model performance above our previous efforts: using just LIMIT_BAL as a feature in the previous chapter's activity, the ROC AUC was 0.62, compared with 0.63 here. In the next chapter, we'll learn advanced techniques with logistic regression that we can use to boost performance higher than this.
