You're reading from Data Science for Marketing Analytics Achieve your marketing goals with the data analytics power of Python

Product type Paperback

Published in Mar 2019

ISBN-13 9781789959413

Length 420 pages

Edition 1st Edition

Languages

Python

Tools

Pandas

Concepts

Data Science

Authors (3):

Tommy Blanchard

Debasish Behera

Pranshu Bhatnagar

Chapter 8: Fine-Tuning Classification Algorithms

Activity 15: Implementing Different Classification Algorithms

Import the logistic regression library:

from sklearn.linear_model import LogisticRegression

Fit the model:

clf_logistic = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train[top7_features], y_train)
clf_logistic

Score the model:

clf_logistic.score(X_test[top7_features], y_test)

Import the svm library:
```
from sklearn import svm
```

Fit the model:

clf_svm=svm.SVC(kernel='linear', C=1)
clf_svm.fit(X_train[top7_features],y_train)

Score the model:

clf_svm.score(X_test[top7_features], y_test)

Import the decision tree library:
```
from sklearn import tree
```

Fit the model:

clf_decision = tree.DecisionTreeClassifier()
clf_decision.fit(X_train[top7_features],y_train)

Score the model:

clf_decision.score(X_test[top7_features], y_test)

Import the random forest classifier:

from sklearn.ensemble import RandomForestClassifier

Fit the model:

clf_random = RandomForestClassifier(n_estimators=20, max_depth=None, min_samples_split=7, random_state=0)
clf_random.fit(X_train[top7_features], y_train)

Score the model.

clf_random.score(X_test[top7_features], y_test)

From the results, you can conclude that the random forest outperformed the other algorithms, with the decision tree having the lowest accuracy. In a later section, you will learn why accuracy alone is not a reliable measure of a model's performance.
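To make the caveat about accuracy concrete, here is a minimal sketch using made-up labels (an assumption, not the churn dataset). A trivial predictor that always answers "no churn" still reaches high accuracy when churners are rare, while catching none of them.

```python
# Hypothetical labels: 90 non-churners (0) and 10 churners (1); not from the book's dataset
y_true = [0] * 90 + [1] * 10
# A trivial "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the churn class: share of actual churners that were caught
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_churn = true_positives / sum(y_true)

print('accuracy:', accuracy)
print('churn recall:', recall_churn)
```

The high accuracy here comes entirely from the class imbalance, which is why metrics such as recall, precision, and AUC are usually inspected alongside it.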

Activity 16: Tuning and Optimizing the Model

Store five of the seven features, that is, Avg_Calls_Weekdays, Current_Bill_Amt, Avg_Calls, Account_Age, and Avg_Days_Delinquent, in a variable top5_features. Store the other two features, Percent_Increase_MOM and Complaint_Code, in a variable top2_features.
```
from sklearn import preprocessing
## Features to transform
top5_features=['Avg_Calls_Weekdays', 'Current_Bill_Amt', 'Avg_Calls', 'Account_Age','Avg_Days_Delinquent']
## Features Left
top2_features=['Percent_Increase_MOM','Complaint_Code']
```

Use StandardScaler to standardize the five features.

scaler = preprocessing.StandardScaler().fit(X_train[top5_features])
X_train_scalar=pd.DataFrame(scaler.transform(X_train[top5_features]),columns = X_train[top5_features].columns)
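As a rough illustration of what the scaler does, the sketch below uses a tiny made-up data frame (the values and the `train`/`test` names are assumptions, not the activity's data). It checks that `transform` matches the manual formula (x - mean) / standard deviation, and that the test rows are scaled with the statistics learned from the training rows.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data standing in for X_train[top5_features] and X_test[top5_features]
train = pd.DataFrame({'Avg_Calls': [10.0, 20.0, 30.0], 'Account_Age': [1.0, 2.0, 3.0]})
test = pd.DataFrame({'Avg_Calls': [40.0], 'Account_Age': [2.0]})

# Fit on the training data only, so the test set does not leak into the statistics
scaler = preprocessing.StandardScaler().fit(train)

# transform() computes (x - mean_) / scale_ column by column
manual = (train.values - scaler.mean_) / scaler.scale_
train_scaled = scaler.transform(train)

# The test set reuses the training mean and standard deviation
test_scaled = scaler.transform(test)

print(np.allclose(manual, train_scaled))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test columns generally do not, because they are shifted and scaled by the training statistics.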

Create a variable X_train_scalar_combined by combining the five standardized features with the two features (Percent_Increase_MOM and Complaint_Code) that were not standardized.
```
X_train_scalar_combined=pd.concat([X_train_scalar,  X_train[top2_features].reset_index(drop=True)], axis=1, sort=False)
```
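The `reset_index(drop=True)` call matters because `pd.concat` with `axis=1` aligns rows by index label, not by position. The following sketch uses small made-up frames (hypothetical values and index labels) to show the difference.

```python
import pandas as pd

# Hypothetical rows as they might look after a shuffled train/test split
X_part = pd.DataFrame({'Complaint_Code': [1, 0, 2]}, index=[7, 2, 5])
# A frame rebuilt from a NumPy array gets a fresh 0..n-1 index
scaled = pd.DataFrame({'Avg_Calls': [0.5, -1.2, 0.7]})

# Without resetting, concat aligns on index labels and introduces NaN rows
misaligned = pd.concat([scaled, X_part], axis=1)

# Resetting the index restores positional, row-by-row alignment
aligned = pd.concat([scaled, X_part.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
print(aligned)
```

Without the reset, the combined frame would contain extra rows and missing values, which most scikit-learn estimators would reject when fitting.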

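Apply the same scaler, fitted on the training data, to the test data and combine the result into X_test_scalar_combined.

X_test_scalar=pd.DataFrame(scaler.transform(X_test[top5_features]),columns = X_test[top5_features].columns)
X_test_scalar_combined=pd.concat([X_test_scalar,  X_test[top2_features].reset_index(drop=True)], axis=1, sort=False)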

Fit the random forest model.

clf_random.fit(X_train_scalar_combined, y_train)

Score the random forest model.

clf_random.score(X_test_scalar_combined, y_test)

Import the library for grid search and use the given parameters:

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
parameters = [ {'min_samples_split': [4,5,7,9,10], 'n_estimators':[10,20,30,40,50,100,150,160,200,250,300],'max_depth': [2,5,7,10]}]

Use grid search CV with stratified k-fold to find out the best parameters.

clf_random_grid = GridSearchCV(RandomForestClassifier(), parameters, cv = StratifiedKFold(n_splits = 10))
clf_random_grid.fit(X_train_scalar_combined, y_train)

Print the best score and best parameters.

print('best score train:', clf_random_grid.best_score_)
print('best parameters train: ', clf_random_grid.best_params_)

Score the model using the test data.

clf_random_grid.score(X_test_scalar_combined, y_test)
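The grid above is fairly large, so it may help to count the work involved before running it. The sketch below counts the candidate settings in the activity's grid and then runs the same kind of search on synthetic data with a much smaller grid. The synthetic data, the `small_grid` values, and the 5-fold setting are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold

# The activity's grid: 5 * 11 * 4 = 220 candidate settings, each fitted on 10 folds
full_grid = {'min_samples_split': [4, 5, 7, 9, 10],
             'n_estimators': [10, 20, 30, 40, 50, 100, 150, 160, 200, 250, 300],
             'max_depth': [2, 5, 7, 10]}
n_candidates = len(ParameterGrid(full_grid))
print('candidates:', n_candidates, 'fits with 10 folds:', n_candidates * 10)

# Synthetic, imbalanced stand-in data (an assumption; not the churn dataset)
X, y = make_classification(n_samples=200, n_features=7, weights=[0.8, 0.2], random_state=0)

# A deliberately small grid so the sketch runs in seconds
small_grid = {'min_samples_split': [4, 10], 'n_estimators': [10, 20], 'max_depth': [2, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid,
                      cv=StratifiedKFold(n_splits=5))
search.fit(X, y)

print('best score:', search.best_score_)
print('best parameters:', search.best_params_)
```

Here `best_score_` is the mean cross-validated score of the best setting on the training folds, so it is not directly comparable to the score on the held-out test data.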

Activity 17: Comparison of the Models

Import the required libraries.

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn import metrics

Fit the random forest classifier with the parameters obtained from grid search.

clf_random_grid = RandomForestClassifier(n_estimators=100, max_depth=7,
     min_samples_split=10, random_state=0)
clf_random_grid.fit(X_train_scalar_combined, y_train)

Predict on the standardized test data, X_test_scalar_combined.
```
y_pred=clf_random_grid.predict(X_test_scalar_combined)
```

Print the classification report.

target_names = ['No Churn', 'Churn']
print(classification_report(y_test, y_pred, target_names=target_names))

Plot the confusion matrix.

import matplotlib.pyplot as plt
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
                     index = ['No Churn','Churn'], 
                     columns = ['No Churn','Churn'])
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
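The figures in the classification report can be derived directly from the four cells of the confusion matrix. The sketch below shows the arithmetic with made-up counts (these are illustrative assumptions, not the activity's results), using the same layout as above: rows are true classes and columns are predicted classes.

```python
# Hypothetical confusion-matrix counts (rows: true class, columns: predicted class);
# these numbers are illustrative and are not the activity's results
cm = [[850, 50],   # true No Churn: [predicted No Churn, predicted Churn]
      [60, 40]]    # true Churn:    [predicted No Churn, predicted Churn]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the customers flagged as churn, how many churned
recall = tp / (tp + fn)      # of the customers who churned, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print('accuracy: %.3f' % accuracy)
print('precision: %.3f, recall: %.3f, f1: %.3f' % (precision, recall, f1))
```

With these counts the accuracy looks healthy while precision and recall for the churn class are both below one half, which is the kind of gap the report is meant to expose.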

Import the functions for the ROC curve and AUC.

from sklearn.metrics import roc_curve,auc

Use the classifiers created in the previous activities, that is, clf_logistic, clf_svm, clf_decision, and clf_random_grid. Create a list of dictionaries, one per model, holding a label and the model object.

models = [
{
    'label': 'Logistic Regression',
    'model': clf_logistic,
},
{
    'label': 'SVM',
    'model': clf_svm,
},
{
    'label': 'Decision Tree',
    'model': clf_decision,
},
{
    'label': 'Random Forest Grid Search',
    'model': clf_random_grid,
}
]

Plot the ROC curve.

for m in models:
    model = m['model']
    # Refit each model on the standardized training data
    model.fit(X_train_scalar_combined, y_train)
    # Hard class labels are used here, so each curve has a single operating point
    y_pred=model.predict(X_test_scalar_combined)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=1)
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label='%s AUC = %0.2f' % (m['label'], roc_auc))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('Sensitivity(True Positive Rate)')
plt.xlabel('1-Specificity(False Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
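The loop above passes predicted class labels to `roc_curve`. A common alternative, shown here only as a sketch and not as the book's method, is to pass continuous scores such as the predicted probability of the positive class, which traces the curve over many thresholds. The synthetic data and variable names below are assumptions, and the AUC values this approach produces would differ from those obtained with hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the churn dataset)
X, y = make_classification(n_samples=500, n_features=7, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_tr, y_tr)

# Hard labels give only one informative threshold on the curve
auc_labels = roc_auc_score(y_te, model.predict(X_te))

# Probabilities for the positive class trace the curve across many thresholds
scores = model.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores, pos_label=1)
auc_scores = roc_auc_score(y_te, scores)

print('AUC from labels: %.3f' % auc_labels)
print('AUC from probabilities: %.3f' % auc_scores)
print('curve points:', len(fpr))
```

Note that `svm.SVC` only exposes `predict_proba` when it is created with `probability=True`. Its `decision_function` output can be passed to `roc_curve` as the score instead.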

Comparing the AUC scores of the different algorithms (logistic regression: 0.78, SVM: 0.79, decision tree: 0.77, and random forest: 0.82), we can conclude that random forest is the best-performing model, with an AUC score of 0.82, and can be chosen for the marketing team to predict customer churn.