Data Science for Marketing Analytics

Achieve your marketing goals with the data analytics power of Python

Product type: Paperback
Published: Mar 2019
ISBN-13: 9781789959413
Length: 420 pages
Edition: 1st Edition
Authors (3): Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar
Table of Contents (12)

Preface
1. Data Preparation and Cleaning
2. Data Exploration and Visualization
3. Unsupervised Learning: Customer Segmentation
4. Choosing the Best Segmentation Approach
5. Predicting Customer Revenue Using Linear Regression
6. Other Regression Techniques and Tools for Evaluation
7. Supervised Learning: Predicting Customer Churn
8. Fine-Tuning Classification Algorithms
9. Modeling Customer Choice
Appendix

Chapter 8: Fine-Tuning Classification Algorithms


Activity 15: Implementing Different Classification Algorithms

  1. Import the logistic regression library:

    from sklearn.linear_model import LogisticRegression
  2. Fit the model:

    clf_logistic = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train[top7_features], y_train)
    clf_logistic
  3. Score the model:

    clf_logistic.score(X_test[top7_features], y_test)
  4. Import the svm library:

    from sklearn import svm
  5. Fit the model:

    clf_svm=svm.SVC(kernel='linear', C=1)
    clf_svm.fit(X_train[top7_features],y_train)
  6. Score the model:

    clf_svm.score(X_test[top7_features], y_test)
  7. Import the decision tree library:

    from sklearn import tree
  8. Fit the model:

    clf_decision = tree.DecisionTreeClassifier()
    clf_decision.fit(X_train[top7_features],y_train)
  9. Score the model:

    clf_decision.score(X_test[top7_features], y_test)
  10. Import a random forest library:

    from sklearn.ensemble import RandomForestClassifier
  11. Fit the model:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None, min_samples_split=7, random_state=0)
    clf_random.fit(X_train[top7_features], y_train)
  12. Score the model.

    clf_random.score(X_test[top7_features], y_test)
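
The twelve steps above repeat the same fit-and-score pattern for each algorithm, so they can be condensed into a loop. The following is a minimal sketch only: it assumes a synthetic dataset from make_classification in place of the churn data, and it adds random_state=0 to the decision tree (not in the original step) so that runs are repeatable. The accuracies it prints will not match those of the activity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (assumption: 7 numeric features).
X, y = make_classification(n_samples=500, n_features=7, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Estimator settings follow the steps above; the decision tree's
# random_state is an addition for reproducibility.
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=0, solver='lbfgs'),
    'SVM': SVC(kernel='linear', C=1),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=20, max_depth=None,
                                            min_samples_split=7,
                                            random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

for name, acc in scores.items():
    print('{}: {:.3f}'.format(name, acc))
```

Keeping the estimators in a dictionary makes it easy to add or remove an algorithm without duplicating the fit and score calls.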

From the results, you can conclude that the random forest has outperformed the rest of the algorithms, with the decision tree having the lowest accuracy. In a later section, you will learn why accuracy alone is not a reliable measure of a model's performance.
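
One reason accuracy can mislead is class imbalance. The sketch below uses made-up labels (90 customers who stay and 10 who churn, not the book's data) to show that a model which never predicts churn still reaches 90% accuracy while identifying none of the churners.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 customers who stay (0) and 10 who churn (1).
y_true = np.array([0] * 90 + [1] * 10)

# A useless "model" that always predicts the majority class (no churn).
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
churn_recall = recall_score(y_true, y_pred, pos_label=1)

print('Accuracy:', acc)                  # 90 of 100 labels are correct
print('Recall on churn:', churn_recall)  # none of the 10 churners is found
```

Recall, precision, and the confusion matrix expose this failure, which is why the later activities report them alongside accuracy.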

Activity 16: Tuning and Optimizing the Model

  1. Store five of the seven features, that is, Avg_Calls_Weekdays, Current_Bill_Amt, Avg_Calls, Account_Age, and Avg_Days_Delinquent, in a variable top5_features. Store the other two features, Percent_Increase_MOM and Complaint_Code, in a variable top2_features.

    from sklearn import preprocessing
    ## Features to transform
    top5_features=['Avg_Calls_Weekdays', 'Current_Bill_Amt', 'Avg_Calls', 'Account_Age','Avg_Days_Delinquent']
    ## Features Left
    top2_features=['Percent_Increase_MOM','Complaint_Code']
  2. Use StandardScaler to standardize the five features.

    scaler = preprocessing.StandardScaler().fit(X_train[top5_features])
    X_train_scalar=pd.DataFrame(scaler.transform(X_train[top5_features]),columns = X_train[top5_features].columns)
  3. Create a variable X_train_scalar_combined that combines the five standardized features with the two features (Percent_Increase_MOM and Complaint_Code) that were not standardized.

    X_train_scalar_combined=pd.concat([X_train_scalar,  X_train[top2_features].reset_index(drop=True)], axis=1, sort=False)
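  4. Apply the same scaler, already fitted on the training data, to the test data and combine the result with the two unscaled features (X_test_scalar_combined).

    X_test_scalar=pd.DataFrame(scaler.transform(X_test[top5_features]),columns = X_test[top5_features].columns)
    X_test_scalar_combined=pd.concat([X_test_scalar,  X_test[top2_features].reset_index(drop=True)], axis=1, sort=False)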
  5. Fit the random forest model.

    clf_random.fit(X_train_scalar_combined, y_train)
  6. Score the random forest model.

    clf_random.score(X_test_scalar_combined, y_test)
  7. Import the library for grid search and use the given parameters:

    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    parameters = [ {'min_samples_split': [4,5,7,9,10], 'n_estimators':[10,20,30,40,50,100,150,160,200,250,300],'max_depth': [2,5,7,10]}]
  8. Use grid search CV with stratified k-fold to find out the best parameters.

    clf_random_grid = GridSearchCV(RandomForestClassifier(), parameters, cv = StratifiedKFold(n_splits = 10))
    clf_random_grid.fit(X_train_scalar_combined, y_train)
  9. Print the best score and best parameters.

    print('best score train:', clf_random_grid.best_score_)
    print('best parameters train: ', clf_random_grid.best_params_)
  10. Score the model using the test data.

    clf_random_grid.score(X_test_scalar_combined, y_test)
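
The scaling and grid search steps above can also be expressed as a single scikit-learn pipeline. This is an alternative sketch, not the activity's own solution: it assumes synthetic data that borrows only the column names from the text, and it uses a deliberately small parameter grid so that it runs quickly. With a pipeline, the scaler is refitted on the training portion of every cross-validation fold, so the validation fold never influences the scaling statistics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

top5_features = ['Avg_Calls_Weekdays', 'Current_Bill_Amt', 'Avg_Calls',
                 'Account_Age', 'Avg_Days_Delinquent']
top2_features = ['Percent_Increase_MOM', 'Complaint_Code']

# Synthetic stand-in for the churn data; only the column names match the text.
X_arr, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=top5_features + top2_features)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Standardize the five features and pass the other two through unchanged.
preprocess = ColumnTransformer(
    [('scale', StandardScaler(), top5_features)],
    remainder='passthrough')

pipe = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestClassifier(random_state=0)),
])

# A small illustrative grid; the activity searches a much larger one.
param_grid = {
    'forest__n_estimators': [10, 20],
    'forest__max_depth': [2, 5],
    'forest__min_samples_split': [4, 10],
}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5))
search.fit(X_train, y_train)

print('best score (cross-validated):', search.best_score_)
print('best parameters:', search.best_params_)
print('test accuracy:', search.score(X_test, y_test))
```

The fitted search object applies the same preprocessing to X_test automatically, so no separate transform of the test data is needed.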

Activity 17: Comparison of the Models

  1. Import the required libraries.

    from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
    from sklearn import metrics
  2. Fit the random forest classifier with the parameters obtained from grid search.

    clf_random_grid = RandomForestClassifier(n_estimators=100, max_depth=7,
         min_samples_split=10, random_state=0)
    clf_random_grid.fit(X_train_scalar_combined, y_train)
  3. Predict on the standardized scalar test data X_test_scalar_combined.

    y_pred=clf_random_grid.predict(X_test_scalar_combined)
  4. Print the classification report.

    target_names = ['No Churn', 'Churn']
    print(classification_report(y_test, y_pred, target_names=target_names))
  5. Plot the confusion matrix.

    cm = confusion_matrix(y_test, y_pred) 
    cm_df = pd.DataFrame(cm,
                         index = ['No Churn','Churn'], 
                         columns = ['No Churn','Churn'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True,fmt='g',cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
  6. Import the functions for the ROC curve and AUC.

    from sklearn.metrics import roc_curve,auc
  7. Use the classifiers created earlier, that is, clf_logistic, clf_svm, and clf_decision from Activity 15, and clf_random_grid from step 2. Create a list of dictionaries, one per model, holding a label and the model object.

    models = [
    {
        'label': 'Logistic Regression',
        'model': clf_logistic,
    },
    {
        'label': 'SVM',
        'model': clf_svm,
    },
    {
        'label': 'Decision Tree',
        'model': clf_decision,
    },
    {
        'label': 'Random Forest Grid Search',
        'model': clf_random_grid,
    }
    ]
  8. Plot the ROC curve.

    for m in models:
        model = m['model']
        # Refit each model on the standardized training data
        model.fit(X_train_scalar_combined, y_train)
        # Hard class predictions (0/1), so each curve has at most one interior point
        y_pred=model.predict(X_test_scalar_combined)
        fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=1)
        roc_auc = metrics.auc(fpr, tpr)
        plt.plot(fpr, tpr, label='%s AUC = %0.2f' % (m['label'], roc_auc))
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.ylabel('Sensitivity(True Positive Rate)')
    plt.xlabel('1-Specificity(False Positive Rate)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

Comparing the AUC scores of the different algorithms (logistic regression: 0.78, SVM: 0.79, decision tree: 0.77, and random forest: 0.82), we can conclude that random forest is the best-performing model, with an AUC score of 0.82, and can be chosen for the marketing team to predict customer churn.
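
The ROC step above passes hard 0/1 predictions to roc_curve. A common alternative is to pass continuous scores, such as predicted probabilities or decision-function values, which traces the curve across many thresholds. The sketch below illustrates that variant under stated assumptions: the data is synthetic, the helper positive_class_scores is a hypothetical name introduced here, and the resulting AUC values are not comparable to the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the standardized churn data (assumption).
X, y = make_classification(n_samples=500, n_features=7, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)


def positive_class_scores(model, X):
    """Return a continuous score for the positive class (label 1)."""
    if hasattr(model, 'decision_function'):
        # Signed distance to the decision boundary (logistic regression, SVM).
        return model.decision_function(X)
    # Tree-based models: probability of class 1.
    return model.predict_proba(X)[:, 1]


models = {
    'Logistic Regression': LogisticRegression(random_state=0, solver='lbfgs'),
    'SVM': SVC(kernel='linear', C=1),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=7,
                                            min_samples_split=10,
                                            random_state=0),
}

aucs = {}
for label, model in models.items():
    model.fit(X_train, y_train)
    scores = positive_class_scores(model, X_test)
    # fpr and tpr can be plotted with plt.plot exactly as in the step above.
    fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label=1)
    aucs[label] = roc_auc_score(y_test, scores)
    print('%s AUC = %0.2f (%d curve points)' % (label, aucs[label], len(fpr)))
```

Because a score-based AUC reflects how well a model ranks customers rather than how it performs at one cutoff, it can order the models differently from an AUC computed on hard predictions.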
