Chapter 9: Modeling Customer Choice
Activity 18: Performing Multiclass Classification and Evaluating Performance
Import pandas, numpy, RandomForestClassifier, train_test_split, classification_report, confusion_matrix, accuracy_score, metrics, seaborn, matplotlib, and precision_recall_fscore_support:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support
import matplotlib.pyplot as plt
import seaborn as sns
Load the marketing data using pandas:
data = pd.read_csv(r'MarketingData.csv')
data.head(5)
Check the shape, the missing values, and show the summary report of the data:
data.shape
The shape should be (20000,7). Check for missing values:
data.isnull().values.any()
This will return False as there are no null values in the data. See the summary report of the data using the describe function:
data.describe()
Check the target variable, Channel, for the number of transactions for each of the channels:
data['Channel'].value_counts()
Split the data into training and testing sets:
target = 'Channel'
X = data.drop(['Channel'], axis=1)
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.20, random_state=123, stratify=y)
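Because we passed stratify=y, each channel should appear in roughly the same proportion in the training and test sets. As an optional sanity check (not part of the original solution), you can compare the class proportions:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))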
Fit a random forest classifier and store the model in a clf_random variable:
clf_random = RandomForestClassifier(n_estimators=20, max_depth=None, min_samples_split=7, random_state=0)
clf_random.fit(X_train, y_train)
Predict on the test data and store the predictions in y_pred:
y_pred=clf_random.predict(X_test)
Compute the macro- and micro-averaged precision, recall, and F1 score:
precision_recall_fscore_support(y_test, y_pred, average='macro')
precision_recall_fscore_support(y_test, y_pred, average='micro')
Each call returns a (precision, recall, F1 score, support) tuple. You should get approximately (0.891, 0.891, 0.891, None) for both the macro- and micro-averages.
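The macro average is the unweighted mean of the per-class scores, while the micro average pools the true positives, false positives, and false negatives across all classes before computing the scores. As an illustrative sketch (not part of the original solution), you can reproduce the macro values by averaging the per-class results returned with average=None:
# Per-class precision, recall, F1, and support (one value per channel)
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred, average=None)
# The macro average is simply the mean of the per-class values
print(np.mean(precision), np.mean(recall), np.mean(f1))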
Print the classification report:
target_names = ["Retail","RoadShow","SocialMedia","Televison"] print(classification_report(y_test, y_pred,target_names=target_names))
Plot the confusion matrix:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=target_names, columns=target_names)
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
From this activity, we can conclude that our random forest model was able to predict the most effective marketing channel from customers' annual spend data with an accuracy of approximately 89%.
Activity 19: Dealing with Imbalanced Data
Import all the necessary libraries:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from collections import Counter
Read the dataset into a pandas DataFrame named bank and look at the first few rows of the data:
bank = pd.read_csv('bank.csv', sep=';')
bank.head()
Rename the y column as Target:
bank = bank.rename(columns={ 'y': 'Target' })
Replace the no value with 0 and yes with 1:
bank['Target']=bank['Target'].replace({'no': 0, 'yes': 1})
Check the shape and missing values in the data:
bank.shape
bank.isnull().values.any()
Use the describe function to check the continuous and categorical values:
bank.describe()
bank.describe(include=['O'])
Check the count of the class labels present in the target variable:
bank['Target'].value_counts()
Use the cat.codes function to encode the job, marital, default, housing, loan, contact, and poutcome columns:
bank["job"] = bank["job"].astype('category').cat.codes bank["marital"] = bank["marital"].astype('category').cat.codes bank["default"] = bank["job"].astype('category').cat.codes bank["housing"] = bank["marital"].astype('category').cat.codes bank["loan"] = bank["loan"].astype('category').cat.codes bank["contact"] = bank["contact"].astype('category').cat.codes bank["poutcome"] = bank["poutcome"].astype('category').cat.codes
Since education and month are ordinal columns, convert them as follows:
bank['education'] = bank['education'].replace({'primary': 0, 'secondary': 1, 'tertiary': 2})
bank['month'] = bank['month'].replace(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
Check the bank data after conversion:
bank.head()
Split the data into training and testing sets using train_test_split, as follows:
target = 'Target'
X = bank.drop(['Target'], axis=1)
y = bank[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
Check the number of classes in y_train and y_test:
print(sorted(Counter(y_train).items()))
print(sorted(Counter(y_test).items()))
Use StandardScaler to transform the X_train and X_test data, and assign the results to the X_train_sc and X_test_sc variables. Note that the scaler is fit only on the training data and then applied to the test data to avoid information leakage:
standard_scalar = StandardScaler()
X_train_sc = standard_scalar.fit_transform(X_train)
X_test_sc = standard_scalar.transform(X_test)
Call the random forest classifier with parameters n_estimators=20, max_depth=None, min_samples_split=7, and random_state=0:
clf_random = RandomForestClassifier(n_estimators=20, max_depth=None, min_samples_split=7, random_state=0)
Fit the random forest model:
clf_random.fit(X_train_sc,y_train)
Predict on the test data using the random forest model:
y_pred=clf_random.predict(X_test_sc)
Get the classification report:
target_names = ['No', 'Yes']
print(classification_report(y_test, y_pred, target_names=target_names))
Get the confusion matrix:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=['No', 'Yes'], columns=['No', 'Yes'])
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
Use SMOTE() to oversample the minority class in X_train and y_train. Assign the results to the X_resampled and y_resampled variables, respectively:
X_resampled, y_resampled = SMOTE().fit_resample(X_train,y_train)
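After resampling, the two classes should be balanced. As an optional check (not part of the original solution), you can print the resampled class counts:
print(sorted(Counter(y_resampled).items()))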
Fit standard_scalar on X_resampled and use it to transform both X_resampled and X_test. Assign the results to the X_train_sc_resampled and X_test_sc variables:
standard_scalar = StandardScaler()
X_train_sc_resampled = standard_scalar.fit_transform(X_resampled)
X_test_sc = standard_scalar.transform(X_test)
Fit the random forest classifier on X_train_sc_resampled and y_resampled:
clf_random.fit(X_train_sc_resampled,y_resampled)
Predict on X_test_sc:
y_pred=clf_random.predict(X_test_sc)
Generate the classification report:
target_names = ['No', 'Yes']
print(classification_report(y_test, y_pred, target_names=target_names))
Plot the confusion matrix:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=['No', 'Yes'], columns=['No', 'Yes'])
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
In this activity, our bank marketing data was highly imbalanced. Without any sampling technique, the model's accuracy is around 90%, but the recall for the Yes (term deposit) class is only about 32% and the macro-average score is about 65%. This implies that the model does not generalize well: most of the time, it misses potential customers who would subscribe to the term deposit.
On the other hand, when we used SMOTE, the model's accuracy dropped slightly to around 87%, but the recall for the Yes (term deposit) class rose to about 61% and the macro-average score to about 76%. This implies that the model generalizes better and, more than 60% of the time, detects potential customers who would subscribe to the term deposit.