Chapter 9: Modeling Customer Choice
Activity 18: Performing Multiclass Classification and Evaluating Performance
Import pandas, numpy, RandomForestClassifier, train_test_split, classification_report, confusion_matrix, accuracy_score, metrics, seaborn, matplotlib, and precision_recall_fscore_support:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support
import matplotlib.pyplot as plt
import seaborn as sns
Load the marketing data using pandas:
data = pd.read_csv(r'MarketingData.csv')
data.head(5)
Check the shape, the missing values, and show the summary report of the data:
data.shape
The shape should be (20000,7). Check for missing values:
data.isnull().values.any()
This will return False as there are no null values in the data. See the summary report of the data using the describe function:
data.describe()
Check the target variable, Channel, for the number of transactions for each of the channels:
data['Channel'].value_counts()
Split the data into training and testing sets:
target = 'Channel'
X = data.drop(['Channel'], axis=1)
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.20, random_state=123, stratify=y)
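Because we passed stratify=y, each channel should appear in roughly the same proportion in the training and test sets. As an optional sanity check (not part of the original solution), you can compare the class proportions:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))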
Fit a random forest classifier and store the model in a clf_random variable:
clf_random = RandomForestClassifier(n_estimators=20, max_depth=None, min_samples_split=7, random_state=0)
clf_random.fit(X_train, y_train)
Predict on the test data and store the predictions in y_pred:
y_pred=clf_random.predict(X_test)
Compute the macro- and micro-averaged precision, recall, and F1 score:
precision_recall_fscore_support(y_test, y_pred, average='macro')
precision_recall_fscore_support(y_test, y_pred, average='micro')
Each call returns a (precision, recall, F1 score, support) tuple. You should get approximately (0.891, 0.891, 0.891, None) for both the macro- and micro-averages.
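The macro average is the unweighted mean of the per-class scores, while the micro average pools the true positives, false positives, and false negatives across all classes before computing the scores. As an illustrative sketch (not part of the original solution), you can reproduce the macro values by averaging the per-class results returned with average=None:
# Per-class precision, recall, F1, and support (one value per channel)
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred, average=None)
# The macro average is simply the mean of the per-class values
print(np.mean(precision), np.mean(recall), np.mean(f1))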
Print the classification report:
target_names = ["Retail","RoadShow","SocialMedia","Televison"] print(classification_report(y_test, y_pred,target_names=target_names))
Plot the confusion matrix:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=target_names, columns=target_names)
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
From this activity, we can conclude that our random forest model was able to predict the most effective marketing channel from customers' annual spend data with an accuracy of approximately 89%.
Activity 19: Dealing with Imbalanced Data
Import all the necessary libraries:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from collections import Counter
Read the dataset into a pandas DataFrame named bank and look at the first few rows of the data:
bank = pd.read_csv('bank.csv', sep=';')
bank.head()
Rename the y column as Target:
bank = bank.rename(columns={ 'y': 'Target' })
Replace the no value with 0 and yes with 1:
bank['Target']=bank['Target'].replace({'no': 0, 'yes': 1})
Check the shape and missing values in the data:
bank.shape
bank.isnull().values.any()
Use the describe function to check the continuous and categorical values:
bank.describe()
bank.describe(include=['O'])
Check the count of the class labels present in the target variable:
bank['Target'].value_counts()
Use the cat.codes function to encode the job, marital, default, housing, loan, contact, and poutcome columns:
bank["job"] = bank["job"].astype('category').cat.codes bank["marital"] = bank["marital"].astype('category').cat.codes bank["default"] = bank["job"].astype('category').cat.codes bank["housing"] = bank["marital"].astype('category').cat.codes bank["loan"] = bank["loan"].astype('category').cat.codes bank["contact"] = bank["contact"].astype('category').cat.codes bank["poutcome"] = bank["poutcome"].astype('category').cat.codes
Since education and month are ordinal columns, convert them as follows:
bank['education'] = bank['education'].replace({'primary': 0, 'secondary': 1, 'tertiary': 2})
bank['month'] = bank['month'].replace(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
Check the bank data after conversion:
bank.head()
Split the data into training and testing sets using train_test_split, as follows:
target = 'Target'
X = bank.drop(['Target'], axis=1)
y = bank[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
Check the number of classes in y_train and y_test:
print(sorted(Counter(y_train).items()))
print(sorted(Counter(y_test).items()))
Use StandardScaler to transform the X_train and X_test data, and assign the results to the X_train_sc and X_test_sc variables. Note that the scaler is fit only on the training data and then applied to the test data to avoid information leakage:
standard_scalar = StandardScaler()
X_train_sc = standard_scalar.fit_transform(X_train)
X_test_sc = standard_scalar.transform(X_test)
Call the random forest classifier with parameters n_estimators=20, max_depth=None, min_samples_split=7, and random_state=0:
clf_random = RandomForestClassifier(n_estimators=20, max_depth=None, min_samples_split=7, random_state=0)
Fit the random forest model:
clf_random.fit(X_train_sc,y_train)
Predict on the test data using the random forest model:
y_pred=clf_random.predict(X_test_sc)
Get the classification report:
target_names = ['No', 'Yes']
print(classification_report(y_test, y_pred, target_names=target_names))
Get the confusion matrix:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=['No', 'Yes'], columns=['No', 'Yes'])
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
Use SMOTE() to oversample the minority class in X_train and y_train. Assign the results to the X_resampled and y_resampled variables, respectively:
X_resampled, y_resampled = SMOTE().fit_resample(X_train,y_train)
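After resampling, the two classes should be balanced. As an optional check (not part of the original solution), you can print the resampled class counts:
print(sorted(Counter(y_resampled).items()))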
Fit standard_scalar on X_resampled and use it to transform both X_resampled and X_test. Assign the results to the X_train_sc_resampled and X_test_sc variables:
standard_scalar = StandardScaler()
X_train_sc_resampled = standard_scalar.fit_transform(X_resampled)
X_test_sc = standard_scalar.transform(X_test)
Fit the random forest classifier on X_train_sc_resampled and y_resampled:
clf_random.fit(X_train_sc_resampled,y_resampled)
Predict on X_test_sc:
y_pred=clf_random.predict(X_test_sc)
Generate the classification report:
target_names = ['No', 'Yes']
print(classification_report(y_test, y_pred, target_names=target_names))
Plot the confusion matrix:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=['No', 'Yes'], columns=['No', 'Yes'])
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
In this activity, our bank marketing data was highly imbalanced. Without any sampling technique, the model's accuracy is around 90%, but the recall for the Yes (term deposit) class is only about 32% and the macro-average score is about 65%. This implies that the model does not generalize well: most of the time, it misses potential customers who would subscribe to the term deposit.
On the other hand, when we used SMOTE, the model's accuracy dropped slightly to around 87%, but the recall for the Yes (term deposit) class rose to about 61% and the macro-average score to about 76%. This implies that the model generalizes better and, more than 60% of the time, detects potential customers who would subscribe to the term deposit.