Chapter 8: Fine-Tuning Classification Algorithms
Activity 15: Implementing Different Classification Algorithms
Import the logistic regression library:
from sklearn.linear_model import LogisticRegression
Fit the model:
clf_logistic = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train[top7_features], y_train)
clf_logistic
Score the model:
clf_logistic.score(X_test[top7_features], y_test)
Import the svm library:
from sklearn import svm
Fit the model:
clf_svm = svm.SVC(kernel='linear', C=1)
clf_svm.fit(X_train[top7_features], y_train)
Score the model:
clf_svm.score(X_test[top7_features], y_test)
Import the decision tree library:
from sklearn import tree
Fit the model:
clf_decision = tree.DecisionTreeClassifier()
clf_decision.fit(X_train[top7_features], y_train)
Score the model:
clf_decision.score(X_test[top7_features], y_test)
Import a random forest library:
from sklearn.ensemble import RandomForestClassifier
Fit the model:
clf_random = RandomForestClassifier(n_estimators=20, max_depth=None, min_samples_split=7, random_state=0)
clf_random.fit(X_train[top7_features], y_train)
Score the model:
clf_random.score(X_test[top7_features], y_test)
From the results, you can conclude that the random forest outperformed the other algorithms, while the decision tree had the lowest accuracy. A later section explains why accuracy alone is not a reliable measure of a model's performance.
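To see why accuracy alone can mislead, consider a minimal sketch with made-up labels (not the chapter's churn data). It assumes a hypothetical baseline that always predicts No Churn on a set where 10% of customers churn, and compares its accuracy with its recall on the churn class.

```python
# Hypothetical labels: 1 = Churn, 0 = No Churn (illustrative, not the chapter's data).
y_true = [1] * 10 + [0] * 90

# A trivial baseline that always predicts the majority class (No Churn).
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the churn class: fraction of actual churners that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_churn = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")            # accuracy = 0.90
print(f"recall (Churn) = {recall_churn:.2f}")  # recall (Churn) = 0.00
```

The baseline looks strong on accuracy while finding none of the churners, which is the kind of failure that precision, recall, and AUC are meant to expose.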
Activity 16: Tuning and Optimizing the Model
Store five of the seven features (Avg_Calls_Weekdays, Current_Bill_Amt, Avg_Calls, Account_Age, and Avg_Days_Delinquent) in a variable top5_features. Store the other two features, Percent_Increase_MOM and Complaint_Code, in a variable top2_features.
from sklearn import preprocessing
## Features to transform
top5_features = ['Avg_Calls_Weekdays', 'Current_Bill_Amt', 'Avg_Calls', 'Account_Age', 'Avg_Days_Delinquent']
## Features left as they are
top2_features = ['Percent_Increase_MOM', 'Complaint_Code']
Use StandardScaler to standardize the five features.
scaler = preprocessing.StandardScaler().fit(X_train[top5_features])
X_train_scalar = pd.DataFrame(scaler.transform(X_train[top5_features]), columns=X_train[top5_features].columns)
Create a variable X_train_scalar_combined that combines the five standardized features with the two features (Percent_Increase_MOM and Complaint_Code) that were not standardized.
X_train_scalar_combined=pd.concat([X_train_scalar, X_train[top2_features].reset_index(drop=True)], axis=1, sort=False)
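Apply the same scaler, already fitted on the training data, to the test data, and combine the result with the two unscaled features (X_test_scalar_combined).
X_test_scalar = pd.DataFrame(scaler.transform(X_test[top5_features]), columns=X_test[top5_features].columns)
X_test_scalar_combined = pd.concat([X_test_scalar, X_test[top2_features].reset_index(drop=True)], axis=1, sort=False)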
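As a rough illustration of what the scaler does, the sketch below standardizes a single hypothetical feature by hand. It assumes the usual z-score definition with the population standard deviation, which is what StandardScaler computes; the values and the helper name standardize are made up for this example. The point to notice is that the test values are transformed with the mean and standard deviation learned from the training values, not with their own.

```python
from statistics import mean, pstdev

# Hypothetical values of one feature (illustrative, not the chapter's data).
train_values = [10.0, 20.0, 30.0, 40.0]
test_values = [25.0, 50.0]

# "Fit": learn the statistics from the training data only.
mu = mean(train_values)       # 25.0
sigma = pstdev(train_values)  # population standard deviation (ddof=0)

def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously learned statistics."""
    return [(v - mu) / sigma for v in values]

# "Transform": reuse the training statistics for both sets.
train_scaled = standardize(train_values, mu, sigma)
test_scaled = standardize(test_values, mu, sigma)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

After the transform, the training values have mean 0 and standard deviation 1, while the test values generally do not, because they are measured on the training set's scale.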
Fit the random forest model.
clf_random.fit(X_train_scalar_combined, y_train)
Score the random forest model.
clf_random.score(X_test_scalar_combined, y_test)
Import the library for grid search and use the given parameters:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
parameters = [{'min_samples_split': [4,5,7,9,10],
               'n_estimators': [10,20,30,40,50,100,150,160,200,250,300],
               'max_depth': [2,5,7,10]}]
Use grid search CV with stratified k-fold to find out the best parameters.
clf_random_grid = GridSearchCV(RandomForestClassifier(), parameters, cv=StratifiedKFold(n_splits=10))
clf_random_grid.fit(X_train_scalar_combined, y_train)
Print the best score and best parameters.
print('best score train:', clf_random_grid.best_score_)
print('best parameters train: ', clf_random_grid.best_params_)
Score the model using the test data.
clf_random_grid.score(X_test_scalar_combined, y_test)
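To get a feel for how much work this grid search involves, the sketch below enumerates the same parameter grid with itertools.product and counts the resulting model fits. It only does the arithmetic and does not train anything; the variable names candidates, n_candidates, and n_fits are introduced here for illustration.

```python
from itertools import product

# The same grid as in the activity, written as a plain dictionary.
parameters = {
    'min_samples_split': [4, 5, 7, 9, 10],
    'n_estimators': [10, 20, 30, 40, 50, 100, 150, 160, 200, 250, 300],
    'max_depth': [2, 5, 7, 10],
}
n_splits = 10  # number of stratified folds used for cross-validation

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(parameters, values))
              for values in product(*parameters.values())]

n_candidates = len(candidates)     # 5 * 11 * 4 = 220
n_fits = n_candidates * n_splits   # each candidate is fitted once per fold

print(n_candidates, n_fits)        # 220 2200
print(candidates[0])
```

With the default refit behavior, the search also fits the best candidate once more on the full training set, so expect this step to take noticeably longer than fitting a single forest.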
Activity 17: Comparison of the Models
Import the required libraries.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics
Fit the random forest classifier with the parameters obtained from grid search.
clf_random_grid = RandomForestClassifier(n_estimators=100, max_depth=7, min_samples_split=10, random_state=0)
clf_random_grid.fit(X_train_scalar_combined, y_train)
Predict on the standardized test data, X_test_scalar_combined.
y_pred=clf_random_grid.predict(X_test_scalar_combined)
Print the classification report.
target_names = ['No Churn', 'Churn']
print(classification_report(y_test, y_pred, target_names=target_names))
Plot the confusion matrix.
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=['No Churn','Churn'], columns=['No Churn','Churn'])
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
plt.ylabel('True Values')
plt.xlabel('Predicted Values')
plt.show()
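The numbers in the classification report can be derived directly from the four cells of the confusion matrix. The sketch below shows the arithmetic for the Churn class using hypothetical counts (they are not the activity's results), laid out as in the heatmap: rows are true values, columns are predicted values, in the order No Churn, Churn.

```python
# Hypothetical confusion matrix (illustrative counts, not the activity's output).
# Rows: true class, columns: predicted class, order: [No Churn, Churn].
cm = [[90, 10],
      [20, 30]]

tn, fp = cm[0]  # true No Churn: predicted No Churn / predicted Churn
fn, tp = cm[1]  # true Churn:    predicted No Churn / predicted Churn

# Of everything predicted as Churn, how much really churned?
precision = tp / (tp + fp)
# Of everything that really churned, how much was caught?
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Overall fraction of correct predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision = {precision:.2f}")  # precision = 0.75
print(f"recall    = {recall:.2f}")     # recall    = 0.60
print(f"f1        = {f1:.2f}")         # f1        = 0.67
print(f"accuracy  = {accuracy:.2f}")   # accuracy  = 0.80
```

In this made-up case the accuracy of 0.80 hides the fact that 40% of churners are missed, which is why the report is worth reading alongside the heatmap title.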
Import the functions for the ROC curve and AUC.
from sklearn.metrics import roc_curve,auc
Use the classifiers that were created in the previous activities, that is, clf_logistic, clf_svm, clf_decision, and clf_random_grid. Create a list of dictionaries, one for each model, holding its label and the classifier.
models = [
    {'label': 'Logistic Regression', 'model': clf_logistic},
    {'label': 'SVM', 'model': clf_svm},
    {'label': 'Decision Tree', 'model': clf_decision},
    {'label': 'Random Forest Grid Search', 'model': clf_random_grid},
]
Plot the ROC curve.
for m in models:
    model = m['model']
    model.fit(X_train_scalar_combined, y_train)
    y_pred = model.predict(X_test_scalar_combined)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=1)
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label='%s AUC = %0.2f' % (m['label'], roc_auc))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.ylabel('Sensitivity(True Positive Rate)')
plt.xlabel('1-Specificity(False Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
Comparing the AUC results of the different algorithms (logistic regression: 0.78, SVM: 0.79, decision tree: 0.77, and random forest: 0.82), we can conclude that random forest is the best-performing model, with an AUC score of 0.82, and can be chosen for the marketing team to predict customer churn.
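One detail worth keeping in mind when reading these AUC values: the ROC loop passes the hard class predictions from predict to roc_curve, so each curve has only one interior point. The sketch below, in plain Python with made-up labels and scores, compares the AUC obtained from continuous scores with the AUC obtained from the same scores thresholded at 0.5. The helper auc_from_scores is a hypothetical name; it uses the rank interpretation of AUC (the probability that a random positive is scored above a random negative, with ties counted as one half), which matches the trapezoidal area under the ROC curve.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted churn probabilities (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.4, 0.6, 0.3, 0.55, 0.7, 0.9]

# Hard class labels, as a predict-style call would return with a 0.5 threshold.
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_scores = auc_from_scores(y_true, proba)
auc_hard = auc_from_scores(y_true, hard)

print(auc_scores)  # 0.8125
print(auc_hard)    # 0.75
```

With hard labels, the AUC reduces to the average of the true positive rate and the true negative rate at that single threshold. In scikit-learn, continuous scores would typically come from predict_proba or decision_function (for SVC, predict_proba is only available when the model is created with probability=True); whether using them would change the ranking of the four models here is not something this sketch can show.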