So far, we have covered our first machine learning classifier and evaluated its performance in depth by prediction accuracy. Beyond accuracy, there are several measurements that give us more insight and help us avoid the effects of class imbalance.
A confusion matrix summarizes testing instances by their predicted values and true values, presented as a contingency table:
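For a binary classifier such as ours, it takes the standard 2 × 2 layout sketched below, with true classes along the rows and predicted classes along the columns (the convention scikit-learn follows); the row and column headers here are illustrative:

                        Predicted negative      Predicted positive
    True negative       True negative (TN)      False positive (FP)
    True positive       False negative (FN)     True positive (TP)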
To illustrate, we compute the confusion matrix of our Naive Bayes classifier. Here, the confusion_matrix function from scikit-learn is used, but it is very easy to code it ourselves, as the sketch after the output shows:
>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(Y_test, prediction, labels=[0, 1])
array([[1098,   93],
       [  43,  473]])
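As mentioned, counting the cells by hand is straightforward. Below is a minimal sketch (the helper name compute_confusion_matrix is our own, not part of scikit-learn) that tallies each pair of true and predicted labels; it should reproduce the same array:
>>> import numpy as np
>>> def compute_confusion_matrix(y_true, y_pred, labels):
...     # Map each class label to its row/column index
...     index = {label: i for i, label in enumerate(labels)}
...     matrix = np.zeros((len(labels), len(labels)), dtype=int)
...     # Rows hold true labels, columns hold predicted labels,
...     # matching scikit-learn's convention
...     for true, pred in zip(y_true, y_pred):
...         matrix[index[true], index[pred]] += 1
...     return matrix
...
>>> compute_confusion_matrix(Y_test, prediction, labels=[0, 1])
array([[1098,   93],
       [  43,  473]])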
Note that we consider 1, the spam class, to be positive. From the confusion matrix, for example, there are 93 false positive cases (where the model misinterprets a legitimate email as spam) and 43 false negative cases (where it fails to detect a spam email).
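If we need the four counts individually, one common pattern is to flatten the matrix with ravel(); since the rows hold the true labels and we pass labels=[0, 1], the values come out in the order tn, fp, fn, tp:
>>> tn, fp, fn, tp = confusion_matrix(
...     Y_test, prediction, labels=[0, 1]).ravel()
>>> print(tn, fp, fn, tp)
1098 93 43 473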