Evaluating the model’s effectiveness
Accuracy and loss alone are not enough to judge the model’s effectiveness. In general, accuracy is a good performance indicator when the dataset is balanced, but it does not reveal the strengths and weaknesses of our model. For instance, which classes does the model recognize with high confidence? What mistakes does it make most frequently?
In this recipe, we will judge the model’s effectiveness by visualizing the confusion matrix and evaluating the recall, precision, and F-score performance metrics.
Getting ready
To complete this recipe, we must familiarize ourselves with the confusion matrix and the alternative performance metrics that are crucial for evaluating the model’s effectiveness. Let’s start with the confusion matrix in the following subsection.
Evaluating the performance with the confusion matrix
A confusion matrix is an NxN matrix reporting the number of correct and incorrect predictions on the test dataset, where N is the number of output classes.
For our binary classification model, where there are two output categories, we have a 2x2 matrix like the one in Figure 3.8:

Figure 3.8: A confusion matrix
The four values reported in the previous confusion matrix are as follows:
- True positive (TP): The number of predicted positive results that are actually positive
- True negative (TN): The number of predicted negative results that are actually negative
- False positive (FP): The number of predicted positive results that are actually negative
- False negative (FN): The number of predicted negative results that are actually positive
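To make these four definitions concrete, the following minimal sketch counts them for a small set of made-up labels (purely illustrative, not the recipe’s data):

import numpy as np

# Made-up labels for illustration only (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, actually positive
TN = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, actually negative
FP = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, actually negative
FN = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, actually positive

print(TP, TN, FP, FN)  # 3 3 1 1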
Ideally, we would like to have 100% accuracy, defined as the ratio of correctly predicted instances (both positive and negative) to the total number of instances in the dataset:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
The preceding formula implies that the confusion matrix’s gray cells (FN and FP) should be 0 to obtain 100% accuracy.
However, although accuracy is a valuable metric, it does not provide a complete picture of model performance. Therefore, the following subsections will present alternative performance metrics for assessing the model’s effectiveness.
Evaluating recall, precision, and F-score
The first performance metric we want to present is recall, which quantifies how many of all positive (“Yes”) samples we predicted correctly:

Recall = TP / (TP + FN)

Since every missed positive sample (an FN) lowers this ratio, recall should be as high as possible.
However, this metric does not account for misclassified negative samples (FPs). Hence, the model could be excellent at classifying positive samples but incapable of classifying negative ones.
For this reason, there is an alternative performance indicator that considers FPs. It is precision, which quantifies how many of the predicted positive classes (“Yes”) were actually positive:

Precision = TP / (TP + FP)

Therefore, as with recall, precision should be as high as possible.
If we are interested in evaluating both recall and precision simultaneously, the F-score metric is what we need. This metric combines recall and precision into a single value, their harmonic mean:

F-score = 2 × (Precision × Recall) / (Precision + Recall)

The higher the F-score, the better the model’s effectiveness.
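As a quick sanity check of these formulas, here is a minimal sketch that computes the three metrics for the toy counts used in the earlier illustration (TP = 3, TN = 3, FP = 1, FN = 1); the numbers are purely illustrative:

# Toy counts from the earlier illustration (not the recipe's results)
TP, TN, FP, FN = 3, 3, 1, 1

recall = TP / (TP + FN)                                  # 3 / 4 = 0.75
precision = TP / (TP + FP)                               # 3 / 4 = 0.75
f_score = 2 * precision * recall / (precision + recall)  # 0.75

print(recall, precision, f_score)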
How to do it…
Continue working in Colab and follow these steps to visualize the confusion matrix and calculate the recall, precision, and F-score metrics:
Step 1:
Use the trained model to predict the output classes of the test dataset:
y_test_pred = model.predict(x_test)
y_test_pred = (y_test_pred > 0.5).astype("int32")
The line y_test_pred = (y_test_pred > 0.5).astype("int32") binarizes the predicted values using a threshold of 0.5: if a predicted value is greater than 0.5, it is converted to 1; otherwise, it is converted to 0.
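As a small illustration of this thresholding step, here is a sketch with made-up probabilities shaped like the (N, 1) output of model.predict():

import numpy as np

# Made-up sigmoid outputs, for illustration only
probs = np.array([[0.12], [0.73], [0.50], [0.91]])
labels = (probs > 0.5).astype("int32")
print(labels.ravel())  # [0 1 0 1] - 0.50 is not strictly greater than 0.5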
Step 2:
Compute the confusion matrix with scikit-learn:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_test_pred)
The confusion matrix is obtained with the confusion_matrix() function from scikit-learn’s metrics module, which takes two arguments: the true labels of the test dataset (y_test) and the predicted labels (y_test_pred). The result is stored in the cm variable. Note that, in scikit-learn’s convention, the rows of cm correspond to the actual classes and the columns to the predicted classes.
Step 3:
Display the confusion matrix in a heatmap:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

index_names = ["Actual No Snow", "Actual Snow"]
column_names = ["Predicted No Snow", "Predicted Snow"]
df_cm = pd.DataFrame(cm, index=index_names, columns=column_names)
plt.figure(dpi=150)
sns.heatmap(df_cm, annot=True, fmt='d', cmap="Blues")
The previous code should produce a heatmap similar to the following one:

Figure 3.9: Confusion matrix obtained with the test dataset
The confusion matrix shows that the samples are mainly distributed along the leading diagonal and that there are more FPs than FNs. Therefore, although the network is suitable for detecting snow, we should expect some false alarms.
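As a side note, if seaborn is not available, scikit-learn’s ConfusionMatrixDisplay can plot a similar chart directly from the cm variable computed in Step 2. This is an optional alternative, not part of the recipe’s code:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Optional alternative to the seaborn heatmap (reuses cm from Step 2);
# label order follows the matrix rows/columns: 0 = "No Snow", 1 = "Snow"
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["No Snow", "Snow"])
disp.plot(cmap="Blues", values_format="d")
plt.show()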
Step 4:
Calculate the recall, precision, and F-score performance metrics:
# Extract the four entries of the confusion matrix
# (scikit-learn convention: rows = actual class, columns = predicted class)
TN = cm[0][0]  # actual "No Snow" predicted as "No Snow"
TP = cm[1][1]  # actual "Snow" predicted as "Snow"
FN = cm[1][0]  # actual "Snow" predicted as "No Snow"
FP = cm[0][1]  # actual "No Snow" predicted as "Snow"

# Compute the performance metrics from the four counts
accur = (TP + TN) / (TP + TN + FN + FP)
precis = TP / (TP + FP)
recall = TP / (TP + FN)
f_score = (2 * recall * precis) / (recall + precis)
print("Accuracy: ", round(accur, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precis, 3))
print("F-score: ", round(f_score, 3))
The preceding code prints the performance metrics on the output console, resulting in an output similar to what is shown in the following screenshot:

Figure 3.10: Precision, recall, and F-score results
Based on the results reported in the preceding screenshot, which might differ slightly from yours, we can observe that the model has a high recall of 0.923, indicating that it correctly identifies most actual snowfall instances. However, the precision of 0.818 is comparatively lower, meaning the model may produce some false alarms.
The F-score of 0.867 confirms a good balance between recall and precision, indicating that the model can reliably predict snow instances from the given input features.
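As an optional cross-check of these manually computed values, scikit-learn’s classification_report() summarizes precision, recall, and F1-score per class. Assuming y_test and y_test_pred are still available from Step 1, the row for the “Snow” class should match the numbers above:

from sklearn.metrics import classification_report

# Per-class summary of precision, recall, and F1-score (class 1 = "Snow")
print(classification_report(y_test, y_test_pred,
                            target_names=["No Snow", "Snow"]))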
There’s more…
In this recipe, we learned how to assess the model’s effectiveness by visualizing the confusion matrix and evaluating the recall, precision, and F-score metrics.
However, scikit-learn is not the only way to compute the confusion matrix; TensorFlow provides an equivalent function, tf.math.confusion_matrix(). To delve deeper into this topic, we recommend referring to the TensorFlow documentation at the following link: https://www.tensorflow.org/versions/r2.13/api_docs/python/tf/math/confusion_matrix.
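For reference, here is a minimal sketch of the TensorFlow alternative, assuming y_test and y_test_pred are still available from the previous steps:

import tensorflow as tf

# Equivalent computation with TensorFlow; the result uses the same
# rows = actual / columns = predicted layout as scikit-learn
cm_tf = tf.math.confusion_matrix(labels=y_test,
                                 predictions=y_test_pred.flatten(),
                                 num_classes=2)
print(cm_tf.numpy())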
After evaluating the model’s effectiveness, model quantization is the only step separating us from starting the model deployment on the microcontroller.
In the upcoming recipe, we will compress the trained model by quantizing it to 8-bit using the TensorFlow Lite converter.