Concepts to analyze model performance and reliability
Analyzing the performance and reliability of our machine-learning model is an important development step and should be done before deploying the model to production. There are several metrics that you can use for this analysis, depending on the specific task and problem you are trying to solve. In this section, we will cover some of these metrics, focusing on the ones that Qlik tools use.
Regression model scoring
The following concepts can be used to score and verify regression models. Regression models predict outcomes as a number, indicating the model’s best estimate of the target variable. We will learn more about regression models in Chapter 2.
R2 (R-squared)
R-squared is a statistical measure that represents the proportion of the variance in a dependent variable that is explained by an independent variable (or variables) in a regression model. In other words, it measures the goodness of fit of a regression model to the data.
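R-squared typically ranges from 0 to 1, where 0 indicates that the model does not explain any of the variability in the dependent variable, and 1 indicates that the model perfectly explains all the variability in the dependent variable. On data that the model was not fitted to, R-squared can even be negative, which means that the model performs worse than simply predicting the mean of the dependent variable.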
R-squared is an important measure of the quality of a regression model. A high R-squared value indicates that the model fits the data well and that the independent variable(s) have a strong relationship with the dependent variable. A low R-squared value indicates that the model does not fit the data well and that the independent variable(s) do not have a strong relationship with the dependent variable. However, it is important to note that a high R-squared value does not necessarily mean that the model is the best possible model, so other factors such as overfitting should also be taken into consideration when evaluating the performance of a model. R-squared is often used together with other metrics and it should be interpreted in the context of the problem. The formula for R-squared is the following:
$$R^2 = \frac{\text{Variance explained by the model}}{\text{Total variance}}$$
R-squared has some limitations. It cannot be used to check whether the predictions are biased, and on its own it doesn’t tell us whether the regression model has an adequate fit. Bias refers to systematic errors in predictions. To check for bias, you should analyze the residuals (differences between observed and predicted values) or use a signed metric such as Mean Bias Deviation (MBD). Mean Absolute Error (MAE) measures the size of the errors but, because it discards their sign, cannot by itself reveal systematic over- or under-prediction. R-squared describes how much of the variance is explained, not whether the predictions are biased.
Sometimes it is better to utilize adjusted R-squared. Adjusted R-squared is a modified version of the standard R-squared used in regression analysis. We can use adjusted R-squared when dealing with multiple predictors to assess model fit, control overfitting, compare models with different predictors, and aid in feature selection. It accounts for the number of predictors, penalizing unnecessary complexity. However, it should be used alongside other evaluation metrics and domain knowledge for a comprehensive model assessment.
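As a minimal sketch of the two measures, the following example computes R-squared in its commonly used form, 1 minus the ratio of the residual sum of squares to the total sum of squares, and then applies the usual adjustment for the number of predictors. The function names, the sample values, and the choice of two predictors are illustrative assumptions, not part of any Qlik tool.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot, the share of variance explained."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Illustrative values only; assume the model used two predictors.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 9.5, 10.5]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=2)
print(f"R-squared: {r2:.3f}, adjusted R-squared: {adj_r2:.3f}")
```

The adjusted value is lower than the plain R-squared because the same fit is being credited to a model with more predictors relative to the number of samples.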
Root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE)
Root mean squared error describes the typical difference that can be expected between the predicted and actual values. It is the standard deviation of the residuals (prediction errors) and tells us how concentrated the data is around the “line of best fit.” It is a standard way to measure the error of a model when predicting quantitative data. RMSE is always measured in the same unit as the target value.
As an example of RMSE, if we have a model that predicts house value in a certain area and we get an RMSE of 20,000, it means that the predicted value typically differs by about 20,000 USD from the actual value. Because the errors are squared before averaging, large errors contribute more to this figure than small ones.
Mean absolute error is defined as the average of the absolute prediction errors over all data points. In MAE, all errors are weighted equally, so the score increases linearly with the size of the error. MAE is never negative since we are using the absolute value of each error. MAE is useful when the errors are symmetrically distributed and there are no significant outliers.
Mean squared error is the average of the squared differences between the predicted and actual values. Squaring eliminates the negative values and ensures that MSE is always positive or 0. The smaller the MSE, the closer the predictions are to the actual values. RMSE can be calculated from MSE: it is the square root of MSE.
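The three definitions can be sketched in a few lines of plain Python. The function names and the house prices below are made up for illustration; they only show how the metrics relate to one another.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of MSE, expressed in the unit of the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical house values in USD.
actual = [200_000, 310_000, 150_000, 420_000]
predicted = [210_000, 290_000, 160_000, 400_000]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE: {mae:,.0f}  MSE: {mse:,.0f}  RMSE: {rmse:,.0f}")
```

MAE and RMSE can be read directly in USD, whereas MSE is expressed in squared USD and is therefore harder to interpret on its own.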
When to use the above metrics in practice
MAE is robust to outliers and provides a straightforward interpretation of the average error magnitude.
MSE penalizes large errors more heavily and is suitable when large errors are particularly undesirable. Because the errors are squared, outliers have a strong impact on this metric.
RMSE is similar to MSE but provides a more interpretable error metric in the same units as the target variable.
The choice between these metrics should align with your specific problem and objectives. It’s also good practice to consider the nature of your data and the impact of outliers when selecting an error metric. Additionally, you can use these metrics in conjunction with other evaluation techniques to get a comprehensive view of your model’s performance.
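To see how differently the metrics react to outliers, the following sketch compares MAE and RMSE on two hypothetical sets of prediction errors that differ only in a single large error. The helper names and the numbers are assumptions chosen for illustration.

```python
import math


def mae(errors):
    """Mean absolute error computed directly from prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error computed directly from prediction errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


typical = [2, -2, 2, -2, 2]        # all errors have the same size
with_outlier = [2, -2, 2, -2, 20]  # the last error is an outlier

for name, errors in [("typical", typical), ("with outlier", with_outlier)]:
    print(f"{name}: MAE={mae(errors):.2f}  RMSE={rmse(errors):.2f}")
```

With equally sized errors the two metrics agree, but once the outlier is present RMSE grows much faster than MAE, because the single large error dominates the squared sum.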
Multiclass classification scoring and binary classification scoring
The following concepts can be used to score and verify multiclass and binary models. Binary classification models distribute outcomes into two categories, typically denoted as Yes or No. Multiclass classification models are similar, but there are more than two categories as an outcome. We will learn more about both models in Chapter 2.
Recall
Recall measures the percentage of correctly classified positive instances over the total number of actual positive instances. In other words, recall represents the ability of a model to correctly capture all positive instances.
Recall is calculated as follows:
$$\text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$
A high recall indicates that the model is able to accurately capture all positive instances and has a low rate of false negatives. On the other hand, a low recall indicates that the model is missing many positive instances, resulting in a high rate of false negatives.
Precision
Precision measures the percentage of correctly classified positive instances over the total number of predicted positive instances. In other words, precision represents the ability of the model to correctly identify positive instances.
Precision is calculated as follows:
$$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}$$
A high precision indicates that the model is able to accurately identify positive instances and has a low rate of false positives. On the other hand, a low precision indicates that the model is incorrectly classifying many instances as positive, resulting in a high rate of false positives.
Precision is particularly useful in situations where false positives are costly or undesirable, such as in medical diagnosis or fraud detection. Precision should be used in conjunction with other metrics, such as recall and F1 score, to get a more complete picture of the model’s performance.
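The recall and precision formulas translate directly into code. In this sketch the counts are invented, and returning 0.0 when a denominator is zero is a convention we assume here, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention for 0/0


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # assumed convention for 0/0


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
tp, fp, fn = 80, 20, 40
precision_value = precision(tp, fp)
recall_value = recall(tp, fn)
print(f"Precision: {precision_value:.2f}  Recall: {recall_value:.2f}")
```

In this example most of the positive predictions are right (high precision), but a third of the actual positives are missed (lower recall).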
F1 score
The F1 score is defined as the harmonic mean of precision and recall, and it ranges from 0 to 1, with higher values indicating better performance. The formula for F1 score is as follows:
$$\text{F1 score} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}}$$
The F1 score gives equal importance to both precision and recall, making it a useful metric for evaluating models when the distribution of positive and negative instances is uneven. A high F1 score indicates that the model has a good balance between precision and recall for the positive class. Note that true negatives do not enter the calculation, so the F1 score says little about how well the negative instances are classified.
When dealing with highly imbalanced datasets where one class greatly outnumbers the other, the F1 score tends to be lower, because a rare positive class is harder to identify with both high precision and high recall. Being aware of this connection can assist in interpreting F1 scores within the context of particular data distributions and problem domains. If the F1 value is high, both precision and recall are high as well; if it is low, there is a need for further analysis to find out which of the two is weak.
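The following sketch illustrates why the harmonic mean is used. It compares a model with balanced precision and recall to a hypothetical model with very high precision but very low recall; the values are invented, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both values are zero
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.80, 0.80)
lopsided = f1_score(0.99, 0.10)
arithmetic_mean = (0.99 + 0.10) / 2

print(f"Balanced model F1: {balanced:.2f}")
print(f"Lopsided model F1: {lopsided:.2f}")
print(f"Lopsided model arithmetic mean: {arithmetic_mean:.2f}")
```

The harmonic mean stays close to the weaker of the two values, so a model cannot reach a high F1 score by excelling at precision alone or recall alone, which a simple arithmetic mean would allow.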
Accuracy
Accuracy measures the percentage of correctly classified instances over the total number of instances. In other words, accuracy represents the ability of the model to correctly classify both positive and negative instances.
Accuracy is calculated in the following way:
$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{True positive} + \text{False positive} + \text{True negative} + \text{False negative}}$$
A high accuracy indicates that the model is able to accurately classify both positive and negative instances and has a low rate of false positives and false negatives. However, accuracy can be misleading in situations where the distribution of positive and negative instances is uneven. In such cases, other metrics such as precision, recall, and F1 score may provide a more accurate representation of the model’s performance.
The reason is that accuracy doesn’t take the class distribution into account: on an imbalanced dataset, where one class vastly outnumbers the others, it can be high even if the model predicts the majority class exclusively. Metrics such as precision, recall, F1 score, AUC-ROC, and AUC-PR give a more accurate evaluation of model performance because they focus on the correct identification of the minority class, which is often the class of interest in such datasets.
Example scenario
Suppose we are developing a machine-learning model to detect a rare disease that occurs in only 1% of the population. We collect a dataset of 10,000 patient records:
- 100 patients have the rare disease (positive class)
- 9,900 patients do not have the disease (negative class)
Now, let’s say our model predicts all 10,000 patients as not having the disease. Here’s what happens:
- True Positives (correctly predicted patients with the disease): 0
- False Positives (incorrectly predicted patients with the disease): 0
- True Negatives (correctly predicted patients without the disease): 9,900
- False Negatives (incorrectly predicted patients without the disease): 100
Using accuracy as our evaluation metric produces the following result:
$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{Total}} = \frac{9900}{10000} = 99\%$$
Our model appears to have an impressive 99% accuracy, which might lead to the misleading conclusion that it’s performing exceptionally well. However, it has completely failed to detect any cases of the rare disease (True Positives = 0), which is the most critical aspect of the problem.
In this example, accuracy doesn’t provide an accurate picture of the model’s performance because it doesn’t account for the severe class imbalance and the importance of correctly identifying the minority class (patients with the disease).
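The numbers of this scenario can be checked with a short sketch. The counts come from the example above; the variable names and the use of None for an undefined precision are our own choices.

```python
# Counts from the rare-disease scenario: the model predicts "no disease" for everyone.
tp, fp, tn, fn = 0, 0, 9_900, 100

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)
# Precision is undefined here because the model made no positive predictions.
precision = tp / (tp + fp) if (tp + fp) else None

print(f"Accuracy: {accuracy:.0%}")
print(f"Recall: {recall:.0%}")
print(f"Precision: {precision}")
```

Recall exposes what accuracy hides: none of the 100 patients with the disease were detected.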
Confusion matrix
A confusion matrix is a table used to evaluate the performance of a classification model. It displays the number of true positive, false positive, true negative, and false negative predictions made by the model for a set of test data.
The four elements in the confusion matrix represent the following:
- True positives (TP) are actual true values that were correctly predicted as true
- False positives (FP) are actual false values that were incorrectly predicted as true
- False negatives (FN) are actual true values that were incorrectly predicted as false
- True negatives (TN) are actual false values that were correctly predicted as false
Qlik AutoML presents a confusion matrix as part of the experiment view. Below the numbers in each quadrant, you can also see percentage values for the metrics recall (TP), fallout (FP), miss rate (FN), and specificity (TN).
An example of the confusion matrix of Qlik AutoML can be seen in the following figure:
Figure 1.4: Confusion matrix as seen in Qlik AutoML
By analyzing the confusion matrix, we can calculate various performance metrics such as accuracy, precision, recall, and F1 score, which can help us understand how well the model is performing on the test data. The confusion matrix can also help us identify any patterns or biases in the model’s predictions and adjust the model accordingly.
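As a rough illustration, the four quadrants and the per-quadrant rates mentioned above (recall, fallout, miss rate, and specificity) can be derived from a list of actual and predicted labels as follows. The labels are invented, 1 denotes the positive class, and the function name is hypothetical; this is not how Qlik AutoML computes its view internally.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, and TN for labels encoded as 1 (positive) and 0 (negative)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
rates = {
    "recall": tp / (tp + fn),       # true positive rate
    "miss rate": fn / (tp + fn),    # false negative rate
    "fallout": fp / (fp + tn),      # false positive rate
    "specificity": tn / (fp + tn),  # true negative rate
}

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Recall and miss rate add up to 100%, as do fallout and specificity, because each pair splits the actual positives or the actual negatives, respectively.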
Matthews Correlation Coefficient (MCC)
The Matthews Correlation Coefficient metric can be used to evaluate the performance of a binary classification model, particularly when dealing with imbalanced data.
MCC takes into account all four elements of the confusion matrix (true positives, false positives, true negatives, and false negatives) to provide a measure of the quality of a binary classifier’s predictions. It ranges between -1 and +1, with a value of +1 indicating perfect classification performance, 0 indicating a random classification, and -1 indicating complete disagreement between predicted and actual values.
The formula for MCC is as follows:
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}$$
MCC is particularly useful when dealing with imbalanced datasets where the number of positive and negative instances is not equal. It provides a better measure of classification performance than accuracy in such cases, since accuracy can be biased toward the majority class.
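The formula can be sketched as a small function. Returning 0.0 when the denominator is zero is a common convention that we assume here, and the counts are illustrative; the second call reuses the rare-disease scenario in which a model predicts the negative class for all 10,000 patients.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


perfect = mcc(tp=100, fp=0, tn=9_900, fn=0)
always_negative = mcc(tp=0, fp=0, tn=9_900, fn=100)
mixed = mcc(tp=3, fp=1, tn=5, fn=1)

print(f"Perfect model: {perfect:.2f}")
print(f"Always-negative model: {always_negative:.2f}")
print(f"Mixed model: {mixed:.2f}")
```

The always-negative model, which reaches 99% accuracy on that dataset, receives an MCC of 0 under this convention, reflecting that its predictions carry no information about the positive class.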
AUC and ROC curve
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classification model that allows us to evaluate and compare different models based on their ability to discriminate between positive and negative classes. AUC describes the area under the curve.
An ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. The TPR is the ratio of true positive predictions to the total number of actual positive instances, while the FPR is the ratio of false positive predictions to the total number of actual negative instances.
By varying the classification threshold, we can obtain different TPR and FPR pairs and plot them on the ROC curve. The area under the ROC curve (AUC-ROC) is used as a performance metric for binary classification models, with higher AUC-ROC indicating better performance.
A perfect classifier would have an AUC-ROC of 1.0, indicating that it has a high TPR and low FPR across all possible classification thresholds. A random classifier would have an AUC-ROC of 0.5, indicating that its TPR and FPR are equal and its performance is no better than chance.
The ROC curve and AUC-ROC are useful for evaluating and comparing binary classification models, especially when the positive and negative classes are imbalanced or when the cost of false positive and false negative errors is different.
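To make the construction concrete, the following sketch builds the ROC points by using each distinct score as a threshold and then estimates the area under the curve with the trapezoidal rule. The labels, scores, and function names are assumptions for illustration, and the code expects at least one positive and one negative instance.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct score used as a threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities for six instances (1 = positive class).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(f"AUC-ROC: {area:.3f}")
```

The resulting area can also be read as the share of positive-negative pairs in which the positive instance receives the higher score; here one of the nine pairs is ranked the wrong way round.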
The following figure represents an ROC curve as seen in Qlik AutoML. The figure shows a pretty good ROC curve: the closer the curve is to the top-left corner, and therefore the closer the area under it is to 1, the better the model. The dotted line represents 50:50 random chance.
Figure 1.5: ROC curve for a good model in Qlik AutoML
Threshold
In binary classification, a threshold is a value that is used to decide whether an instance should be classified as positive or negative by a model.
When a model makes a prediction, it generates a probability score between 0 and 1 that represents the likelihood of an instance belonging to the positive class. If the score is above a certain threshold value, the instance is classified as positive, and if it is below the threshold, it is classified as negative.
The choice of threshold can significantly impact the performance of a classification model. If the threshold is set too high, the model may miss many positive instances, leading to a low recall and high precision. Conversely, if the threshold is set too low, the model may classify many negative instances as positive, leading to a high recall and low precision.
Therefore, selecting an appropriate threshold for a classification model is important in achieving the desired balance between precision and recall. The optimal threshold depends on the specific application and the cost of false positive and false negative errors.
Qlik AutoML computes the precision and recall for hundreds of possible threshold values from 0 to 1. The threshold achieving the highest F1 score is chosen. Selecting the threshold in this way, rather than using a fixed value, makes the produced predictions more robust for imbalanced datasets.
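The idea of choosing a threshold by F1 score can be sketched as a simple sweep over candidate values. This is only an illustration of the principle with invented data and hypothetical function names; it is not the actual implementation used by Qlik AutoML.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 score when every score at or above the threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, scores, steps=100):
    """Try evenly spaced thresholds from 0 to 1 and keep the one with the highest F1."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))


# Imbalanced toy data: three positives among ten instances.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.35, 0.40, 0.60]

chosen = best_threshold(labels, scores)
print(f"Chosen threshold: {chosen:.2f}")
print(f"F1 at chosen threshold: {f1_at_threshold(labels, scores, chosen):.3f}")
print(f"F1 at 0.5: {f1_at_threshold(labels, scores, 0.5):.3f}")
```

On this data the positive instances receive fairly low scores, so a threshold of 0.5 would miss two of the three positives, while the swept threshold captures all of them at the cost of one false positive.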
Feature importance
Feature importance is a measure of the contribution of each input variable (feature) in a model to the output variable (prediction). It is a way to understand which features have the most impact on the model’s prediction, and which features can be ignored or removed without significantly affecting the model’s performance.
Feature importance can be computed using various methods, depending on the type of model used. Some common methods for calculating feature importance include the following:
- Permutation importance: This method involves shuffling the values of each feature in the test data, one at a time, and measuring the impact on the model’s performance. The features that cause the largest drop in performance when shuffled are considered more important.
- Feature importance from tree-based models: In decision tree-based models such as Random Forest or Gradient Boosting, feature importance can be calculated based on how much each feature decreases the impurity of the tree. The features that reduce impurity the most are considered more important.
- Coefficient magnitude: In linear models such as Linear Regression or Logistic Regression, feature importance can be calculated based on the magnitude of the coefficients assigned to each feature. The features with larger absolute coefficients are considered more important. This comparison is meaningful only when the features are on comparable scales, for example after standardization.
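The first of these methods, permutation importance, can be sketched without any machine-learning library. In the following example, the `model` function stands in for an already trained model and is purely hypothetical: it relies strongly on the first feature, weakly on the second, and ignores the third. Importance is measured as the increase in mean squared error after shuffling one feature.

```python
import random


def model(row):
    """Hypothetical trained model: strong use of feature 0, weak use of feature 1."""
    return 3.0 * row[0] + 0.5 * row[1]


def mse(rows, targets):
    """Mean squared error of the model on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, n_features, seed=0):
    """Increase in MSE when each feature column is shuffled, one at a time."""
    rng = random.Random(seed)
    baseline = mse(rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(shuffled_rows, targets) - baseline)
    return importances


# Synthetic hold-out data; the targets follow the model exactly for simplicity.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [model(r) for r in rows]

importances = permutation_importance(rows, targets, n_features=3)
for j, value in enumerate(importances):
    print(f"Feature {j}: increase in MSE = {value:.4f}")
```

Shuffling the third feature leaves the error unchanged because the model never uses it, which is the kind of signal that suggests a feature could be removed.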
Feature importance can help in understanding the relationship between the input variables and the model’s prediction and can guide feature selection and engineering efforts to improve the model’s performance. It can also provide insights into the underlying problem and the data being used and can help in identifying potential biases or data quality issues.
In Qlik AutoML, the permutation importance of each feature is represented as a graph. This can be used to estimate feature importance. Another method that is visible in AutoML is SHAP importance values. The next section will cover the principles of SHAP importance values.
SHAP values
SHAP (SHapley Additive exPlanations) values are a technique for interpreting the output of machine-learning models by assigning an importance score to each input feature.
SHAP values are based on game theory and the concept of Shapley values, which provide a way to fairly distribute the value of a cooperative game among its players. In the context of machine learning, the game is the prediction task, and the players are the input features. The SHAP values represent the contribution of each feature to the difference between a specific prediction and the expected value of the output variable.
The SHAP values approach involves computing the contribution of each feature by evaluating the model’s output for all possible combinations of features, with and without the feature of interest. The contribution of the feature is the difference in the model’s output between the two cases averaged over all possible combinations.
SHAP values provide a more nuanced understanding of the relationship between the input features and the model’s output than other feature importance measures, as they account for interactions between features and the potential correlation between them.
SHAP values can be visualized using a SHAP plot, which shows the contribution of each feature to the model’s output for a specific prediction. This plot can help in understanding the relative importance of each feature and how they are influencing the model’s prediction.
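For a model with only a few features, the averaging over feature combinations can be carried out exactly. The following sketch computes Shapley values by brute force for a hypothetical three-feature model. Replacing an absent feature with a fixed baseline value is a simplifying assumption made here; practical SHAP implementations use more elaborate approximations, and none of the names below come from a real library.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical model with an additive term and an interaction term."""
    return 2 * x[0] + x[1] * x[2]


def value(subset, instance, baseline):
    """Model output when only the features in `subset` take the instance's values."""
    mixed = [instance[i] if i in subset else baseline[i] for i in range(len(instance))]
    return model(mixed)


def shapley_values(instance, baseline):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(instance)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i}, instance, baseline)
                without_i = value(set(subset), instance, baseline)
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


instance = [1, 2, 3]
baseline = [0, 0, 0]  # assumed reference point standing in for the expected input

phi = shapley_values(instance, baseline)
print("Shapley values:", [round(p, 3) for p in phi])
print("Sum of contributions:", round(sum(phi), 3))
print("Prediction minus baseline output:", model(instance) - model(baseline))
```

The contributions add up to the difference between the prediction and the baseline output, and the interaction between the second and third features is split equally between them. The number of subsets doubles with every added feature, which is why exact computation quickly becomes expensive.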
Difference between SHAP and permutation importance
Permutation importance and SHAP are alternative ways of measuring feature importance. The main difference between the two is that permutation importance is based on the decrease in model performance. It is a simpler and more computationally efficient approach to compute feature importance but may not accurately reflect the true importance of features in complex models.
SHAP importance is based on the magnitude of feature attributions. SHAP values provide a more nuanced understanding of feature importance but can be computationally expensive and may not be feasible for very large datasets or complex models.
Permutation importance can be used to do the following:
- Understand which features to keep and which to abandon
- Understand the feature importance for model accuracy
- Understand whether there is data leakage, meaning that information from outside the training dataset is used to create or evaluate a model, resulting in over-optimistic performance estimates or incorrect predictions
SHAP importance can be used to do the following:
- Understand which features have the greatest influence on the predicted outcome
- Understand how the different values of the feature affect the model prediction
- Understand what the most influential rows are in the dataset
We can see an example of a permutation importance graph and SHAP graph in the following figure, as seen in Qlik AutoML:
Figure 1.6: Permutation importance and SHAP importance graphs
Note
We will utilize both permutation importance and SHAP importance in our hands-on examples later in this book.