When choosing a machine-learning algorithm, you have to weigh numerous factors to pick the right one for the task. The choice should be based not only on the predicted output (category, value, cluster, and so on) but also on other factors, such as these:
- Training time
- Size of data and number of features you're processing
- Accuracy
- Linearity
- Number of possible parameters
Training time can range from minutes to hours, depending not only on the algorithm but also on the number of features entering the model and the total amount of data being processed. A well-chosen algorithm can, however, train much faster than the alternatives. In general, regression models tend to train fastest, whereas neural network models sit at the other end of the training-time spectrum.
Remember that developing a machine-learning model is iterative work. You will usually try several models and compare their metrics. Based on the metrics captured, you'll fine-tune the models, rerun the comparison on selected candidates, and choose one model for operations. Even with more experience, you might not choose the right algorithm for your model at first, and you might be surprised that other algorithms can outperform the first chosen candidate, as shown:
Accuracy, in the broad sense, covers several different metrics that we can consider when evaluating how a machine-learning model performs. Different types of algorithms are compared using different metrics, and the metrics used for one type generally don't apply to another. The most common metrics for classification algorithms include these:
- Confusion matrix
- Accuracy
- AUC
- Precision
- Recall
- F1 score
The confusion matrix is usually the first thing we use to evaluate a machine-learning model because it's very intuitive and easy to understand. It is essentially a pivot table with actual and predicted dimensions, displaying the number of cases for each combination of actual and predicted class. The matrix is not used as a metric itself, but its numbers feed most of the other performance measures for the model:
For a two-class problem, this matrix displays four values. True Positive is the number of cases where the prediction was true and the actual data was also true. Remember that we're in the supervised learning category, so the model is trained and scored on known data: we know the actual state and can check whether the prediction matches it. True Negative is the opposite: the actual state was false and the model was right in predicting false. False Positive and False Negative are the cases where the prediction and the actual state don't match. In statistics, a False Positive is also known as a Type I error, and a False Negative as a Type II error. In hypothesis testing, the acceptable Type I error rate is called the significance level and is commonly set to 0.05 (5%). There will almost always be some errors with your model; a model that makes no errors at all on its training data is likely overfitted and, in real production after development, would not perform well with unknown data points. There's no rule on which type of error to minimize; this depends solely on the business case and on the type of data science project and question that you're working on.
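To make the four values concrete, here is a minimal sketch in plain Python that counts them for a two-class problem. The function name `confusion_counts` and the sample labels are made up for illustration; they are not taken from any particular library or dataset.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN, treating True as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1      # predicted true, actually true
        elif not a and not p:
            counts["TN"] += 1      # predicted false, actually false
        elif p:
            counts["FP"] += 1      # predicted true, actually false (Type I error)
        else:
            counts["FN"] += 1      # predicted false, actually true (Type II error)
    return counts


# Hypothetical labels, invented only to exercise the function.
actual    = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]

matrix = confusion_counts(actual, predicted)
print(matrix)
```

All the other classification metrics in this section can be derived from these four counts.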
Accuracy is another metric used to evaluate a model. It is the number of correct predictions divided by the total number of predictions made by the model:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is a very good metric if the classes in the data are balanced, but it can be very misleading if one of the classes is extremely dominant compared to the other. As an example, consider a dataset with 100 emails, where only five emails are spam. If the model performs terribly and marks all emails as non-spam, it classifies 95 emails correctly and five emails incorrectly. Although the model was not able to classify any email as spam, its accuracy would still be 95%. As you can see, accuracy can be very misleading when the dataset is not balanced.
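The spam example can be reproduced in a few lines of plain Python. This is only a sketch of the arithmetic; the variable names are our own, and the "model" is simply a list of all-negative predictions.

```python
# 100 emails, 5 of which are spam (True = spam), as in the example above.
actual = [True] * 5 + [False] * 95

# A model that marks every email as non-spam.
predicted = [False] * 100

# Accuracy: correct predictions over all predictions.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Number of spam emails the model actually caught.
spam_caught = sum(a and p for a, p in zip(actual, predicted))

print(f"accuracy = {accuracy:.0%}, spam caught = {spam_caught}")
```

The accuracy looks excellent even though the model never identifies a single spam email, which is why accuracy should be read together with the other metrics on imbalanced data.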
Closely related to accuracy is precision. Precision is the ratio of correct positive predictions (true positives) to all positive predictions:
Precision = TP / (TP + FP)
Recall is used closely together with precision, and its calculation is very similar, except that another part of the confusion matrix is used in the denominator:
Recall = TP / (TP + FN)
Recall and precision tell us how the machine-learning model behaves with regard to false negatives and false positives. Precision tells us how often a positive prediction is correct, whereas recall is less about capturing each case correctly and more about capturing all cases with a given classification. If you would like to minimize false negatives (Type II errors), you would fine-tune the model for high recall (nearly 100%) with reasonable precision; if you would like to minimize false positives (Type I errors), you would focus on high precision in your model metrics instead.
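As a rough illustration, precision and recall can be computed directly from the confusion matrix counts. The counts below are hypothetical, and returning 0.0 when a denominator is zero is a convention we assume here, not a universal rule.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention for 0/0


def recall(tp, fn):
    """Share of actual positives that the model captured."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # assumed convention for 0/0


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

With these counts the model is fairly precise but misses a third of the actual positives, so a project that cannot tolerate false negatives would need to trade some precision for higher recall.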
As these two metrics are so closely related, they are also used to calculate the F1 score, which combines both as their harmonic mean:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
When a machine-learning model is evaluated, you can usually find all these metrics, together with the confusion matrix, in one place:
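The following sketch shows why the harmonic mean is used: it stays low unless both precision and recall are reasonably high. The function name `f1_score` and the input values are chosen for illustration only, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: one balanced model and one very lopsided model.
balanced = f1_score(0.8, 0.8)
lopsided = f1_score(1.0, 0.1)

print(f"balanced F1 = {balanced:.3f}")
print(f"lopsided F1 = {lopsided:.3f}")
```

A plain arithmetic mean would score the lopsided model at 0.55, whereas the harmonic mean pulls it down towards the weaker of the two values.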
The last missing piece of the basic performance metrics is the Area Under the ROC Curve (AUC). AUC represents the model's ability to distinguish between positive and negative outcomes. When AUC = 0.5, the model is doing no better than randomly guessing the outcome, and when AUC reaches 1.0, the model separates the two classes perfectly. The Receiver Operating Characteristic (ROC) curve is built from two factors:
- Sensitivity: The true positive rate, TP / (TP + FN), which is the same as recall
- Specificity: The true negative rate, TN / (TN + FP). The ROC curve plots sensitivity against the false positive rate, which equals 1 - specificity.
On the chart, you can see two ROC curves that compare the performance of two different algorithms used on the same dataset.
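AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise interpretation rather than drawing the curve, and also computes a single point on the ROC curve for a given threshold. The function names, scores, and threshold are hypothetical, and the pairwise loop is meant for small examples only.

```python
def auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    pos = [s for a, s in zip(actual, scores) if a]
    neg = [s for a, s in zip(actual, scores) if not a]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_point(actual, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    positives = sum(actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives


# Hypothetical labels and model scores, invented for illustration.
actual = [True, False, True, False]
scores = [0.9, 0.8, 0.4, 0.3]

model_auc = auc(actual, scores)
fpr, tpr = roc_point(actual, scores, threshold=0.5)
print(f"AUC = {model_auc}, FPR = {fpr}, TPR = {tpr}")
```

Sweeping the threshold from high to low and plotting each (FPR, TPR) pair traces the ROC curve; the closer the curve bends towards the top-left corner, the higher the AUC.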