Exploring statistical tests for comparing model metrics
In machine learning, model evaluation often relies on averaging metrics computed across different folds or partitions, such as holdout and validation sets, to compare the performance of candidate models. However, relying solely on these averages may not give a complete picture of a model's performance and generalizability: two models can have similar mean scores yet very different variability across folds. A more robust approach incorporates statistical hypothesis tests, which assess whether an observed difference in performance is statistically significant or could plausibly be due to random chance.
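As a concrete illustration, here is a minimal sketch of the averaging approach described above. The dataset (synthetic data from make_classification), the two models (LogisticRegression and RandomForestClassifier), and the fold count are illustrative assumptions, not details from the original discussion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Per-fold accuracy for two candidate models on the same 10 folds
# (the default splitter is deterministic, so both models see identical folds).
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
scores_b = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=10)

# The naive comparison: whichever mean is higher "wins".
print(f"Model A mean accuracy: {scores_a.mean():.4f} (+/- {scores_a.std():.4f})")
print(f"Model B mean accuracy: {scores_b.mean():.4f} (+/- {scores_b.std():.4f})")
```

Note that comparing the two means alone ignores the fold-to-fold spread that the standard deviations hint at, which is exactly the gap a hypothesis test is meant to fill.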
Statistical hypothesis tests are procedures for determining whether observed data provide sufficient evidence to reject a null hypothesis in favor of an alternative hypothesis; they quantify how likely it is that an observed difference arose by random chance rather than reflecting a genuine effect. In this setting, the null hypothesis (H0) is the default assumption that there is no real difference between the models' performance, while the alternative hypothesis (H1) states that a genuine difference exists.
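To make this concrete, the sketch below applies a paired t-test to per-fold scores, since each fold yields one score per model and the observations are naturally paired. The score arrays here are hypothetical placeholders; in practice they would be the per-fold metrics of two models evaluated on identical splits, such as scores_a and scores_b from the previous snippet.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models on the same 10 folds.
scores_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79])
scores_b = np.array([0.78, 0.77, 0.80, 0.79, 0.81, 0.75, 0.79, 0.78, 0.82, 0.77])

# Paired test: the folds pair the observations, because both models are
# scored on the same train/test splits.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

alpha = 0.05  # conventional significance level for rejecting H0
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the observed difference is unlikely to be random chance.")
else:
    print("Fail to reject H0: the difference may be due to chance.")
```

The p-value is the probability of seeing a difference at least this large if H0 were true; comparing it against a pre-chosen significance level (here 0.05) is what turns the raw score arrays into a principled accept-or-reject decision.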