Complementary statistical tests
Here, one model is selected over another plausible one: the accuracy of one model seems higher than that of the other, or the area under the ROC curve (AUC) of one model is greater than that of another. However, it is not appropriate to base such conclusions on the raw numbers alone; we also need to ask whether the differences are significant from the point of view of statistical inference. In the analytical world, it is pivotal that we make use of statistical tests, whenever they are available, to validate claims and hypotheses. One reason for using statistical tests is that probability can be highly counterintuitive, and what appears on the surface might not hold up on closer inspection once chance variation is incorporated. For instance, if a fair coin is tossed 100 times, it is imprudent to insist that the number of heads must be exactly 50. Hence, if a fair coin shows 45 heads, we need to allow for the chance variation that takes the count below 50. Caution must be exercised whenever we are dealing with uncertain data. A few examples are in order here. Two variables might appear to be independent of each other, and their sample correlation might be nearly zero, yet a correlation test may nevertheless conclude that the correlation is significantly different from zero. Since we will be sampling and resampling a lot in this text, we will also look at the related tests.
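To make the coin-toss point concrete, the following minimal sketch runs an exact binomial test of the fair-coin hypothesis for 45 heads in 100 tosses; it is purely illustrative and not part of the later analysis:

# Exact binomial test of H0: P(heads) = 0.5, given 45 heads in 100 tosses
binom.test(45, 100, p = 0.5)
# The two-sided p-value is roughly 0.37, far above the usual 0.05 cut-off,
# so 45 heads offers no evidence against the coin being fair.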
Permutation test
Suppose that we have two processes, A and B, whose variances are assumed to be equal, though the common value is unknown. Three independent observations from process A result in yields of 18, 20, and 22, while three independent observations from process B give yields of 24, 26, and 28. Under the assumption that the yield follows a normal distribution, we would like to test whether the means of processes A and B are the same. This is a suitable case for applying the t-test, since the number of observations is small. An application of the t.test function shows that the two means differ from each other, and this intuitively appears to be the case.
Now, the assumption under the null hypothesis is that the means are equal, while the variance, although unknown, is assumed to be the same for the two processes. Consequently, we have a genuine reason to believe that the observations from process A might well have occurred in process B too, and vice versa. We can therefore swap one observation in process B with one in process A and recompute the t-test. The process can be repeated over all possible reassignments of the two samples. In general, if we have m observations from population 1 and n observations from population 2, we can form C(m+n, m) = (m+n)!/(m! n!) different reassigned samples and as many tests. An overall test can be based on such permutation samples, and such tests are called permutation tests.
For the process A and B observations, we will first apply the t-test and then the permutation test. The t.test function is available in the core stats package, and the permutation test is taken from the perm package:
> library(perm)
> x <- c(18,20,22); y <- c(24,26,28)
> t.test(x,y,var.equal = TRUE)

	Two Sample t-test

data:  x and y
t = -3.6742346, df = 4, p-value = 0.02131164
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.533915871  -1.466084129
sample estimates:
mean of x mean of y 
       20        26 
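As a quick check on the output, the reported t statistic can be reproduced by hand from the pooled variance. The following is a minimal sketch of that calculation:

# Reproduce the two-sample t statistic by hand from the pooled variance
x <- c(18, 20, 22); y <- c(24, 26, 28)
sp2 <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
  (length(x) + length(y) - 2)                    # pooled variance, here equal to 4
(mean(x) - mean(y)) / sqrt(sp2 * (1/length(x) + 1/length(y)))
# [1] -3.674235, matching the t.test() output above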
The small p-value suggests that the means of processes A and B are not equal. Consequently, we now apply the permutation test permTS from the perm package:
> permTS(x,y)

	Exact Permutation Test (network algorithm)

data:  x and y
p-value = 0.1
alternative hypothesis: true mean x - mean y is not equal to 0
sample estimates:
mean x - mean y 
             -6 
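The exact p-value of 0.1 can be verified by brute force. The following minimal sketch, independent of the perm package, enumerates all choose(6, 3) = 20 ways of assigning three of the six pooled observations to process A and computes the proportion of assignments whose mean difference is at least as extreme as the observed value of -6:

# Pool the six yields and enumerate all choose(6, 3) = 20 assignments
pooled <- c(18, 20, 22, 24, 26, 28)
obs_diff <- mean(c(18, 20, 22)) - mean(c(24, 26, 28))      # observed difference: -6
perm_diffs <- apply(combn(6, 3), 2,
                    function(idx) mean(pooled[idx]) - mean(pooled[-idx]))
# Two-sided permutation p-value: proportion of assignments at least as extreme
mean(abs(perm_diffs) >= abs(obs_diff))
# [1] 0.1

Only the two completely separated assignments are as extreme as the observed one, which gives 2/20 = 0.1.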
The p-value is now 0.1, which means that the permutation test fails to reject the hypothesis that the means of the two processes are equal. Does this mean that the permutation test will always lead to this conclusion, contradicting the t-test? The answer is given in the next code segment:
> x2 <- c(16,18,20,22); y2 <- c(24,26,28,30)
> t.test(x2,y2,var.equal = TRUE)

	Two Sample t-test

data:  x2 and y2
t = -4.3817805, df = 6, p-value = 0.004659215
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12.46742939  -3.53257061
sample estimates:
mean of x mean of y 
       19        27 

> permTS(x2,y2)

	Exact Permutation Test (network algorithm)

data:  x2 and y2
p-value = 0.02857143
alternative hypothesis: true mean x2 - mean y2 is not equal to 0
sample estimates:
mean x2 - mean y2 
               -8 

With four observations in each sample there are choose(8, 4) = 70 possible reassignments, only two of which are as extreme as the observed split, so the exact p-value is 2/70 ≈ 0.0286 and the permutation test now agrees with the t-test in rejecting the equality of the means.
Chi-square and McNemar test
We had five models for the hypothyroid classification problem. We then calculated the accuracy and were satisfied with the numbers. Let's first look at the number of errors that each fitted model makes. We have 636 observations in the test partition, and 42 of them test positive for the hypothyroid problem. Note that if we marked all the patients as negative, we would get an accuracy of 1 - 42/636 = 0.934, or about 93.4%. Using the table function, we pit the actuals against the predicted values and see how often each fitted model goes wrong. We remark here that identifying the hypothyroid cases as hypothyroid and the negative cases as negative are the correct predictions, while marking a hypothyroid case as negative, or vice versa, is an error. For each model, we look at the misclassification errors:
> table(LR_Predict_Bin,testY_numeric)
              testY_numeric
LR_Predict_Bin   1   2
             1  32   7
             2  10 587
> table(NN_Predict,HT2_TestY)
             HT2_TestY
NN_Predict    hypothyroid negative
  hypothyroid          41       22
  negative              1      572
> table(NB_predict,HT2_TestY)
             HT2_TestY
NB_predict    hypothyroid negative
  hypothyroid          33        8
  negative              9      586
> table(CT_predict,HT2_TestY)
             HT2_TestY
CT_predict    hypothyroid negative
  hypothyroid          38        4
  negative              4      590
> table(SVM_predict,HT2_TestY)
             HT2_TestY
SVM_predict   hypothyroid negative
  hypothyroid          34        2
  negative              8      592
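As a quick sanity check on these tables, the overall accuracy and the sensitivity of a model can be recovered directly from its confusion table. The following minimal sketch rebuilds the classification tree table shown above (the object CT_tab is introduced here only for illustration):

# Rebuild the classification tree confusion table and recover accuracy and sensitivity
CT_tab <- matrix(c(38, 4, 4, 590), nrow = 2, byrow = TRUE,
                 dimnames = list(CT_predict = c("hypothyroid", "negative"),
                                 HT2_TestY  = c("hypothyroid", "negative")))
sum(diag(CT_tab)) / sum(CT_tab)   # overall accuracy, (38 + 590)/636, about 0.987
CT_tab[1, 1] / sum(CT_tab[, 1])   # sensitivity, 38/42, about 0.905
# The accuracy comfortably exceeds the 93.4% all-negative baseline,
# which would have a sensitivity of zero.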
From the misclassification tables, we can see that the neural network identifies 41 out of the 42 hypothyroid cases correctly, but it also wrongly labels far more negative cases as hypothyroid (22 of them) than the other models do. The question that arises is whether the correct predictions of the fitted models occur merely by chance, or whether they genuinely depend on the truth. To test this in the hypothesis-testing framework, we would like to test whether the actual values and the predicted values are independent of each other. Technically, the null hypothesis is that the prediction is independent of the actual value, and if a model explains the truth, the null hypothesis must be rejected; we should then conclude that the fitted model's predictions depend on the truth. We deploy two solutions here, the chi-square test and the McNemar test:
> chisq.test(table(LR_Predict_Bin,testY_numeric))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(LR_Predict_Bin, testY_numeric)
X-squared = 370.53501, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(NN_Predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(NN_Predict, HT2_TestY)
X-squared = 377.22569, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(NB_predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(NB_predict, HT2_TestY)
X-squared = 375.18659, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(CT_predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(CT_predict, HT2_TestY)
X-squared = 498.44791, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(SVM_predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(SVM_predict, HT2_TestY)
X-squared = 462.41803, df = 1, p-value < 0.00000000000000022204

> mcnemar.test(table(LR_Predict_Bin,testY_numeric))

	McNemar's Chi-squared test with continuity correction

data:  table(LR_Predict_Bin, testY_numeric)
McNemar's chi-squared = 0.23529412, df = 1, p-value = 0.6276258

> mcnemar.test(table(NN_Predict,HT2_TestY))

	McNemar's Chi-squared test with continuity correction

data:  table(NN_Predict, HT2_TestY)
McNemar's chi-squared = 17.391304, df = 1, p-value = 0.00003042146

> mcnemar.test(table(NB_predict,HT2_TestY))

	McNemar's Chi-squared test with continuity correction

data:  table(NB_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1

> mcnemar.test(table(CT_predict,HT2_TestY))

	McNemar's Chi-squared test

data:  table(CT_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1

> mcnemar.test(table(SVM_predict,HT2_TestY))

	McNemar's Chi-squared test with continuity correction

data:  table(SVM_predict, HT2_TestY)
McNemar's chi-squared = 2.5, df = 1, p-value = 0.1138463
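Both statistics can be verified by hand from the confusion tables, which also makes clear what each test measures: the chi-square test assesses the association over the whole table, while McNemar's test compares only the two kinds of error, that is, the off-diagonal counts. The following minimal sketch, separate from the analysis above, recomputes the Yates-corrected chi-square statistic for the logistic regression table and the continuity-corrected McNemar statistic for the neural network table:

# Yates-corrected chi-square statistic for the logistic regression table
LR_tab <- matrix(c(32, 7, 10, 587), nrow = 2, byrow = TRUE)
n <- sum(LR_tab)
num <- n * (abs(LR_tab[1, 1] * LR_tab[2, 2] - LR_tab[1, 2] * LR_tab[2, 1]) - n/2)^2
den <- prod(rowSums(LR_tab)) * prod(colSums(LR_tab))
num / den                                          # about 370.54, as reported above

# McNemar's statistic uses only the off-diagonal counts; for the neural
# network table these are 22 and 1
off <- c(22, 1)
mc <- (abs(off[1] - off[2]) - 1)^2 / sum(off)      # continuity-corrected statistic
mc                                                 # 400/23, about 17.39, as reported above
pchisq(mc, df = 1, lower.tail = FALSE)             # p-value, about 3e-05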
The answers provided by the chi-square tests clearly show that the predictions of each fitted model are not down to chance; the predictions of both the hypothyroid cases and the negative cases are genuinely related to the actual classes. The interpretation of, and the conclusions from, McNemar's test are left to the reader. The final important measure in classification problems is the ROC curve, which is considered next.
ROC test
The ROC curve is an important improvement on the false positive and true negative measures of model performance. For a detailed explanation, refer to Chapter 9 of Tattar et al. (2017). The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies, and we measure the area under the curve (AUC) for the fitted model.
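To make the construction concrete, the following minimal sketch traces an ROC curve by hand for a toy set of scores and labels; the numbers are purely illustrative and are not taken from the hypothyroid data:

# Toy predicted scores and true labels (1 = positive class)
scores <- c(0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
# Sweep the classification threshold over the observed scores and record
# the true positive and false positive rates at each cut-off
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
fpr_pts <- c(0, fpr); tpr_pts <- c(0, tpr)
plot(fpr_pts, tpr_pts, type = "b",
     xlab = "False positive rate", ylab = "True positive rate")
# AUC by the trapezoidal rule: 0.8125 for this toy example
sum(diff(fpr_pts) * (head(tpr_pts, -1) + tail(tpr_pts, -1)) / 2)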
The main goal that the ROC test attempts to achieve is the following. Suppose that Model 1 gives an AUC of 0.89 and Model 2 gives 0.91. Using the simple AUC criterion, we would outright conclude that Model 2 is better than Model 1. However, an important question that arises is whether 0.91 is significantly higher than 0.89. The roc.test function, from the pROC R package, provides the answer here. For the neural network and the classification tree, the following R segment gives the required answer:
> library(pROC)
> HT_NN_Prob <- predict(NN_fit,newdata=HT2_TestX,type="raw")
> HT_NN_roc <- roc(HT2_TestY,c(HT_NN_Prob))
> HT_NN_roc$auc
Area under the curve: 0.9723826
> HT_CT_Prob <- predict(CT_fit,newdata=HT2_TestX,type="prob")[,2]
> HT_CT_roc <- roc(HT2_TestY,HT_CT_Prob)
> HT_CT_roc$auc
Area under the curve: 0.9598765
> roc.test(HT_NN_roc,HT_CT_roc)

	DeLong's test for two correlated ROC curves

data:  HT_NN_roc and HT_CT_roc
Z = 0.72452214, p-value = 0.4687452
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
 AUC of roc1  AUC of roc2 
0.9723825557 0.9598765432 
Since the p-value is very large, we conclude that the AUCs of the two models are not significantly different; despite the higher AUC of the neural network, we cannot claim that it outperforms the classification tree on this criterion.
Statistical tests are vital and we recommend that they be used whenever suitable. The concepts highlighted in this chapter will be drawn on in more detail in the rest of the book.