Complementary statistical tests
Here, one model is selected over another plausible one: the accuracy of one model seems higher than that of the other, or the area under the ROC curve (AUC) of one model is greater than that of another. However, it is not appropriate to base such conclusions on the raw numbers alone; we also need to ask whether the differences are significant from the point of view of statistical inference. In the analytical world, it is pivotal that we make use of statistical tests, whenever they are available, to validate claims and hypotheses. One reason for using statistical tests is that probability can be highly counterintuitive, and what appears on the surface might not hold up on closer inspection once chance variation is incorporated. For instance, if a fair coin is tossed 100 times, it is imprudent to insist that the number of heads must be exactly 50. Hence, if a fair coin shows 45 heads, we need to allow for the chance variation that takes the count below 50. Caution must be exercised whenever we are dealing with uncertain data. A few examples are in order here. Two variables might appear to be independent of each other, and their sample correlation might be nearly zero, yet a correlation test may nevertheless conclude that the correlation is significantly different from zero. Since we will be sampling and resampling a lot in this text, we will also look at the related tests.
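To make the coin-toss point concrete, the following minimal sketch runs an exact binomial test of the fair-coin hypothesis for 45 heads in 100 tosses; it is purely illustrative and not part of the later analysis:

# Exact binomial test of H0: P(heads) = 0.5, given 45 heads in 100 tosses
binom.test(45, 100, p = 0.5)
# The two-sided p-value is roughly 0.37, far above the usual 0.05 cut-off,
# so 45 heads offers no evidence against the coin being fair.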
Permutation test
Suppose that we have two processes, A and B, whose variances are assumed to be equal, though the common value is unknown. Three independent observations from process A result in yields of 18, 20, and 22, while three independent observations from process B give yields of 24, 26, and 28. Under the assumption that the yield follows a normal distribution, we would like to test whether the means of processes A and B are the same. This is a suitable case for applying the t-test, since the number of observations is small. An application of the t.test function shows that the two means differ from each other, and this intuitively appears to be the case.
Now, the assumption under the null hypothesis is that the means are equal, while the variance, although unknown, is assumed to be the same for the two processes. Consequently, we have a genuine reason to believe that the observations from process A might well have occurred in process B too, and vice versa. We can therefore swap one observation in process B with one in process A and recompute the t-test. The process can be repeated over all possible reassignments of the two samples. In general, if we have m observations from population 1 and n observations from population 2, we can form C(m+n, m) = (m+n)!/(m! n!) different reassigned samples and as many tests. An overall test can be based on such permutation samples, and such tests are called permutation tests.
For the process A and B observations, we will first apply the t-test and then the permutation test. The t.test function is available in the core stats package, and the permutation test is taken from the perm package:
> library(perm)
> x <- c(18,20,22); y <- c(24,26,28)
> t.test(x,y,var.equal = TRUE)

	Two Sample t-test

data:  x and y
t = -3.6742346, df = 4, p-value = 0.02131164
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.533915871  -1.466084129
sample estimates:
mean of x mean of y 
       20        26 
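As a quick check on the output, the reported t statistic can be reproduced by hand from the pooled variance. The following is a minimal sketch of that calculation:

# Reproduce the two-sample t statistic by hand from the pooled variance
x <- c(18, 20, 22); y <- c(24, 26, 28)
sp2 <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
  (length(x) + length(y) - 2)                    # pooled variance, here equal to 4
(mean(x) - mean(y)) / sqrt(sp2 * (1/length(x) + 1/length(y)))
# [1] -3.674235, matching the t.test() output above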
The small p-value suggests that the means of processes A and B are not equal. Consequently, we now apply the permutation test permTS from the perm package:
> permTS(x,y)

	Exact Permutation Test (network algorithm)

data:  x and y
p-value = 0.1
alternative hypothesis: true mean x - mean y is not equal to 0
sample estimates:
mean x - mean y 
             -6 
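The exact p-value of 0.1 can be verified by brute force. The following minimal sketch, independent of the perm package, enumerates all choose(6, 3) = 20 ways of assigning three of the six pooled observations to process A and computes the proportion of assignments whose mean difference is at least as extreme as the observed value of -6:

# Pool the six yields and enumerate all choose(6, 3) = 20 assignments
pooled <- c(18, 20, 22, 24, 26, 28)
obs_diff <- mean(c(18, 20, 22)) - mean(c(24, 26, 28))      # observed difference: -6
perm_diffs <- apply(combn(6, 3), 2,
                    function(idx) mean(pooled[idx]) - mean(pooled[-idx]))
# Two-sided permutation p-value: proportion of assignments at least as extreme
mean(abs(perm_diffs) >= abs(obs_diff))
# [1] 0.1

Only the two completely separated assignments are as extreme as the observed one, which gives 2/20 = 0.1.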
The p-value is now 0.1, which means that the permutation test fails to reject the hypothesis that the means of the two processes are equal. Does this mean that the permutation test will always lead to this conclusion, contradicting the t-test? The answer is given in the next code segment:
> x2 <- c(16,18,20,22); y2 <- c(24,26,28,30)
> t.test(x2,y2,var.equal = TRUE)

	Two Sample t-test

data:  x2 and y2
t = -4.3817805, df = 6, p-value = 0.004659215
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12.46742939  -3.53257061
sample estimates:
mean of x mean of y 
       19        27 

> permTS(x2,y2)

	Exact Permutation Test (network algorithm)

data:  x2 and y2
p-value = 0.02857143
alternative hypothesis: true mean x2 - mean y2 is not equal to 0
sample estimates:
mean x2 - mean y2 
               -8 

With four observations in each sample there are choose(8, 4) = 70 possible reassignments, only two of which are as extreme as the observed split, so the exact p-value is 2/70 ≈ 0.0286 and the permutation test now agrees with the t-test in rejecting the equality of the means.
Chi-square and McNemar test
We had five models for the hypothyroid classification problem. We then calculated the accuracy and were satisfied with the numbers. Let's first look at the number of errors that each fitted model makes. We have 636 observations in the test partition, and 42 of them test positive for the hypothyroid problem. Note that if we marked all the patients as negative, we would get an accuracy of 1 - 42/636 = 0.934, or about 93.4%. Using the table function, we pit the actuals against the predicted values and see how often each fitted model goes wrong. We remark here that identifying the hypothyroid cases as hypothyroid and the negative cases as negative are the correct predictions, while marking a hypothyroid case as negative, or vice versa, is an error. For each model, we look at the misclassification errors:
> table(LR_Predict_Bin,testY_numeric)
              testY_numeric
LR_Predict_Bin   1   2
             1  32   7
             2  10 587
> table(NN_Predict,HT2_TestY)
             HT2_TestY
NN_Predict    hypothyroid negative
  hypothyroid          41       22
  negative              1      572
> table(NB_predict,HT2_TestY)
             HT2_TestY
NB_predict    hypothyroid negative
  hypothyroid          33        8
  negative              9      586
> table(CT_predict,HT2_TestY)
             HT2_TestY
CT_predict    hypothyroid negative
  hypothyroid          38        4
  negative              4      590
> table(SVM_predict,HT2_TestY)
             HT2_TestY
SVM_predict   hypothyroid negative
  hypothyroid          34        2
  negative              8      592
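As a quick sanity check on these tables, the overall accuracy and the sensitivity of a model can be recovered directly from its confusion table. The following minimal sketch rebuilds the classification tree table shown above (the object CT_tab is introduced here only for illustration):

# Rebuild the classification tree confusion table and recover accuracy and sensitivity
CT_tab <- matrix(c(38, 4, 4, 590), nrow = 2, byrow = TRUE,
                 dimnames = list(CT_predict = c("hypothyroid", "negative"),
                                 HT2_TestY  = c("hypothyroid", "negative")))
sum(diag(CT_tab)) / sum(CT_tab)   # overall accuracy, (38 + 590)/636, about 0.987
CT_tab[1, 1] / sum(CT_tab[, 1])   # sensitivity, 38/42, about 0.905
# The accuracy comfortably exceeds the 93.4% all-negative baseline,
# which would have a sensitivity of zero.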
From the misclassification tables, we can see that the neural network identifies 41 out of the 42 hypothyroid cases correctly, but it also wrongly labels far more negative cases as hypothyroid (22 of them) than the other models do. The question that arises is whether the correct predictions of the fitted models occur merely by chance, or whether they genuinely depend on the truth. To test this in the hypothesis-testing framework, we would like to test whether the actual values and the predicted values are independent of each other. Technically, the null hypothesis is that the prediction is independent of the actual value, and if a model explains the truth, the null hypothesis must be rejected; we should then conclude that the fitted model's predictions depend on the truth. We deploy two solutions here, the chi-square test and the McNemar test:
> chisq.test(table(LR_Predict_Bin,testY_numeric))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(LR_Predict_Bin, testY_numeric)
X-squared = 370.53501, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(NN_Predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(NN_Predict, HT2_TestY)
X-squared = 377.22569, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(NB_predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(NB_predict, HT2_TestY)
X-squared = 375.18659, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(CT_predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(CT_predict, HT2_TestY)
X-squared = 498.44791, df = 1, p-value < 0.00000000000000022204

> chisq.test(table(SVM_predict,HT2_TestY))

	Pearson's Chi-squared test with Yates' continuity correction

data:  table(SVM_predict, HT2_TestY)
X-squared = 462.41803, df = 1, p-value < 0.00000000000000022204

> mcnemar.test(table(LR_Predict_Bin,testY_numeric))

	McNemar's Chi-squared test with continuity correction

data:  table(LR_Predict_Bin, testY_numeric)
McNemar's chi-squared = 0.23529412, df = 1, p-value = 0.6276258

> mcnemar.test(table(NN_Predict,HT2_TestY))

	McNemar's Chi-squared test with continuity correction

data:  table(NN_Predict, HT2_TestY)
McNemar's chi-squared = 17.391304, df = 1, p-value = 0.00003042146

> mcnemar.test(table(NB_predict,HT2_TestY))

	McNemar's Chi-squared test with continuity correction

data:  table(NB_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1

> mcnemar.test(table(CT_predict,HT2_TestY))

	McNemar's Chi-squared test

data:  table(CT_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1

> mcnemar.test(table(SVM_predict,HT2_TestY))

	McNemar's Chi-squared test with continuity correction

data:  table(SVM_predict, HT2_TestY)
McNemar's chi-squared = 2.5, df = 1, p-value = 0.1138463
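Both statistics can be verified by hand from the confusion tables, which also makes clear what each test measures: the chi-square test assesses the association over the whole table, while McNemar's test compares only the two kinds of error, that is, the off-diagonal counts. The following minimal sketch, separate from the analysis above, recomputes the Yates-corrected chi-square statistic for the logistic regression table and the continuity-corrected McNemar statistic for the neural network table:

# Yates-corrected chi-square statistic for the logistic regression table
LR_tab <- matrix(c(32, 7, 10, 587), nrow = 2, byrow = TRUE)
n <- sum(LR_tab)
num <- n * (abs(LR_tab[1, 1] * LR_tab[2, 2] - LR_tab[1, 2] * LR_tab[2, 1]) - n/2)^2
den <- prod(rowSums(LR_tab)) * prod(colSums(LR_tab))
num / den                                          # about 370.54, as reported above

# McNemar's statistic uses only the off-diagonal counts; for the neural
# network table these are 22 and 1
off <- c(22, 1)
mc <- (abs(off[1] - off[2]) - 1)^2 / sum(off)      # continuity-corrected statistic
mc                                                 # 400/23, about 17.39, as reported above
pchisq(mc, df = 1, lower.tail = FALSE)             # p-value, about 3e-05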
The answers provided by the chi-square tests clearly show that the predictions of each fitted model are not down to chance; the predictions of both the hypothyroid cases and the negative cases are genuinely related to the actual classes. The interpretation of, and the conclusions from, McNemar's test are left to the reader. The final important measure in classification problems is the ROC curve, which is considered next.
ROC test
The ROC curve is an important improvement on the false positive and true negative measures of model performance. For a detailed explanation, refer to Chapter 9 of Tattar et al. (2017). The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies, and we measure the area under the curve (AUC) for the fitted model.
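To make the construction concrete, the following minimal sketch traces an ROC curve by hand for a toy set of scores and labels; the numbers are purely illustrative and are not taken from the hypothyroid data:

# Toy predicted scores and true labels (1 = positive class)
scores <- c(0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
# Sweep the classification threshold over the observed scores and record
# the true positive and false positive rates at each cut-off
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
fpr_pts <- c(0, fpr); tpr_pts <- c(0, tpr)
plot(fpr_pts, tpr_pts, type = "b",
     xlab = "False positive rate", ylab = "True positive rate")
# AUC by the trapezoidal rule: 0.8125 for this toy example
sum(diff(fpr_pts) * (head(tpr_pts, -1) + tail(tpr_pts, -1)) / 2)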
The main goal that the ROC test attempts to achieve is the following. Suppose that Model 1 gives an AUC of 0.89 and Model 2 gives 0.91. Using the simple AUC criterion, we would outright conclude that Model 2 is better than Model 1. However, an important question that arises is whether 0.91 is significantly higher than 0.89. The roc.test function, from the pROC R package, provides the answer here. For the neural network and the classification tree, the following R segment gives the required answer:
> library(pROC)
> HT_NN_Prob <- predict(NN_fit,newdata=HT2_TestX,type="raw")
> HT_NN_roc <- roc(HT2_TestY,c(HT_NN_Prob))
> HT_NN_roc$auc
Area under the curve: 0.9723826
> HT_CT_Prob <- predict(CT_fit,newdata=HT2_TestX,type="prob")[,2]
> HT_CT_roc <- roc(HT2_TestY,HT_CT_Prob)
> HT_CT_roc$auc
Area under the curve: 0.9598765
> roc.test(HT_NN_roc,HT_CT_roc)

	DeLong's test for two correlated ROC curves

data:  HT_NN_roc and HT_CT_roc
Z = 0.72452214, p-value = 0.4687452
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
 AUC of roc1  AUC of roc2 
0.9723825557 0.9598765432 
Since the p-value is very large, we conclude that the AUCs of the two models are not significantly different; despite the higher AUC of the neural network, we cannot claim that it outperforms the classification tree on this criterion.
Statistical tests are vital and we recommend that they be used whenever suitable. The concepts highlighted in this chapter will be drawn on in more detail in the rest of the book.