An ensemble purview
The caret R package is central to ensemble machine learning methods in R. It provides a large framework within which different statistical and machine learning models can be put together to create an ensemble. With the version of the package current on the author's laptop, caret provides access to the following models:
> library(caret)
> names(getModelInfo())
  [1] "ada"            "AdaBag"         "AdaBoost.M1"
  [4] "adaboost"       "amdai"          "ANFIS"
  [7] "avNNet"         "awnb"           "awtan"
  ...
[229] "vbmpRadial"     "vglmAdjCat"     "vglmContRatio"
[232] "vglmCumulative" "widekernelpls"  "WM"
[235] "wsrf"           "xgbLinear"      "xgbTree"
[238] "xyf"
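If you want to inspect what caret knows about any one of these entries before using it, the modelLookup and getModelInfo functions can be queried directly. The following is a small sketch, with rpart as an illustrative choice:

> # tuning parameters registered for the rpart model
> modelLookup("rpart")
> # package(s) that must be installed for this model to run
> getModelInfo("rpart", regex = FALSE)[[1]]$library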
Depending on your requirements, you can choose any combination of these 238 models, and the authors of the package keep updating this list. Note that caret does not implement all of these models itself; rather, it is a platform that facilitates the ensembling of methods drawn from many other packages. Consequently, if you choose a model such as ANFIS, whose implementation lives in the frbs package, and that package is not available on your machine, then caret will prompt you at the terminal to install it. You need to key in the number 1 and continue; the package will then be installed and loaded, and the program will carry on. A way to check such dependencies up front is sketched below. It is good to be aware of this host of options for ensemble methods.
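Rather than waiting for the interactive prompt, you can check a model's package dependencies in advance. This is a minimal sketch, assuming the $library field of a model's getModelInfo entry lists the packages it requires:

> # look up the packages the ANFIS model needs, install any that are missing
> needed <- getModelInfo("ANFIS", regex = FALSE)[[1]]$library
> missing <- needed[!sapply(needed, requireNamespace, quietly = TRUE)]
> if (length(missing) > 0) install.packages(missing)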
A brief illustration of stack ensembling analytical models is provided next; the details will unfold later in the book. For the Hypothyroid dataset, we had a high average accuracy of about 98% across the five models. The Waveform dataset saw an average accuracy of approximately 88%, while the average for the German credit data was 75%. We will try to increase the accuracy for this last dataset. The improvement will be attempted using three models: naïve Bayes, logistic regression, and a classification tree. First, we need to partition the data into three parts: train, test, and stack:
> load("../Data/GC2.RData")
> set.seed(12345)
> Train_Test_Stack <- sample(c("Train","Test","Stack"), nrow(GC2),
+                            replace = TRUE, prob = c(0.5,0.25,0.25))
> GC2_Train <- GC2[Train_Test_Stack=="Train",]
> GC2_Test <- GC2[Train_Test_Stack=="Test",]
> GC2_Stack <- GC2[Train_Test_Stack=="Stack",]

The dependent and independent variables are marked next in character vectors, for programming convenience:

> # set the label name and the exogenous variables
> Endogenous <- 'good_bad'
> Exhogenous <- names(GC2_Train)[names(GC2_Train) != Endogenous]
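As a quick, optional sanity check, you can tabulate the sampled labels to confirm that the three blocks roughly follow the 50/25/25 split requested in the prob argument:

> table(Train_Test_Stack)
> prop.table(table(Train_Test_Stack))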
The models will be built on the training data first, and accuracy will be assessed on the test data using the area under the ROC curve (AUC). The control parameters are set up first, and the three models, naïve Bayes, classification tree, and logistic regression, are then created using the training dataset:
> # Creating a caret control object for the number of
> # cross-validations to be performed
> myControl <- trainControl(method='cv', number=3, returnResamp='none')
> # train all the ensemble models with GC2_Train
> model_NB <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous],
+                   method='naive_bayes', trControl=myControl)
> model_rpart <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous],
+                      method='rpart', trControl=myControl)
> model_glm <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous],
+                    method='glm', trControl=myControl)
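As an aside, caret can also tune directly on the ROC metric rather than accuracy. The following is a sketch under assumptions, not the author's setup: it requires class probabilities, uses caret's twoClassSummary function, and assumes the levels of good_bad are valid R names:

> # alternative control object: tune on ROC instead of accuracy
> myControlROC <- trainControl(method='cv', number=3, classProbs=TRUE,
+                              summaryFunction=twoClassSummary)
> model_NB_ROC <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous],
+                       method='naive_bayes', trControl=myControlROC,
+                       metric="ROC")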
Predictions for the test and stack blocks are carried out next, and the predicted probabilities are appended as new columns to the test and stack data frames. (Despite the rf_ prefix in its name, rf_PROB stores the probabilities from the rpart classification tree.)
> # get predictions for each ensemble model for the two remaining datasets
> # and add them back to themselves
> GC2_Test$NB_PROB <- predict(object=model_NB, GC2_Test[,Exhogenous],
+                             type="prob")[,1]
> GC2_Test$rf_PROB <- predict(object=model_rpart, GC2_Test[,Exhogenous],
+                             type="prob")[,1]
> GC2_Test$glm_PROB <- predict(object=model_glm, GC2_Test[,Exhogenous],
+                              type="prob")[,1]
> GC2_Stack$NB_PROB <- predict(object=model_NB, GC2_Stack[,Exhogenous],
+                              type="prob")[,1]
> GC2_Stack$rf_PROB <- predict(object=model_rpart, GC2_Stack[,Exhogenous],
+                              type="prob")[,1]
> GC2_Stack$glm_PROB <- predict(object=model_glm, GC2_Stack[,Exhogenous],
+                               type="prob")[,1]
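A remark on the [,1] indexing above: predict with type="prob" returns a data frame with one column per class level, so [,1] picks the probability of the first factor level. A quick check of which class that is:

> # confirm which class the first probability column refers to
> levels(GC2_Test[,Endogenous])
> head(predict(model_NB, GC2_Test[,Exhogenous], type="prob"))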
The ROC curve is an important measure for model assessment: the higher the area under the curve, the better the model. Note that these measures, as with any other measure here, will not match those of the models fitted earlier, since the data partition has changed. The roc function below comes from the pROC package:
> # see how each individual model performed on its own
> library(pROC)
> AUC_NB <- roc(GC2_Test[,Endogenous], GC2_Test$NB_PROB)
> AUC_NB$auc
Area under the curve: 0.7543
> AUC_rf <- roc(GC2_Test[,Endogenous], GC2_Test$rf_PROB)
> AUC_rf$auc
Area under the curve: 0.6777
> AUC_glm <- roc(GC2_Test[,Endogenous], GC2_Test$glm_PROB)
> AUC_glm$auc
Area under the curve: 0.7446
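To compare the three classifiers visually, the ROC curves can be overlaid on a single plot using pROC's plotting methods. This is an optional sketch:

> # overlay the three ROC curves for a visual comparison
> plot(AUC_NB, col="black")
> lines(AUC_rf, col="red")
> lines(AUC_glm, col="blue")
> legend("bottomright", legend=c("naive Bayes","rpart","glm"),
+        col=c("black","red","blue"), lty=1)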
For the test dataset, we can see that the areas under the curve for the naïve Bayes, classification tree, and logistic regression models are 0.7543, 0.6777, and 0.7446, respectively. If putting the predicted values together in some format leads to an increase in accuracy, the purpose of the ensemble technique has been accomplished. Accordingly, we treat the three columns of predicted probabilities appended to the stacked data frame as new input variables and build a naïve Bayes model on that data. The choice is arbitrary; you can try any other model, not necessarily restricted to one of these three. The AUC is then computed on the test data:
> # Stacking it together
> Exhogenous2 <- names(GC2_Stack)[names(GC2_Stack) != Endogenous]
> Stack_Model <- train(GC2_Stack[,Exhogenous2], GC2_Stack[,Endogenous],
+                      method='naive_bayes', trControl=myControl)
> Stack_Prediction <- predict(object=Stack_Model, GC2_Test[,Exhogenous2],
+                             type="prob")[,1]
> Stack_AUC <- roc(GC2_Test[,Endogenous], Stack_Prediction)
> Stack_AUC$auc
Area under the curve: 0.7631
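As noted above, the meta-learner need not be naïve Bayes. Swapping in logistic regression, for instance, is a one-line change; the following sketch (results will vary) reuses the same control object:

> # try logistic regression as the stacking model instead
> Stack_Model_glm <- train(GC2_Stack[,Exhogenous2], GC2_Stack[,Endogenous],
+                          method='glm', trControl=myControl)
> Stack_Prediction_glm <- predict(object=Stack_Model_glm, GC2_Test[,Exhogenous2],
+                                 type="prob")[,1]
> roc(GC2_Test[,Endogenous], Stack_Prediction_glm)$auc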
The AUC for the stacked model, at 0.7631, is higher than that of any of the earlier models, which is an improvement.
A host of questions should arise for the critical thinker. Why should this technique work? Will it lead to improvements in all possible cases? If yes, will simply adding more model predictions lead to further improvements? If not, how does one pick the base models so that we can be reasonably assured of improvement? What are the restrictions on the choice of models? We will provide answers to most of these questions throughout this book. In the next section, we will take a quick look at some useful statistical tests that aid the assessment of model performance.