Hands-On Ensemble Learning with R

You're reading from Hands-On Ensemble Learning with R: A beginner's guide to combining the power of machine learning algorithms using ensemble techniques

Product type Paperback
Published in Jul 2018
Publisher Packt
ISBN-13 9781788624145
Length 376 pages
Edition 1st Edition
Author: Prabhanjan Narayanachar Tattar

An ensemble purview

The caret R package is central to ensemble machine learning in R. It provides a large framework within which different statistical and machine learning models can be put together to create an ensemble. For the version of the package installed on the author's laptop, caret provides access to the following models:

> library(caret)
> names(getModelInfo())
  [1] "ada"                 "AdaBag"              "AdaBoost.M1" 
  [4] "adaboost"            "amdai"               "ANFIS" 
  [7] "avNNet"              "awnb"                "awtan"        
     
[229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"
[232] "vglmCumulative"      "widekernelpls"       "WM" 
[235] "wsrf"                "xgbLinear"           "xgbTree" 
[238] "xyf"               

Depending on your requirements, you can choose any combination of these 238 models, and the authors of the package keep updating this list. Note that not all of these models are implemented in the caret package itself; caret is a platform that facilitates the ensembling of methods implemented in other packages. Consequently, if you choose a model such as ANFIS, which is implemented in the R package frbs, and that package is not installed on your machine, caret will display a message on the terminal, as shown in the following figure:

Figure 7: Caret providing a message to install the required R package

Key in the number 1 to proceed: the package will be installed and loaded, and the program will continue. It is good to know the host of options available for ensemble methods. A brief illustration of stack ensembling of analytical models is provided here, and the details will unfold later in the book.
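To see ahead of time which package a caret model code depends on, the model registry can be queried directly. The following is a minimal sketch, assuming caret is installed; the helper name `required_packages` is hypothetical and not part of caret.

```r
library(caret)

# Hypothetical helper: return the packages that a caret model code relies on.
# regex = FALSE forces an exact match on the model code.
required_packages <- function(model_code) {
  info <- getModelInfo(model_code, regex = FALSE)[[1]]
  info$library
}

pkgs <- required_packages("ANFIS")

# Which of those packages are not yet installed on this machine?
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

print(pkgs)
print(missing_pkgs)
```

Any package listed in `missing_pkgs` can be installed beforehand with `install.packages()`, which avoids the interactive prompt during training.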

For the Hypothyroid dataset, the five models had a high average accuracy of 98%. The Waveform dataset saw an average accuracy of approximately 88%, while the average for the German Credit data was 75%. We will try to increase the accuracy for the German Credit dataset. The improvement will be attempted using three models: naïve Bayes, logistic regression, and classification tree. First, we need to partition the data into three parts: train, test, and stack:

> load("../Data/GC2.RData")
> set.seed(12345)
> Train_Test_Stack <- sample(c("Train","Test","Stack"),nrow(GC2),replace = TRUE,prob = c(0.5,0.25,0.25))
> GC2_Train <- GC2[Train_Test_Stack=="Train",]
> GC2_Test <- GC2[Train_Test_Stack=="Test",]
> GC2_Stack <- GC2[Train_Test_Stack=="Stack",]

The dependent and independent variables are marked next in character vectors for programming convenience:

> # set label name and Exhogenous
> Endogenous <- 'good_bad'
> Exhogenous <- names(GC2_Train)[names(GC2_Train) != Endogenous]
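The three-way partition and the variable bookkeeping can be checked on a small synthetic data frame. This is a sketch only: `toy` is an invented stand-in for `GC2`, and `split()` is used here as a compact alternative to the three separate subsetting calls.

```r
set.seed(12345)

# Synthetic stand-in for GC2: 1000 rows, one predictor, a two-level response
toy <- data.frame(
  x = rnorm(1000),
  good_bad = factor(sample(c("bad", "good"), 1000, replace = TRUE))
)

# Draw a block label for every row; because the labels are sampled at random,
# the block sizes only approximate the 50/25/25 target
block <- sample(c("Train", "Test", "Stack"), nrow(toy),
                replace = TRUE, prob = c(0.5, 0.25, 0.25))
blocks <- split(toy, block)

# Every row lands in exactly one block
block_sizes <- sapply(blocks, nrow)
print(block_sizes)

# Separate the response name from the predictor names
Endogenous <- "good_bad"
Exhogenous <- setdiff(names(toy), Endogenous)
print(Exhogenous)
```

Because the block labels are sampled rather than assigned by fixed counts, rerunning with a different seed gives slightly different block sizes.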

The models will be built on the training data first, and their performance will be assessed using the area under the ROC curve (AUC). The control parameters are set up first, and the three models, naïve Bayes, classification tree, and logistic regression, are then created using the training dataset:

> # Creating a caret control object for the number of 
> # cross-validations to be performed
> myControl <- trainControl(method='cv', number=3, returnResamp='none')
> # train the three base models with GC2_Train
> model_NB <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                    method='naive_bayes', trControl=myControl)
> model_rpart <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                      method='rpart', trControl=myControl)
> model_glm <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                        method='glm', trControl=myControl)
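The control object above leaves caret to tune each model on its default metric. As a variation that the book's code does not use, caret can also be asked to compute and optimize the cross-validated area under the ROC curve. The sketch below assumes the caret, rpart, and pROC packages are installed and uses caret's `twoClassSim()` to generate synthetic data; `rocControl` is a name introduced here for illustration.

```r
library(caret)

set.seed(12345)
# twoClassSim() generates a synthetic two-class data set with a factor column `Class`
sim <- twoClassSim(300)

# classProbs = TRUE and twoClassSummary make caret compute ROC (AUC),
# sensitivity, and specificity on every resample
rocControl <- trainControl(method = "cv", number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# metric = "ROC" selects the tuning parameter with the best cross-validated AUC
fit <- train(Class ~ ., data = sim, method = "rpart",
             metric = "ROC", trControl = rocControl)

print(fit$results[, c("cp", "ROC")])
```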

Predictions for the test and stack blocks are carried out next. We store the predicted probabilities as new columns in the test and stack data frames:

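> # get predicted probabilities from each base model for the Test and Stack
> # blocks and append them as new columns; [,1] keeps the probability of the
> # first class level (rf_PROB holds the rpart tree's probabilities)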
> GC2_Test$NB_PROB <- predict(object=model_NB, GC2_Test[,Exhogenous],
+                              type="prob")[,1]
> GC2_Test$rf_PROB <- predict(object=model_rpart, GC2_Test[,Exhogenous],
+                             type="prob")[,1]
> GC2_Test$glm_PROB <- predict(object=model_glm, GC2_Test[,Exhogenous],
+                                  type="prob")[,1]
> GC2_Stack$NB_PROB <- predict(object=model_NB, GC2_Stack[,Exhogenous],
+                               type="prob")[,1]
> GC2_Stack$rf_PROB <- predict(object=model_rpart, GC2_Stack[,Exhogenous],
+                              type="prob")[,1]
> GC2_Stack$glm_PROB <- predict(object=model_glm, GC2_Stack[,Exhogenous],
+                                   type="prob")[,1]

The ROC curve is an important tool for model assessment: the higher the area under the ROC curve, the better the model. Note that these measures, or any other measure, will not be the same as for the models fitted earlier, since the data has changed:

> # see how each individual model performed on its own
> library(pROC)  # provides roc()
> AUC_NB <- roc(GC2_Test[,Endogenous], GC2_Test$NB_PROB )
> AUC_NB$auc
Area under the curve: 0.7543
> AUC_rf <- roc(GC2_Test[,Endogenous], GC2_Test$rf_PROB )
> AUC_rf$auc
Area under the curve: 0.6777
> AUC_glm <- roc(GC2_Test[,Endogenous], GC2_Test$glm_PROB )
> AUC_glm$auc
Area under the curve: 0.7446
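To make the meaning of these numbers concrete, the area under the ROC curve can also be computed by hand from ranks: it equals the proportion of positive/negative pairs in which the positive case receives the higher score. The following base R sketch is illustrative only; `auc_by_rank` is a hypothetical helper, not a pROC function.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case
auc_by_rank <- function(is_positive, score) {
  r <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

is_positive <- c(TRUE, FALSE, TRUE, FALSE)
score       <- c(0.9, 0.8, 0.3, 0.2)

auc_toy <- auc_by_rank(is_positive, score)
print(auc_toy)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to scores that order the pairs no better than chance, and 1 to a perfect ordering.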

For the test dataset, the areas under the curve for naïve Bayes, classification tree, and logistic regression are 0.7543, 0.6777, and 0.7446, respectively. If combining the predicted values in some way leads to an increase in accuracy, the purpose of the ensemble technique has been accomplished. To this end, we take the predicted probabilities under the three models, which have already been appended to the stack data frame, and treat these three columns as new input vectors alongside the original variables. We then build a naïve Bayes model on the stack data frame; this is an arbitrary choice, and you can try any other model (not necessarily one of these three). The model is then used to predict on the test data, and the AUC is calculated:

> # Stacking it together
> Exhogenous2 <- names(GC2_Stack)[names(GC2_Stack) != Endogenous]
> Stack_Model <- train(GC2_Stack[,Exhogenous2], GC2_Stack[,Endogenous], 
+                      method='naive_bayes', trControl=myControl)
> Stack_Prediction <- predict(object=Stack_Model,GC2_Test[,Exhogenous2],type="prob")[,1]
> Stack_AUC <- roc(GC2_Test[,Endogenous],Stack_Prediction)
> Stack_AUC$auc
Area under the curve: 0.7631

The AUC of the stacked model on the test data, 0.7631, is higher than that of any of the three base models, which is an improvement.
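The whole workflow (three-way split, base models fitted on the train block, a stacker fitted on the stack block, evaluation on the test block) can be condensed into a self-contained base R sketch. Everything here is an assumption made for illustration: the data are synthetic, both base learners and the stacker are logistic regressions rather than the caret models used above, and the stacker sees only the base-model probabilities. No particular ranking of the resulting AUC values is implied.

```r
set.seed(2018)
n <- 1500

# Synthetic binary outcome driven by two predictors (illustrative only)
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 + x2^2 - 1))
dat <- data.frame(y, x1, x2)

block <- sample(c("Train", "Test", "Stack"), n, replace = TRUE,
                prob = c(0.5, 0.25, 0.25))
Train <- dat[block == "Train", ]
Test  <- dat[block == "Test", ]
Stack <- dat[block == "Stack", ]

# Level 0: two deliberately different base learners, fitted on Train only
base_a <- glm(y ~ x1, data = Train, family = binomial)
base_b <- glm(y ~ I(x2^2), data = Train, family = binomial)

# Append the base-model probabilities to a block as new input columns
add_base_probs <- function(d) {
  d$a_PROB <- predict(base_a, d, type = "response")
  d$b_PROB <- predict(base_b, d, type = "response")
  d
}
Stack <- add_base_probs(Stack)
Test  <- add_base_probs(Test)

# Level 1: the stacker is fitted on Stack, which the base learners never saw
stacker <- glm(y ~ a_PROB + b_PROB, data = Stack, family = binomial)
Test$stack_PROB <- predict(stacker, Test, type = "response")

# Rank-based AUC, computed on the held-out Test block
auc <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
aucs <- sapply(Test[c("a_PROB", "b_PROB", "stack_PROB")],
               function(p) auc(Test$y, p))
print(round(aucs, 3))
```

Keeping the stack block separate from the train block means the stacker learns from base-model predictions on data those models were not fitted to.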

A host of questions should arise for the critical thinker. Why should this technique work? Will it lead to improvements in all possible cases? If yes, will simply adding new model predictions lead to further improvements? If no, how does one pick the base models so that we can be reasonably assured of improvements? What are the restrictions on the choice of models? We will provide answers to most of these questions throughout this book. In the next section, we will quickly look at some useful statistical tests that will aid the assessment of model performance.
