[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Statistics for Data Science, written by James D. Miller. This book is a comprehensive primer on the basic concepts of statistics and their application in different data science tasks.[/box]
In this article, we explain the implementation of boosting - a popular technique used to improve the performance of a data model - using the popular R programming language.
We will take a high-level look at a thought-provoking prediction problem drawn from Mastering Predictive Analytics with R, Second Edition, by James D. Miller and Rui Miguel Forte.
Here, an original example of patterns made by radiation on a telescope camera are analyzed in an attempt to predict whether a certain pattern came from gamma rays leaking into the atmosphere or from regular background radiation.
Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The dataset used is the MAGIC Gamma Telescope Data Set, hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
This data consists of 19,020 observations, holding the following list of attributes:
First, various steps need to be performed on our example data.
The data is first loaded into an R data frame object named magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation respectively:
> magic <- read.csv("magic04.data", header = FALSE)
> names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",
"FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS")
> magic$CLASS <- as.factor(ifelse(magic$CLASS =='g', 1, -1))
Next, the data is split into two files: a training data and a test data frame using an 80-20 split:
> library(caret)
> set.seed(33711209)
> magic_sampling_vector <- createDataPartition(magic$CLASS,
p = 0.80, list = FALSE)
> magic_train <- magic[magic_sampling_vector, 1:10]
> magic_train_output <- magic[magic_sampling_vector, 11]
> magic_test <- magic[-magic_sampling_vector, 1:10]
> magic_test_output <- magic[-magic_sampling_vector, 11]
The model used for boosting is a simple multilayer perceptron with a single hidden layer leveraging R's nnet package.
Neural networks, often produce higher accuracy when inputs are normalized, so, in this example, before training any models, this preprocessing is performed:
> magic_pp <- preProcess(magic_train, method = c("center",
"scale"))
> magic_train_pp <- predict(magic_pp, magic_train)
> magic_train_df_pp <- cbind(magic_train_pp,
CLASS = magic_train_output)
> magic_test_pp <- predict(magic_pp, magic_test)
Boosting is designed to work best with weak learners, so a very small number of hidden neurons in the model's hidden layer are used.
Concretely, we will begin with the simplest possible multilayer perceptron that uses a single hidden neuron. To understand the effect of using boosting, a baseline performance is established by training a single neural network (and measuring its performance).
This is to accomplish the following:
> library(nnet)
> n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1)
> n_test_predictions <- predict(n_model, magic_test_pp,
type = "class")
> (n_test_accuracy <- mean(n_test_predictions ==
magic_test_output))
[1] 0.7948988
This establishes that we have a baseline accuracy of around 79.5 percent. Not too bad, but can boost to improve upon this score?
To that end, the function AdaBoostNN()
, which is shown as follows, is used. This function will take input from a data frame, the name of the output variable, the number of single hidden layer neural network models to be built, and finally, the number of hidden units these neural networks will have.
The function will then implement the AdaBoost
algorithm and return a list of models with their corresponding weights.
Here is the function:
AdaBoostNN <- function(training_data, output_column, M,
hidden_units) {
require("nnet")
models <- list()
alphas <- list()
n <- nrow(training_data)
model_formula <- as.formula(paste(output_column, '~ .', sep = ''))
w <- rep((1/n), n)
for (m in 1:M) {
model <- nnet(model_formula, data = training_data,
size = hidden_units, weights = w)
models[[m]] <- model
predictions <- as.numeric(predict(model,
training_data[, -which(names(training_data) ==
output_column)], type = "class"))
errors <- predictions != training_data[, output_column]
error_rate <- sum(w * as.numeric(errors)) / sum(w)
alpha <- 0.5 * log((1 - error_rate) / error_rate)
alphas[[m]] <- alpha
temp_w <- mapply(function(x, y) if (y) { x * exp(alpha) }
else { x * exp(-alpha)}, w, errors)
w <- temp_w / sum(temp_w)
}
return(list(models = models, alphas = unlist(alphas)))
}
The preceding function uses the following logic:
There is now a function able to train our ensemble classifier using AdaBoost, but we also need a function to make the actual predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test dataset.
This function is AdaBoostNN.predict() and it is shown as follows:
AdaBoostNN.predict <- function(ada_model, test_data) {
models <- ada_model$models
alphas <- ada_model$alphas
prediction_matrix <- sapply(models, function (x)
as.numeric(predict(x, test_data, type = "class")))
weighted_predictions <- t(apply(prediction_matrix, 1,
function(x) mapply(function(y, z) y * z, x, alphas)))
final_predictions <- apply(weighted_predictions, 1,
function(x) sign(sum(x)))
return(final_predictions)
}
This function first extracts the models and the model weights (from the list produced by the previous function). A matrix of predictions is created, where each column corresponds to the vector of predictions made by a particular model. Thus, there will be as many columns in this matrix as the models that we used for boosting.
We then multiply the predictions produced by each model with their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and will have its value multiplied by the first model weight α1.
.Lastly, the matrix of weighted observations is reduced into a single vector of observations by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by the function.
As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy:
> ada_model <- AdaBoostNN(magic_train_df_pp, 'CLASS', 10, 1)
> predictions <- AdaBoostNN.predict(ada_model, magic_test_pp,
'CLASS')
> mean(predictions == magic_test_output)
[1] 0.804365
We see in this example, boosting ten models shows a marginal improvement in accuracy, but perhaps training more models might make more of a difference.
From the preceding example, you may conclude that, for the neural networks with one hidden unit, as the number of boosting models increases, we see an improvement in accuracy, but after 100 models, this tapers off and is actually slightly less for 200 models. The improvement over the baseline of a single model is substantial for these networks. When we increase the complexity of our learner by having a hidden layer with three hidden neurons, we get a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that, at this point, our accuracy is being limited by the type of model trained.
If you found this article useful, make sure to check out the book Statistics for Data Science for interesting statistical techniques and their implementation in R.