Chapter 1. Introduction to Ensemble Techniques
Ensemble techniques aggregate the outputs of several models, and they have evolved over the past decade and a half in the area of statistical and machine learning. They form the central theme of this book. Any user of statistical models and machine learning tools will be familiar with the problem of building a model and the vital decision of choosing among potential candidate models. A model's accuracy is certainly not the only relevant criterion; we are also concerned with its complexity, as well as whether or not the overall model makes practical sense.
Choosing a model is a common modeling problem, and various methodologies exist to aid this task. In statistics, we resort to measures such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), and on other fronts, the p-values associated with the variables in the fitted model help with the decision. This process is generally known as model selection. The ridge penalty, the Lasso, and other penalization methods also help with this task. For machine learning models such as neural networks, decision trees, and so on, k-fold cross-validation is useful: the model is built on a part of the data, referred to as training data, and its accuracy is then assessed on the remaining held-out part, or validation data. If the model is sensitive to its complexity, the exercise could be futile.
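As a minimal sketch of these selection tools, the following R code compares two candidate linear models with AIC and BIC and then estimates their out-of-sample error with a hand-rolled k-fold cross-validation. The built-in mtcars data, the two formulas, and the helper cv_rmse are assumptions chosen for illustration; they are not the examples used later in the book.

```r
# Two candidate linear models on the built-in mtcars data (illustrative choice)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Lower AIC/BIC suggests a better trade-off between fit and complexity
criteria <- data.frame(
  model = c("small", "large"),
  AIC   = c(AIC(fit_small), AIC(fit_large)),
  BIC   = c(BIC(fit_small), BIC(fit_large))
)
print(criteria)

# Hypothetical helper: k-fold cross-validated root mean squared error.
# Each fold is held out once; the model is fitted on the remaining folds.
cv_rmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  sq_err <- numeric(0)
  for (i in seq_len(k)) {
    fit  <- lm(formula, data = data[fold != i, ])
    pred <- predict(fit, newdata = data[fold == i, ])
    sq_err <- c(sq_err, (data[fold == i, response] - pred)^2)
  }
  sqrt(mean(sq_err))
}

cv_small <- cv_rmse(mpg ~ wt, mtcars)
cv_large <- cv_rmse(mpg ~ wt + hp + qsec + drat, mtcars)
print(c(small = cv_small, large = cv_large))
```

The information criteria and the cross-validated error need not rank the candidates identically, which is part of the model selection difficulty described above.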
The process of obtaining the best model means that we create a host of other models, which are themselves nearly as efficient as the best model. Moreover, while the best model accurately covers the majority of samples, other models might accurately assess the region of the variable space where it is inaccurate. Consequently, we can see that the final shortlisted model has few advantages over the runner-up. The next models in line are not so poor as to merit outright rejection. This makes it necessary to find a way of taking most of the results already obtained from the models and combining them in a meaningful way. The search for a method of putting together various models is the main objective of ensemble learning. Alternatively, one can say that ensemble learning transforms competing models into collaborating models. Ensemble techniques do not stop at supervised modeling, as they will also be extended to unsupervised learning problems. We will demonstrate an example that justifies the need for this.
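The idea of collaborating models can be sketched with a toy majority vote. The vectors below are hypothetical, hand-constructed predictions (not output from fitted models), arranged so that each model errs on different observations; they only illustrate why combining near-equal models can help.

```r
# Hypothetical true labels for ten observations
truth <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)

# Hand-constructed predictions: each model is wrong on different observations
m1 <- c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)  # wrong on observations 1, 2, 6
m2 <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 0)  # wrong on observations 3, 7, 8
m3 <- c(1, 1, 1, 0, 1, 0, 0, 0, 1, 0)  # wrong on observations 4, 9

# Majority vote: predict 1 when at least two of the three models say 1
votes    <- rowSums(cbind(m1, m2, m3))
ensemble <- as.integer(votes >= 2)

accuracy <- c(
  m1       = mean(m1 == truth),        # 0.7
  m2       = mean(m2 == truth),        # 0.7
  m3       = mean(m3 == truth),        # 0.8
  ensemble = mean(ensemble == truth)   # 1.0
)
print(accuracy)
```

The vote is perfect here only because no two models are wrong on the same observation; with real models the errors overlap and the gain is smaller.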
The implementation of ensemble methods would have been impossible without modern computational power. Statisticians had long proposed techniques that required immense computation, and methods such as permutation tests and the jackknife show what such power makes possible. We will undertake an exercise to learn these later in the chapter, and we will revisit them later on in the book.
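To give a feel for the computation involved, here is a minimal base R sketch of both methods on simulated data. The helper names jackknife_se and perm_test, the sample sizes, and the number of permutations are assumptions made for this illustration; the chapter's own exercise may be set up differently.

```r
set.seed(42)
# Simulated samples (illustrative only)
x <- rnorm(15, mean = 10, sd = 2)
y <- rnorm(15, mean = 12, sd = 2)

# Jackknife: recompute the statistic leaving out one observation at a time,
# then use the spread of the leave-one-out values to estimate a standard error
jackknife_se <- function(v, stat = mean) {
  n   <- length(v)
  loo <- vapply(seq_len(n), function(i) stat(v[-i]), numeric(1))
  sqrt((n - 1) / n * sum((loo - mean(loo))^2))
}
se_jack <- jackknife_se(x)

# Permutation test: reshuffle group membership many times and see how often
# the shuffled difference in means is at least as extreme as the observed one
perm_test <- function(a, b, n_perm = 2000) {
  observed  <- mean(a) - mean(b)
  pooled    <- c(a, b)
  perm_diff <- replicate(n_perm, {
    idx <- sample(length(pooled), length(a))
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  mean(abs(perm_diff) >= abs(observed))  # two-sided Monte Carlo p-value
}
p_value <- perm_test(x, y)

print(c(jackknife_se = se_jack, classical_se = sd(x) / sqrt(length(x))))
print(p_value)
```

For the sample mean, the jackknife standard error coincides with the classical formula, which makes it a convenient sanity check; the method earns its keep with statistics that have no such closed form.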
From a machine learning perspective, supervised and unsupervised are the two main types of learning technique. Supervised learning is the arm of machine learning in which a certain variable is known, and the purpose is to understand this variable through various other variables. Here, we have a target variable. Since learning takes place with respect to the output variable, supervised learning is sometimes referred to as learning with a teacher. Not all target variables are alike, and they often fall under one of the following four types. If the goal is to classify observations into one of k classes (for example, Yes/No, Satisfied/Dissatisfied), then we have a classification problem. Such a variable is referred to as a categorical variable in statistics. Alternatively, the variable of interest might be a continuous variable, which is numeric from a software perspective. Examples include car mileage per liter, a person's income, or a person's age. For such scenarios, the purpose of the machine learning problem is to learn the variable in terms of other associated variables, and then predict it for unknown cases in which only the values of the associated variables are available. We will broadly refer to this class of problem as a regression problem.
In clinical trials, the time to event is often of interest. When an illness is diagnosed, we would ask whether the proposed drug is an improvement on the existing one. While the variable in question here is the length of time between diagnosis and death, clinical trial data poses several other problems. The analysis cannot wait until all the patients have died, and some of the patients may have moved away from the study, making it no longer possible to know their status. Consequently, we have censored data: complete information is not available for some of the study observations. Survival analysis largely deals with such problems, and we will undertake the problem of creating ensemble models here.
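Censoring can be made concrete with the survival package, which is among the packages required for this chapter. The follow-up times and status codes below are hypothetical values invented for illustration, not trial data.

```r
library(survival)

# Hypothetical follow-up times in months.
# status = 1: death was observed; status = 0: the patient was censored
# (still alive at the time of analysis, or lost to follow-up)
time   <- c(5, 8, 12, 12, 20, 24)
status <- c(1, 0, 1, 0, 1, 0)

# Surv() pairs each time with its event indicator;
# censored times are printed with a trailing "+"
surv_obj <- Surv(time, status)
print(surv_obj)

# Kaplan-Meier estimate of the survival curve, which accounts for censoring
km_fit <- survfit(surv_obj ~ 1)
print(summary(km_fit))
```

Simply dropping the censored patients, or treating their last follow-up time as a death, would bias the estimated survival times, which is why a dedicated representation is needed.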
With classification, regression, and survival data, it may be assumed that the instances/observations are independent of each other. This is a very reasonable assumption in that there is a valid reason to believe that patients will respond to a drug independently of other patients, a customer will churn or repay a loan independently of other customers, and so forth. In yet another important class of problems, this assumption is not met, and we are left with observations that depend on each other, as in time series data. An example of time series data is the daily closing stock price of a company. Clearly, the performance of a company's stock on one day can't be independent of its performance on previous days, and thus we need to factor in this dependency.
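The difference between independent and dependent observations can be sketched by comparing lag-1 autocorrelations. Both series below are simulated for illustration: one is independent noise, the other a random walk standing in for a price-like series (an assumption of this sketch, not a model of any real stock).

```r
set.seed(7)
shocks <- rnorm(250)

# Independent observations: each value is unrelated to the previous one
series_indep <- shocks

# Dependent observations: a random walk, where each value is the previous
# value plus a new shock
series_walk <- cumsum(shocks)

# Hypothetical helper: sample autocorrelation at lag 1
lag1 <- function(v) acf(v, lag.max = 1, plot = FALSE)$acf[2]

lag1_indep <- lag1(series_indep)
lag1_walk  <- lag1(series_walk)
print(c(independent = lag1_indep, random_walk = lag1_walk))
```

A lag-1 autocorrelation near zero is what the independence assumption implies; a value close to one, as a random walk typically produces, signals that consecutive observations carry much of the same information.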
In many practical problems, the goal is to understand patterns or find groups of observations, and we don't have a specific variable of interest with regard to which an algorithm needs to be trained. Finding groups or clusters is referred to as unsupervised learning or learning without a teacher. Two main practical problems that arise in finding clusters are that (i) it is generally not known in advance how many clusters are in the population, and (ii) different choices of initial cluster centers lead to different solutions. Thus, we need a solution that is free from, or at least indifferent to, initialization and takes the positives of each useful solution into consideration. This will lead us toward unsupervised ensemble techniques.
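The sensitivity to initial cluster centers can be explored with base R's kmeans. The sketch below uses the built-in iris measurements, three clusters, and ten seeds, all of which are assumptions made for illustration; whether the solutions actually differ depends on the random starts.

```r
# Numeric measurements only; the species label is deliberately ignored
x <- as.matrix(iris[, 1:4])

# Run k-means from ten different random initializations, one start each
fits <- lapply(1:10, function(s) {
  set.seed(s)
  kmeans(x, centers = 3, nstart = 1, iter.max = 50)
})

# Total within-cluster sum of squares per run: differing values indicate
# that the runs converged to different solutions
wss <- vapply(fits, function(f) f$tot.withinss, numeric(1))
print(round(wss, 2))

# Cross-tabulate the cluster assignments of the best and worst runs
best  <- fits[[which.min(wss)]]$cluster
worst <- fits[[which.max(wss)]]$cluster
print(table(best = best, worst = worst))
```

Setting nstart to a larger value makes kmeans keep the best of several starts, which reduces the problem but still discards what the other solutions found.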
The search for the best models, supervised or unsupervised, is often hindered by the presence of outliers. The presence of a single outlier is known to heavily influence the overall fit of linear models, and it is also known to significantly impact even nonlinear models. Outlier detection is a challenge in itself, and a huge body of statistical methods helps in identifying outliers, as does a host of machine learning methods. Of course, ensembles will help here, and we will develop R programs that will help solve the problem of identifying outliers. The resulting methods will be referred to as outlier ensembles.
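The influence of a single outlier on a linear fit can be sketched as follows. The data are simulated, and the size and position of the contaminating value are assumptions chosen to make the effect visible.

```r
set.seed(3)
# Simulated data following a straight line with slope 2 (illustrative only)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 1)
fit_clean <- lm(y ~ x)

# Contaminate a single observation at the edge of the x range
y_out     <- y
y_out[20] <- y[20] + 60
fit_out   <- lm(y_out ~ x)

slope_clean <- unname(coef(fit_clean)["x"])
slope_out   <- unname(coef(fit_out)["x"])
print(c(clean = slope_clean, with_outlier = slope_out))

# Cook's distance is one classical diagnostic for influential observations
cooks      <- cooks.distance(fit_out)
most_infl  <- unname(which.max(cooks))
print(most_infl)
```

One contaminated point out of twenty is enough to shift the estimated slope noticeably, and Cook's distance singles that point out as the most influential.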
At the outset, it is important that the reader becomes familiar with the datasets used in this book. All major datasets will be introduced in the first section. We begin the chapter with a brief introduction to the core statistical/machine learning models and put them into action immediately afterward. It will quickly become apparent that there is no single class of model that performs better than all the others. If any such solution existed, we wouldn't need ensemble techniques.
In this chapter, we will cover:
- Datasets: The core datasets that will be used throughout the book
- Statistical/machine learning models: Important classification models will be explained here
- The right model dilemma: The absence of a dominating model
- An ensemble purview: The need for ensembles
- Complementary statistical tests: Important statistical tests that will be useful for model comparisons will be discussed here
The following R packages will be required for this chapter:
ACSWR
caret
e1071
factoextra
mlbench
NeuralNetTools
perm
pROC
RSADBE
rpart
survival
nnet
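One way to check for these packages before starting is sketched below. The variable names are assumptions, and the installation line is left commented out so that nothing is downloaded unintentionally.

```r
# Packages required for this chapter (package names are case-sensitive)
pkgs <- c("ACSWR", "caret", "e1071", "factoextra", "mlbench",
          "NeuralNetTools", "perm", "pROC", "RSADBE", "rpart",
          "survival", "nnet")

# Which of them are not yet installed in the current library paths?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
print(missing_pkgs)

# Uncomment to install whatever is missing from the configured repository
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```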