Mastering Predictive Analytics with scikit-learn and TensorFlow

Ensemble methods and how they work

Ensemble methods are based on a very simple idea: instead of using a single model to make a prediction, we use many models and then aggregate their predictions in some way. Having different models is like having different points of view, and it has been demonstrated that, by aggregating models that offer different points of view, predictions can be made more accurate. These methods also improve generalization over a single model because they reduce the risk of selecting a single poorly performing classifier:

In the preceding diagram, we can see that each object belongs to one of three classes: triangles, circles, and squares. In this simplified example, we have two features to separate or classify the objects into the different classes. As you can see, we use three different classifiers, and each one represents a different approach and has a different kind of decision boundary.

Ensemble learning combines all of those individual predictions into a single one. The predictions made by combining the three boundaries usually have better properties than the ones produced by the individual models. This is the simple idea behind ensemble methods, also called ensemble learning.
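
To make the idea concrete, the following is a minimal sketch (not the book's code) that combines three classifiers with different kinds of decision boundaries using scikit-learn's VotingClassifier; the Iris dataset and the particular models are assumptions chosen purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import VotingClassifier

    # Illustrative data; any classification dataset would do
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Three models with different kinds of decision boundaries
    ensemble = VotingClassifier(
        estimators=[
            ('logreg', LogisticRegression(max_iter=1000)),  # linear boundary
            ('tree', DecisionTreeClassifier(max_depth=3)),  # axis-aligned splits
            ('knn', KNeighborsClassifier(n_neighbors=5)),   # local, non-linear boundary
        ],
        voting='hard',  # each model casts one vote; the majority wins
    )
    ensemble.fit(X_train, y_train)
    print('ensemble accuracy:', ensemble.score(X_test, y_test))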

The most commonly used ensemble methods are as follows:

  • Bootstrap sampling
  • Bagging
  • Random forests
  • Boosting

Before giving a high-level explanation of these methods, we need to discuss a very important statistical technique known as bootstrap sampling.

Bootstrap sampling

Many ensemble learning methods use a statistical technique called bootstrap sampling. A bootstrap sample of a dataset is another dataset that's obtained by randomly sampling the observations from the original dataset with replacement.

This technique is heavily used in statistics; for example, it is used to estimate standard errors of sample statistics such as the mean or standard deviation.

Let's understand this technique more by taking a look at the following diagram:

Let's assume that we have a population of the numbers 1 to 10, which we consider our original population data. To get a bootstrap sample, we need to draw 10 samples from the original data with replacement. Imagine you have the 10 numbers written on 10 cards in a hat; for the first element of your sample, you take one card at random from the hat and write its number down, then put the card back in the hat, and the process goes on until you have 10 elements. This is your bootstrap sample. As you can see in the preceding example, 9 appears three times in the bootstrap sample.

Resampling with replacement like this does not change the original data; rather, it lets us approximate how a sample statistic would vary across different samples. By computing a statistic, such as the mean, on many bootstrap samples, we can estimate its standard error and get a sense of its variability without collecting new data from the population.
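
As a rough sketch of the card-drawing example above, the following snippet draws one bootstrap sample from the numbers 1 to 10 with NumPy and then uses many such samples to estimate the standard error of the mean (the seed and the number of resamples are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(42)
    population = np.arange(1, 11)   # the "cards in the hat": 1 to 10

    # One bootstrap sample: 10 draws with replacement, so some values can
    # appear more than once while others do not appear at all
    bootstrap_sample = rng.choice(population, size=population.size, replace=True)
    print(bootstrap_sample)

    # Typical use: estimate the standard error of the sample mean by
    # computing the mean of many bootstrap samples
    boot_means = [
        rng.choice(population, size=population.size, replace=True).mean()
        for _ in range(1000)
    ]
    print('estimated standard error of the mean:', np.std(boot_means))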

Bagging

Bagging, also known as bootstrap aggregation, is a general-purpose procedure for reducing the variance of a machine learning model. It is based on the bootstrap sampling technique and is generally used with regression or classification trees, but in principle it can be used with any model.

The following steps are involved in the bagging process:

  1. We choose the number of estimators, or individual models, to use. Let's call this parameter B.
  2. We draw B bootstrap samples from the training set, sampling with replacement.
  3. We fit the machine learning model to each of these bootstrap samples. This way, we get B individual predictors.
  4. We get the ensemble prediction by aggregating all of the individual predictions.

In a regression problem, the most common way to get the ensemble prediction is to average all of the individual predictions.

In a classification problem, the most common way to aggregate the predictions is by majority vote. Majority voting is best explained with an example: suppose we have 100 individual predictors and 80 of them vote for one particular category; we then choose that category as our aggregated prediction. This is what a majority vote means.
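
The following sketch walks through these steps by hand, using decision trees on a synthetic dataset; the data, the value of B, and the choice of decision trees are illustrative assumptions, not the book's code:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Illustrative binary classification data
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    B = 100                          # step 1: number of individual models
    rng = np.random.default_rng(0)
    predictions = []

    for _ in range(B):
        # step 2: a bootstrap sample of the training set (with replacement)
        idx = rng.integers(0, len(X_train), size=len(X_train))
        # step 3: fit one model per bootstrap sample
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_test))

    # step 4: aggregate by majority vote (labels are 0/1, so a mean above
    # 0.5 means most trees voted for class 1)
    votes = np.mean(predictions, axis=0)
    y_pred = (votes > 0.5).astype(int)
    print('bagged accuracy:', (y_pred == y_test).mean())

In practice, scikit-learn's BaggingClassifier wraps these same steps, including the aggregation, behind a single estimator.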

Random forests

This ensemble method was specifically created for regression or classification trees. It is very similar to bagging because each individual tree is trained on a bootstrap sample of the training dataset. The difference from bagging, and what makes the model very powerful, is that when splitting a node of a tree, the split that is picked is the best among a random subset of the features rather than among all of them. So, every individual predictor considers only a random subset of the features at each split. This makes each individual predictor slightly worse and more biased but, because it also reduces the correlation between the individual predictors, the overall ensemble is generally better than its individual members.
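
As an illustrative sketch (again on assumed synthetic data, not the book's code), scikit-learn's RandomForestClassifier exposes this idea directly: each tree is trained on a bootstrap sample, and max_features controls the random subset of features considered at every split:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 trees, each grown on a bootstrap sample; every split considers
    # only a random subset of the features (sqrt(20), roughly 4, here)
    forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                    random_state=0)
    forest.fit(X_train, y_train)
    print('random forest accuracy:', forest.score(X_test, y_test))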

Boosting

Boosting is another approach to ensemble learning. There are many boosting methods, but one of the most successful and popular ones has been the AdaBoost algorithm, which is short for adaptive boosting. The core idea behind this algorithm is that, instead of fitting many individual predictors independently, we fit a sequence of weak learners, where each learner depends on the result of the previous one. In the AdaBoost algorithm, every iteration reweights the training samples based on the results of the previous individual learners.

For example, in classification, the basic idea is that the examples that are misclassified gain weight and the examples that are classified correctly lose weight. So, the next learner in the sequence focuses more on the misclassified examples.
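
A minimal sketch using scikit-learn's AdaBoostClassifier, again on assumed synthetic data chosen only for illustration; by default, the weak learners are shallow decision trees (stumps):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each new weak learner is fitted to training data reweighted toward
    # the examples the previous learners misclassified
    boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
    boosted.fit(X_train, y_train)
    print('AdaBoost accuracy:', boosted.score(X_test, y_test))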
