[box type="info" align="" class="" width=""]We are happy to bring you an elegant guest post on ensemble methods by Benjamin Rogojan, popularly known as The Seattle Data Guy.[/box]
How do data scientists improve an algorithm’s accuracy or the robustness of a model? One tried and tested method is ensemble learning. It is a must-know topic if you claim to be a data scientist or a machine learning engineer, especially if you are preparing for a data science or machine learning interview.
Essentially, ensemble learning stays true to the meaning of the word ‘ensemble’. Just as several people singing at different octaves create one beautiful harmony (each voice filling in the gaps left by the others), ensemble learning uses hundreds to thousands of models of the same algorithm that work together to find the correct classification.
Another way to think about ensemble learning is the fable of the blind men and the elephant. Each blind man in the story seeks to identify the elephant in front of them. However, they work separately and come up with their own conclusions about the animal. Had they worked in unison, they might have been able to eventually figure out what they were looking at. Similarly, ensemble learning utilizes the workings of different algorithms and combines them for a successful and optimal classification.
Ensemble methods such as Boosting and Bagging have led to an increased robustness of statistical models with decreased variance.
Before we explain the various ensemble methods, let us take a look at the concept they share: bootstrapping.
Bootstrapping is a step that many data scientists gloss over when explaining these methods. However, an understanding of bootstrapping is essential, as both ensemble methods, Boosting and Bagging, are based on it.
Figure 1: Bootstrapping
In machine learning terms, the bootstrap method refers to random sampling with replacement; each sample drawn this way is referred to as a resample. This allows the model or algorithm to get a better understanding of the various biases, variances, and features that exist in the data. Because it is drawn with replacement, a resample can contain characteristics that differ from those of the original sample, which in turn affects the overall mean, standard deviation, and other descriptive metrics of the data set. Ultimately, this leads to the development of more robust models.
The above diagram depicts how each resample contains different, non-identical pieces of the original data.
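To make this concrete, here is a minimal sketch in Python (the data and numbers are made up purely for illustration) showing how resamples drawn with replacement end up with slightly different descriptive statistics:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)  # a hypothetical original sample

# Draw a few bootstrap resamples: sampling with replacement,
# each the same size as the original data set.
for i in range(3):
    resample = rng.choice(data, size=data.size, replace=True)
    print(f"resample {i}: mean={resample.mean():.2f}, std={resample.std():.2f}")
```

Each resample reports a slightly different mean and standard deviation, and it is exactly this variation that bagging and boosting put to work.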
Bootstrapping is also great for small data sets that may have a tendency to overfit. In fact, we recommended this approach to one company that was concerned because its data sets were far from “Big Data”. Bootstrapping can be a solution in this case because algorithms that utilize it are more robust and can handle new data sets, depending on the methodology chosen (boosting or bagging).
The bootstrap method can also test the stability of a solution. By fitting multiple models to multiple sample data sets, it can increase robustness. In certain cases, one sample data set may have a larger mean than another, or a different standard deviation. Variation like this could break a model that was overfit and never tested on data sets with different characteristics.
One of the many reasons bootstrapping has become so common is because of the increase in computing power. This allows multiple permutations to be done with different resamples.
Let us now move on to the most prominent ensemble methods: Bagging and Boosting.
Bagging actually refers to Bootstrap Aggregating. Most papers or posts that explain bagging algorithms are bound to refer to Leo Breiman’s work, a paper published in 1996 called “Bagging Predictors”.
In the paper, Breiman describes bagging as:
“Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.”
Bagging helps reduce variance from models that are accurate only on the data they were trained on. This problem is also known as overfitting.
Overfitting happens when a function fits the data too well. Typically, this is because the fitted function is complicated enough to account for every data point, including the outliers.
Figure 2: Overfitting
Another example of an algorithm that can overfit easily is a decision tree. Models developed using decision trees rely on very simple heuristics: a decision tree is composed of a set of if-else statements applied in a specific order. Thus, if the data set is changed to a new one whose underlying features have some bias or a different spread compared to the previous set, the model will fail to be as accurate as before, because the new data will not fit the model well.
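To see this in action, here is a small sketch using scikit-learn (the synthetic data set and parameters are assumptions purely for illustration): an unconstrained decision tree scores almost perfectly on its training data but noticeably worse on data it has never seen.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic, slightly noisy classification problem.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With no depth limit, the tree memorizes the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy:", tree.score(X_test, y_test))     # noticeably lower
```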
Bagging gets around the overfitting problem by creating its own variance in the data. It does this by sampling with replacement while testing multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple samples that will most likely be made up of data with different attributes (median, average, etc.).
Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the “Aggregating” of the “Bootstrap Aggregating” comes into play. As in the figure shown below, each hypothesis has the same weight as all the others. (When we later discuss boosting, this is one of the places the two methodologies differ.)
Figure 3: Bagging
Essentially, all these models run at the same time and vote on which hypothesis is the most accurate. This helps to decrease variance, i.e., reduce the overfit.
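Here is a brief sketch of bagging in practice using scikit-learn’s `BaggingClassifier`, which by default fits decision trees on bootstrap resamples and aggregates their predictions by majority vote (the synthetic data set is again only a stand-in for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on its own bootstrap resample;
# their individual votes are aggregated into one prediction.
bagged = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bagged.fit(X_train, y_train)
print("bagged test accuracy:", bagged.score(X_test, y_test))
```

Compared to a single overfit tree, the aggregated vote typically generalizes better because the individual trees’ errors average out.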
Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Unlike bagging, which has each model run independently and then aggregates the outputs at the end without preference for any model, boosting is all about “teamwork”. Each model that runs dictates which features the next model will focus on.
Boosting also requires bootstrapping. However, there is another difference here. Unlike bagging, boosting weights each sample of data. This means some samples will be run more often than others.
Figure 4: Boosting
When boosting runs each model, it tracks which data samples are classified successfully and which are not. The samples that are misclassified most often are given heavier weights, as they are considered to contain more complexity and to require more iterations to properly train the model.
During the actual classification stage, boosting tracks each model’s error rate to ensure that better models are given better weights. That way, when the “voting” occurs, as in bagging, the models with better outcomes have a stronger pull on the final output.
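For a concrete sketch, scikit-learn’s `AdaBoostClassifier` implements this kind of reweighting scheme: each new model focuses on the samples its predecessors misclassified, and models with lower error rates get a stronger say in the final vote (the data set here is, once more, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Models are trained sequentially; misclassified samples are reweighted
# so later models focus on them, and better models get a stronger vote.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)
print("boosted test accuracy:", boosted.score(X_test, y_test))
```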
Ensemble methods generally outperform a single model. This is why many Kaggle winners have utilized ensemble methodologies. Another important ensemble methodology, not discussed here, is stacking.
Boosting and bagging are both great techniques to decrease variance. However, they won’t fix every problem, and they themselves have their own issues. There are different reasons why you would use one over the other.
Bagging is great for decreasing variance when a model is overfit. However, boosting is likely to be the better pick of the two methods, as it is also great for decreasing bias in an underfit model. On the other hand, boosting can suffer from performance issues of its own, partly because its models must be trained sequentially rather than in parallel.
This is where experience and subject matter expertise come in! It may seem easy to jump on the first model that works, but it is important to analyze the algorithm and all the features it selects. For instance, a decision tree that sets specific leaves shouldn’t be implemented if it can’t be supported with other data points and visuals.
It is not just about trying AdaBoost or random forests on various data sets. The choice of the final algorithm should be driven by the results it is getting and the support that can be provided for them.
[author title="About the Author"]
Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission, and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems, both solo and with a company called Acheron Analytics. He has experience working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/author]