Ensembles
As the name suggests, ensemble methods use multiple learning algorithms to obtain a more accurate model in terms of prediction accuracy. Usually these techniques require more computing power and make the model more complex, which makes it difficult to interpret. Let us discuss the various types of ensemble techniques available on Spark.
Random forests
A random forest is an ensemble technique for the decision trees. Before we get to random forests, let us see how it has evolved. We know that decision trees usually have high variance issues and tend to overfit the model. To address this, a concept called bagging (also known as bootstrap aggregating) was introduced. For the decision trees, the idea was to take multiple training sets (bootstrapped training sets) from the dataset and create separate decision trees out of those, and then average them out for regression trees. For the classification trees, we can take the majority vote or the most commonly occurring class from all the trees...