Exploring random forests with scikit-learn
Now that we're near the end of this chapter, I would like to briefly discuss random forests. Random forests are not strictly a new ensemble algorithm; rather, they are an extension of tree-based bagging. However, they differ from plain bagged decision trees in one important way.
In Chapter 10, Statistical Techniques for Tree-Based Methods, we discussed how splitting the nodes of a decision tree is a greedy approach: it doesn't always yield the best possible tree, and it is easy to overfit without proper penalization. The random forest algorithm addresses this by bootstrapping not only the samples but also randomly subsampling the features at each split. Let's take our stroke risk dataset as an example. If heavy weight is the optimal feature to split on, a greedy learner will always choose it at the root node, which rules out the 80% of all possible trees that split the root on one of the other features. The random forest algorithm instead samples a subset of the features at every splitting decision point and picks the best feature from that subset.
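To make this concrete, here is a minimal scikit-learn sketch comparing plain bagged trees with a random forest. Since the stroke risk dataset itself isn't reproduced here, make_classification stands in as a hypothetical five-feature placeholder; the key parameter illustrating the feature-subsampling idea is max_features.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the chapter's stroke risk dataset:
# 1,000 samples with five features.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, random_state=42
)

# Plain bagging: each tree bootstraps the rows but may consider all five
# features at every split, so most trees split the root on the same
# dominant feature. (BaggingClassifier's default base estimator is a
# decision tree.)
bagged_trees = BaggingClassifier(n_estimators=100, random_state=42)

# Random forest: at each split, only a random subset of the features
# (here 2 of the 5, via max_features) competes for the split, which
# decorrelates the trees in the ensemble.
forest = RandomForestClassifier(
    n_estimators=100, max_features=2, random_state=42
)

for name, model in [("bagged trees", bagged_trees), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")

By default, RandomForestClassifier considers the square root of the number of features at each split (max_features="sqrt"); setting it explicitly, as above, simply makes the contrast with plain bagging visible.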