In this chapter, we covered the basic theoretical concepts for understanding tree ensembles and showed how to train and evaluate these models on Amazon EMR with Apache Spark, as well as with the SageMaker XGBoost service. Decision tree ensembles are among the most popular classifiers, for many reasons:
- They are able to find complex patterns with relatively short training times and modest resources. The XGBoost library is known as the most popular classifier among Kaggle competition winners (competitions held to find the best model for an open dataset).
- It's possible to understand why the classifier predicts a given value. Following the decision tree paths or simply inspecting the feature importances are quick ways to understand the rationale behind the decisions made by tree ensembles.
- Implementations of distributed training are available through Apache...