Tree ensembles
Non-parametric learning algorithms such as decision trees make no assumptions about the form of the function being learned and instead fit a model directly to the data at hand. However, individual decision trees run the risk of overfitting the training data. Tree ensemble methods combine several decision trees into a single, better-performing predictive model, retaining the benefits of decision trees while reducing the risk of overfitting. Popular tree ensemble methods include random forests and gradient boosted trees. We will explore how these ensemble methods can be used to build regression and classification models using Spark MLlib.
Regression using random forests
Random forests train multiple decision trees, each on a random sample of the rows and features, and average their predictions to produce a more accurate model and reduce the risk of overfitting. Random forests can be used to train regression models, as shown in the following code example:
from pyspark.ml.regression import RandomForestRegressor...
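Expanded into a runnable form, a minimal end-to-end sketch might look like the following. The toy DataFrame, the column names f1, f2, and label, and the numTrees setting are illustrative assumptions chosen for this sketch:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.appName("rf_regression_example").getOrCreate()

# Hypothetical toy dataset: two numeric features and a continuous label.
df = spark.createDataFrame(
    [(1.0, 5.0, 10.2),
     (2.0, 4.0, 11.9),
     (3.0, 6.0, 14.1),
     (4.0, 3.0, 14.8),
     (5.0, 7.0, 18.0)],
    ["f1", "f2", "label"],
)

# Spark ML estimators expect all features combined into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train_df = assembler.transform(df)

# Train a random forest of 20 trees on the assembled features.
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=20)
model = rf.fit(train_df)

# Apply the fitted model; predictions appear in a new "prediction" column.
model.transform(train_df).select("f1", "f2", "label", "prediction").show()

For regression, the forest's prediction is the average of the individual trees' outputs, so increasing numTrees generally reduces variance at the cost of longer training time.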