Random Forest is an ensemble learning technique used for solving supervised learning tasks, such as classification and regression. An advantageous feature of Random Forest is that it mitigates the overfitting that individual decision trees suffer from on their training dataset. A forest in Random Forest usually consists of tens to hundreds of trees, and sometimes thousands. These trees are trained on different samples of the same training set.
More technically, an individual tree that grows very deep tends to learn highly irregular patterns, including noise. This creates overfitting problems on the training set. Such a tree has low bias but high variance, and the high variance makes the classifier a low performer on unseen data even if your dataset quality is good in terms of the features presented. A Random Forest, on the other hand, averages multiple decision trees together with the goal of reducing the variance and making the predictions more consistent.
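The variance-reduction argument above can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "tree" is modelled as the true value plus independent Gaussian noise, whereas real trees in a forest are correlated, so the reduction in practice is smaller. The object and method names are hypothetical.

```scala
import scala.util.Random

object AveragingSketch {
  // Population variance of a sequence of numbers.
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  // One simulated "tree": the true value plus independent Gaussian noise (an assumption).
  def noisyPrediction(truth: Double, rng: Random): Double =
    truth + rng.nextGaussian()

  // Repeats the experiment `trials` times and returns
  // (variance of a single tree, variance of the average of `numTrees` trees).
  def compare(numTrees: Int, trials: Int, seed: Long): (Double, Double) = {
    val rng = new Random(seed)
    val truth = 10.0
    val single = Seq.fill(trials)(noisyPrediction(truth, rng))
    val averaged = Seq.fill(trials) {
      Seq.fill(numTrees)(noisyPrediction(truth, rng)).sum / numTrees
    }
    (variance(single), variance(averaged))
  }

  def main(args: Array[String]): Unit = {
    val (singleVar, forestVar) = compare(numTrees = 100, trials = 2000, seed = 42L)
    println(f"single tree variance: $singleVar%.4f")
    println(f"average of 100 trees: $forestVar%.4f")
  }
}
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of trees, which is the effect the forest relies on.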
GBT or Random Forest?
Although both GBT and Random Forest are ensembles of trees, their training processes are different. There are several practical trade-offs, which often pose the dilemma of which one to choose. However, Random Forest would be the winner in most cases. Here are some justifications:
- GBTs train one tree at a time, but Random Forest can train multiple trees in parallel. So the training time is lower for RF. However, in some special cases, training and using a smaller number of trees with GBTs is easier and quicker.
- RFs are less prone to overfitting in most cases: training more trees in a Random Forest reduces the likelihood of overfitting, whereas training more trees with GBTs increases it. In other words, Random Forest reduces variance with more trees, but GBTs reduce bias with more trees.
- Finally, Random Forest can be easier to tune since performance improves monotonically with the number of trees, whereas the performance of GBTs can start to decrease if the number of trees grows too large.
However, averaging many trees slightly increases bias and makes the results harder to interpret than those of a single tree. But eventually, the performance of the final model usually improves considerably. While using the Random Forest as a classifier, there are some parameter settings:
- If the number of trees is 1, then no bootstrapping is used at all; however, if the number of trees is > 1, then bootstrapping is used. The featureSubsetStrategy parameter sets the number of features considered as split candidates at each node; the supported values are auto, all, sqrt, log2, and onethird.
- The supported numerical values are (0.0-1.0] and [1-n]. However, if featureSubsetStrategy is chosen as auto, the algorithm chooses the feature subset strategy automatically.
- If numTrees == 1, the featureSubsetStrategy is set to all. However, if numTrees > 1 (that is, a forest), the featureSubsetStrategy is set to sqrt for classification and to onethird for regression.
- Moreover, if a real value n in the range (0, 1.0] is set, n*number_of_features features will be used. However, if an integer value n in the range (1, the number of features) is set, only n features are used.
- The parameter categoricalFeaturesInfo is a map used for storing the arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: (0, 1,...,k-1).
- The impurity criterion is used for information gain calculation. The supported values are gini (or entropy) for classification and variance for regression.
- The maxDepth is the maximum depth of the tree (for example, depth 0 means one leaf node, depth 1 means one internal node plus two leaf nodes).
- The maxBins signifies the maximum number of bins used for splitting the features, where the suggested value is 100 to get better results.
- Finally, the random seed is used for bootstrapping and choosing feature subsets; fixing it makes the results reproducible from run to run.
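The featureSubsetStrategy rules in the list above can be sketched in plain Scala. This is a hypothetical helper written for illustration, not Spark's own code: the function name, the choice to round up, and the treatment of text that parses as an integer as an absolute feature count are all assumptions.

```scala
import scala.util.Try

object FeatureSubsetSketch {
  // Returns how many features would be considered at each split.
  // `strategy` is one of auto, all, sqrt, log2, onethird, or a number given as text.
  def featuresPerNode(strategy: String,
                      numFeatures: Int,
                      numTrees: Int,
                      isClassification: Boolean): Int = {
    def ceil(x: Double): Int = math.ceil(x).toInt // rounding up is an assumption

    strategy match {
      case "auto" =>
        if (numTrees == 1) numFeatures                           // single tree: all features
        else if (isClassification) ceil(math.sqrt(numFeatures))  // forest, classification
        else ceil(numFeatures / 3.0)                             // forest, regression
      case "all"      => numFeatures
      case "sqrt"     => ceil(math.sqrt(numFeatures))
      case "log2"     => math.max(1, ceil(math.log(numFeatures) / math.log(2)))
      case "onethird" => ceil(numFeatures / 3.0)
      case other =>
        Try(other.toInt).toOption match {
          // Integer text: an absolute number of features, capped at numFeatures.
          case Some(n) if n >= 1 => math.min(n, numFeatures)
          // Otherwise it must be a fraction in (0, 1.0].
          case _ =>
            val fraction = other.toDouble
            require(fraction > 0.0 && fraction <= 1.0, s"unsupported strategy: $other")
            ceil(fraction * numFeatures)
        }
    }
  }

  def main(args: Array[String]): Unit = {
    // A regression forest with 130 features and 10 trees, using the auto strategy.
    println(featuresPerNode("auto", numFeatures = 130, numTrees = 10, isClassification = false))
  }
}
```

With 130 features, for example, the auto setting would use all features for a single tree but only about a third of them per split for a regression forest.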
As already mentioned, since Random Forest is fast and scalable enough for large-scale datasets, Spark is a suitable technology for implementing RF at scale. However, if the proximities are calculated, storage requirements also grow quickly with the number of cases.
Well, that's enough about RF. Now it's time to get our hands dirty, so let's get started. We begin with importing required libraries:
import org.apache.spark.ml.regression.{RandomForestRegressor, RandomForestRegressionModel}
import org.apache.spark.ml.{ Pipeline, PipelineModel }
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.evaluation.RegressionMetrics
Then we create an active Spark session and import implicits:
val spark = SparkSessionCreate.createSession()
import spark.implicits._
Then we define some hyperparameters, such as the number of trees, the maximum number of bins, the number of folds for cross-validation, and the maximum tree depth, as follows (MaxIter is kept only for the summary printed later; RandomForestRegressor itself has no maximum-iterations parameter):
val NumTrees = Seq(5,10,15)
val MaxBins = Seq(23,27,30)
val numFolds = 10
val MaxIter: Seq[Int] = Seq(20)
val MaxDepth: Seq[Int] = Seq(20)
Note that for a Random Forest, which is built from decision trees, maxBins must be at least as large as the number of values in each categorical feature. In our dataset, we have 110 categorical features with up to 23 distinct values. Considering this, we have to set MaxBins to at least 23. Nevertheless, feel free to play with the previous parameters too. Alright, now it's time to create an RF estimator:
val model = new RandomForestRegressor().setFeaturesCol("features").setLabelCol("label")
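The parameters discussed earlier can also be set explicitly on the estimator instead of being left to the defaults or to the grid search. The following is a minimal sketch, assuming Spark ML is on the classpath; the object name, the explicitModel name, and all values are illustrative and not tuned for this dataset.

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

object ExplicitRandomForestParams {
  // Illustrative values only; they are not tuned for the dataset in this chapter.
  val explicitModel: RandomForestRegressor = new RandomForestRegressor()
    .setFeaturesCol("features")
    .setLabelCol("label")
    .setNumTrees(10)                  // size of the forest
    .setFeatureSubsetStrategy("auto") // let the algorithm pick the subset strategy
    .setImpurity("variance")          // impurity criterion for regression
    .setMaxDepth(20)                  // maximum depth of each tree
    .setMaxBins(30)                   // must cover the largest categorical arity
    .setSeed(12345L)                  // fixed seed for reproducible runs
}
```

Any parameter that is also listed in the parameter grid is overridden by the grid values during cross-validation.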
Now let's build a pipeline estimator by chaining the transformers and the RF estimator:
println("Building ML pipeline")
val pipeline = new Pipeline().setStages((Preproessing.stringIndexerStages :+ Preproessing.assembler) :+ model)
Before we start performing the cross-validation, we need to have a paramgrid. So let's start creating the paramgrid by specifying the number of trees, the maximum tree depth, and the maximum number of bins, as follows:
val paramGrid = new ParamGridBuilder()
.addGrid(model.numTrees, NumTrees)
.addGrid(model.maxDepth, MaxDepth)
.addGrid(model.maxBins, MaxBins)
.build()
Now, for better and more stable performance, let's prepare the K-fold cross-validation and grid search as a part of model tuning. As you can probably guess, I am going to perform 10-fold cross-validation. Feel free to adjust the number of folds based on your settings and dataset:
println("Preparing K-fold Cross Validation and Grid Search: Model tuning")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(new RegressionEvaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(numFolds)
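Before starting the training, it can help to estimate how much work the grid search implies. The following plain Scala sketch enumerates the same parameter values used above and counts the resulting pipeline fits; the object and value names are hypothetical.

```scala
object GridCostSketch {
  val numTrees = Seq(5, 10, 15)
  val maxDepth = Seq(20)
  val maxBins = Seq(23, 27, 30)
  val numFolds = 10

  // Every combination of the three parameter lists, as a grid search enumerates them.
  val combinations: Seq[(Int, Int, Int)] =
    for (t <- numTrees; d <- maxDepth; b <- maxBins) yield (t, d, b)

  // Each combination is fitted once per fold.
  val fitsDuringSearch: Int = combinations.size * numFolds

  def main(args: Array[String]): Unit = {
    println(s"${combinations.size} combinations x $numFolds folds = $fitsDuringSearch pipeline fits")
  }
}
```

On top of these fits, the cross-validator refits the best parameter combination once on the full training data, so reducing the number of folds or grid values is the quickest way to shorten training.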
Fantastic, we have created the cross-validation estimator. Now it's time to train the RF model:
println("Training model with Random Forest algorithm")
val cvModel = cv.fit(Preproessing.trainingData)
Now that we have the fitted model, it is capable of making predictions. So let's start evaluating the model on the training and validation sets by calculating RMSE, MSE, MAE, R-squared, and explained variance:
println("Evaluating model on train and validation set and calculating RMSE")
// RegressionMetrics expects (prediction, observation) pairs. MSE, RMSE, and MAE do not depend on the order, but R-squared and explained variance do.
val trainPredictionsAndLabels = cvModel.transform(Preproessing.trainingData).select("label", "prediction").map { case Row(label: Double, prediction: Double) => (prediction, label) }.rdd
val validPredictionsAndLabels = cvModel.transform(Preproessing.validationData).select("label", "prediction").map { case Row(label: Double, prediction: Double) => (prediction, label) }.rdd
val trainRegressionMetrics = new RegressionMetrics(trainPredictionsAndLabels)
val validRegressionMetrics = new RegressionMetrics(validPredictionsAndLabels)
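To make the reported metrics concrete, here is a minimal plain Scala sketch of the standard definitions of MSE, RMSE, MAE, and R-squared on a handful of (prediction, label) pairs. It is not Spark's implementation; the object name and the toy numbers are made up for illustration.

```scala
object RegressionMetricsSketch {
  // Each pair is (prediction, label).
  type Pairs = Seq[(Double, Double)]

  // Mean squared error.
  def mse(pairs: Pairs): Double =
    pairs.map { case (p, y) => (p - y) * (p - y) }.sum / pairs.size

  // Root mean squared error.
  def rmse(pairs: Pairs): Double = math.sqrt(mse(pairs))

  // Mean absolute error.
  def mae(pairs: Pairs): Double =
    pairs.map { case (p, y) => math.abs(p - y) }.sum / pairs.size

  // R-squared = 1 - (residual sum of squares / total sum of squares around the label mean).
  def r2(pairs: Pairs): Double = {
    val labels = pairs.map(_._2)
    val meanLabel = labels.sum / labels.size
    val ssRes = pairs.map { case (p, y) => (y - p) * (y - p) }.sum
    val ssTot = labels.map(y => (y - meanLabel) * (y - meanLabel)).sum
    1.0 - ssRes / ssTot
  }

  def main(args: Array[String]): Unit = {
    val pairs: Pairs = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))
    println(s"MSE = ${mse(pairs)}, RMSE = ${rmse(pairs)}, MAE = ${mae(pairs)}, R2 = ${r2(pairs)}")
  }
}
```

Swapping the two elements of each pair leaves MSE, RMSE, and MAE unchanged but changes R-squared, because the total sum of squares is computed around the mean of the second element.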
Great! We have managed to compute the predictions on the training and the validation set. Let's hunt for the best model:
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
As already stated, by using RF, it is possible to measure the feature importance so that at a later stage, we can decide which features should be used and which ones are to be dropped from the DataFrame. Let's find the feature importance from the best model we just created for all features in ascending order, as follows:
val featureImportances = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel].featureImportances.toArray
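// Importance values sorted in ascending order. Note that sorting the values alone drops their alignment with the feature names.
val FI_to_List_sorted = featureImportances.toList.sorted.toArray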
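When the goal is to decide which features to drop, each importance value has to stay attached to its feature name while sorting. The following plain Scala sketch shows one way to do that; the object name, the feature names, and the importance values are hypothetical.

```scala
object FeatureImportanceSketch {
  // Pairs each name with its importance, then sorts the pairs by importance (ascending).
  def rank(names: Seq[String], importances: Seq[Double]): Seq[(String, Double)] = {
    require(names.size == importances.size, "one importance value per feature is expected")
    names.zip(importances).sortBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    val names = Seq("cat1", "cat2", "cont1") // hypothetical feature names
    val importances = Seq(0.5, 0.1, 0.4)     // hypothetical importance values
    rank(names, importances).foreach { case (name, value) => println(s"$name = $value") }
  }
}
```

Zipping first and sorting afterwards means the least important features appear at the top of the list together with their own names.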
Once we have the best fitted and cross-validated model, we can expect a good prediction accuracy. Now let's observe the results on the train and the validation set:
val output = "\n=====================================================================\n" +
s"Param trainSample: ${Preproessing.trainSample}\n" +
s"Param testSample: ${Preproessing.testSample}\n" +
s"TrainingData count: ${Preproessing.trainingData.count}\n" +
s"ValidationData count: ${Preproessing.validationData.count}\n" +
s"TestData count: ${Preproessing.testData.count}\n" +
"=====================================================================\n" +
s"Param maxIter = ${MaxIter.mkString(",")}\n" +
s"Param maxDepth = ${MaxDepth.mkString(",")}\n" +
s"Param numFolds = ${numFolds}\n" +
"=====================================================================\n" +
s"Training data MSE = ${trainRegressionMetrics.meanSquaredError}\n" +
s"Training data RMSE = ${trainRegressionMetrics.rootMeanSquaredError}\n" +
s"Training data R-squared = ${trainRegressionMetrics.r2}\n" +
s"Training data MAE = ${trainRegressionMetrics.meanAbsoluteError}\n" +
s"Training data Explained variance = ${trainRegressionMetrics.explainedVariance}\n" +
"=====================================================================\n" +
s"Validation data MSE = ${validRegressionMetrics.meanSquaredError}\n" +
s"Validation data RMSE = ${validRegressionMetrics.rootMeanSquaredError}\n" +
s"Validation data R-squared = ${validRegressionMetrics.r2}\n" +
s"Validation data MAE = ${validRegressionMetrics.meanAbsoluteError}\n" +
s"Validation data Explained variance = ${validRegressionMetrics.explainedVariance}\n" +
"=====================================================================\n" +
s"CV params explained: ${cvModel.explainParams}\n" +
s"RF params explained: ${bestModel.stages.last.asInstanceOf[RandomForestRegressionModel].explainParams}\n" +
// Pair each feature name with its own importance before sorting, so names and values stay aligned
s"RF features importances:\n ${Preproessing.featureCols.zip(featureImportances).sortBy(_._2).map(t => s"\t${t._1} = ${t._2}").mkString("\n")}\n" +
"=====================================================================\n"
Now, we print the preceding results as follows:
println(output)
>>>
Param trainSample: 1.0
Param testSample: 1.0
TrainingData count: 141194
ValidationData count: 47124
TestData count: 125546
Param maxIter = 20
Param maxDepth = 20
Param numFolds = 10
Training data MSE = 1340574.3409399686
Training data RMSE = 1157.8317412042081
Training data R-squared = 0.7642745310548124
Training data MAE = 809.5917285994619
Training data Explained variance = 8337897.224852404
Validation data MSE = 4312608.024875177
Validation data RMSE = 2076.6819749001475
Validation data R-squared = 0.1369507149716651
Validation data MAE = 1273.0714382935894
Validation data Explained variance = 8737233.110450774
So our predictive model shows an MAE of about 809.59 and 1273.07 for the training and validation set respectively. The last part of the output, the feature importances, is important for understanding which features matter (the preceding list is abridged to save space but you should receive the full list).
I have drawn both the categorical and continuous features, and their respective importance in Python, so I will not show the code here but only the graph. Let's see the categorical features showing feature importance as well as the corresponding feature number:
Figure 11: Random Forest categorical feature importance
From the preceding graph, it is clear that categorical features cat20, cat64, cat47, and cat69 are less important. Therefore, it would make sense to drop these features and retrain the Random Forest model to see whether performance improves.
Now let's see how the continuous features are correlated and contribute to the loss column. From the following figure, we can see that all continuous features are positively correlated with the loss column. This also suggests that these continuous features are not that important compared to the categorical ones we have seen in the preceding figure:
Figure 12: Correlations between the continuous features and the label
What we can learn from these two analyses is that we can naively drop some unimportant columns and train the Random Forest model to observe if there is any reduction in the MAE value for both the training and validation set. Finally, let's make a prediction on the test set:
println("Run prediction on the test set")
cvModel.transform(Preproessing.testData)
.select("id", "prediction")
.withColumnRenamed("prediction", "loss")
.coalesce(1) // to get all the predictions in a single csv file
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("output/result_RF.csv")
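Also, similar to LR, you can stop the Spark session by invoking the stop() method. The generated result_RF.csv output should now contain the predicted loss for each ID, that is, for each claim. Note that Spark writes result_RF.csv as a directory containing a single part file rather than as one plain file.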