As described in Chapter 1, Analyzing Insurance Severity Claim, Random Forest is an ensemble technique that takes a subset of observations and a subset of variables to build decision trees—that is, an ensemble of DTs. More technically, it builds several decision trees and integrates them together to get a more accurate and stable prediction.
Figure 7: Random forest and its assembling technique explained
This is a direct consequence, since by maximum voting from a panel of independent juries, we get the final prediction better than the best jury (see the preceding figure). Now that we already know the working principle of RF, let's start using the Spark-based implementation of RF. Let's start by importing the required packages and libraries:
import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache...