As in the previous chapter, we need to prepare the training and validation data. In this case, we'll reuse the Spark API to split the data:
val trainValidSplits = inputData.randomSplit(Array(0.8, 0.2))
val (trainData, validData) = (trainValidSplits(0), trainValidSplits(1))
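Note that randomSplit is non-deterministic from run to run. If you want reproducible grid-search comparisons, you can pass an explicit seed and cache both splits, since every candidate model will re-scan the training data. A small alternative sketch of the split above (the seed value 42 is an arbitrary choice):

```scala
// Seeded split for reproducibility; cache both sides because the
// grid search below re-reads the training data for every candidate.
val Array(trainData, validData) =
  inputData.randomSplit(Array(0.8, 0.2), seed = 42L)
trainData.cache()
validData.cache()
```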
Now, let's perform a grid search using a simple decision tree and a few hyperparameters:
val gridSearch =
  for (
    hpImpurity <- Array("entropy", "gini");
    hpDepth <- Array(5, 20);
    hpBins <- Array(10, 50))
  yield {
    println(s"Building model with: impurity=${hpImpurity}, depth=${hpDepth}, bins=${hpBins}")
    val model = new DecisionTreeClassifier()
      .setFeaturesCol("reviewVector")
      .setLabelCol("label")
      .setImpurity(hpImpurity)
      .setMaxDepth(hpDepth)
      .setMaxBins(hpBins)
      .fit(trainData)
    // Keep the hyperparameters alongside the fitted model so we can
    // tell the candidates apart when comparing them later
    (hpImpurity, hpDepth, hpBins, model)
  }
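To pick a winner from the grid, each fitted model can be scored on the held-out validation split. A minimal sketch using Spark's MulticlassClassificationEvaluator, assuming each grid-search iteration yields the hyperparameters together with the fitted model; the choice of accuracy as the metric is an assumption, not something fixed by the text:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Score each candidate on the validation split; "accuracy" is one
// possible metric -- "f1" is often a better choice for skewed labels
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val scored = for ((impurity, depth, bins, model) <- gridSearch) yield {
  val accuracy = evaluator.evaluate(model.transform(validData))
  println(s"impurity=$impurity, depth=$depth, bins=$bins -> accuracy=$accuracy")
  (accuracy, model)
}

// Keep the model with the highest validation accuracy
val (bestScore, bestModel) = scored.maxBy(_._1)
```

Scoring on validData rather than trainData matters here: deeper trees will almost always fit the training data better, and only the held-out split reveals which hyperparameter combination actually generalizes.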