Training a random forest model on Spark
In this section, we will explore and preprocess the historical taxi trip data, and then train and evaluate a random forest model for taxi demand prediction on Spark. These steps are covered in the following subsections:
- Exploring the seasonalities via line plots and auto-correlation plots
- Preprocessing the data
- Training and testing the Spark random forest model
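To make the first step more concrete, the following is a minimal Python sketch of what an auto-correlation plot computes: the correlation between a series and a copy of itself shifted by a given lag. It is an illustration only and not part of the KNIME workflow. The function name autocorrelation and the synthetic hourly_counts series with a 24-hour cycle are assumptions made for this example and do not come from the taxi dataset.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Numerator: co-deviation between the series and its lagged copy
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


# Synthetic hourly "trip counts" over two weeks with a 24-hour cycle
# (hypothetical values for illustration, not the real taxi data)
hourly_counts = [
    100 + 50 * math.sin(2 * math.pi * hour / 24) for hour in range(24 * 14)
]

# Autocorrelation at lags of 1 hour, half a day, and one full day
acf = {lag: autocorrelation(hourly_counts, lag) for lag in (1, 12, 24)}
for lag, value in acf.items():
    print(f"lag {lag:>2}: {value:.3f}")
```

For a series with daily seasonality like this one, the value at lag 24 is strongly positive and the value at lag 12 is strongly negative. Peaks of this kind in an auto-correlation plot suggest which past values are useful as predictors of demand.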
These steps are also shown in the training workflow in Figure 12.6 (available on the KNIME Hub at https://kni.me/w/13wY0Bz-2wUAxffc):
Figure 12.6 – The workflow training a Spark random forest model for demand prediction
The first part of the workflow loads the Parquet files into Spark, as introduced in the Accessing the data and loading it into Spark subsection. The downstream parts of the workflow – data exploration, preprocessing, model training and testing, and model evaluation – are introduced...