In this step, the data in the DataFrame is split (using Spark's randomSplit method) into three sets: a training set (used to train the model), a test set (used to evaluate the model and to test its assumptions), and a prediction set (used to make predictions). A record count is then printed for each set:
# split the data 80% / 18% / 2%, using 24 as the random seed
splitted_data = df_data.randomSplit([0.8, 0.18, 0.02], 24)
train_data = splitted_data[0]
test_data = splitted_data[1]
predict_data = splitted_data[2]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))
print("Number of prediction records : " + str(predict_data.count()))
The following screenshot shows the output of executing the preceding commands in the notebook:
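Note that randomSplit assigns each row to a split probabilistically, so the resulting counts only approximate the requested 80/18/2 proportions; the weights are normalized internally, and the second argument (24) is a seed that makes the split reproducible. The following minimal sketch, which uses a hypothetical SparkSession and a toy range DataFrame in place of df_data, illustrates both behaviors:

from pyspark.sql import SparkSession

# Hypothetical session and toy DataFrame standing in for df_data
spark = SparkSession.builder.appName("split-demo").getOrCreate()
df_toy = spark.range(1000)  # 1,000 rows with a single 'id' column

# Identical weights and seed produce identical splits
splits_a = df_toy.randomSplit([0.8, 0.18, 0.02], 24)
splits_b = df_toy.randomSplit([0.8, 0.18, 0.02], 24)

# Counts only approximate [800, 180, 20], because rows are
# assigned to splits probabilistically...
print([s.count() for s in splits_a])
# ...but the shared seed makes the two splits match exactly
print([s.count() for s in splits_b])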