Now that we have the initial raw dataset, we are going to shuffle it, split it into a training subset and a held-out subset, and upload both to an S3 bucket.
Preparing the data
Splitting the data
As we saw in Chapter 2, Machine Learning Definitions and Concepts, building and selecting the best model requires splitting the dataset into three parts: training, validation, and test, with the usual ratios being 60%, 20%, and 20%. The training and validation sets are used to build several models and select the best one, while the test set, which is held out from model building, is used for the final performance evaluation on previously unseen data. We will use the held-out subset in Chapter 6, Predictions and Performances to simulate batch predictions with the model we build in Chapter...
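To make the shuffle-and-split step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the procedure used in this book: the function name `shuffle_and_split`, the fixed seed, and the toy list of integers standing in for dataset rows are all hypothetical. It applies the 60%, 20%, and 20% ratios mentioned above.

```python
import random


def shuffle_and_split(rows, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows, then cut them into training, validation, and test lists.

    Hypothetical helper for illustration; the ratios default to 60/20/20.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split

    n = len(shuffled)
    train_end = round(n * ratios[0])
    valid_end = train_end + round(n * ratios[1])

    # The test (held-out) subset takes whatever remains after the first two cuts.
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


# Toy data: 100 integers standing in for the rows of the raw dataset.
rows = list(range(100))
train, valid, test = shuffle_and_split(rows)
print(len(train), len(valid), len(test))
```

Shuffling before cutting matters because raw files are often ordered (by date, by class, by source), and an unshuffled split would give the three subsets different distributions. Fixing the seed keeps the split reproducible, so the held-out rows stay the same from one run to the next.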