Preparing the data
Now that we have the initial raw dataset, we are going to shuffle it, split it into a training subset and a held-out subset, and upload both to an S3 bucket.
Splitting the data
As we saw in Chapter 2, Machine Learning Definitions and Concepts, in order to build and select the best model, we need to split the dataset into three parts: training, validation, and test, with the usual ratios being 60%, 20%, and 20%. The training and validation sets are used to build several models and select the best one, while the test set, also called the held-out set, is used for the final performance evaluation on previously unseen data. We will use the held-out subset in Chapter 6, Predictions and Performances, to simulate batch predictions with the model we build in Chapter 5, Model Creation.
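The shuffle-then-split idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the procedure used later in the book: the function name shuffle_and_split, the ratios and seed parameters, and the toy list of 1,000 integers standing in for dataset rows are all hypothetical.

```python
import random


def shuffle_and_split(rows, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows, then cut them into consecutive subsets sized by ratios.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    # Work on a copy so the caller's data is left untouched.
    shuffled = list(rows)
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)

    subsets = []
    start = 0
    cumulative = 0.0
    for i, ratio in enumerate(ratios):
        cumulative += ratio
        # The last subset takes whatever remains, so no row is dropped
        # because of rounding.
        if i == len(ratios) - 1:
            end = len(shuffled)
        else:
            end = round(cumulative * len(shuffled))
        subsets.append(shuffled[start:end])
        start = end
    return subsets


# Toy stand-in for a dataset: 1,000 rows identified by an integer.
rows = list(range(1000))
train, validation, test = shuffle_and_split(rows)
print(len(train), len(validation), len(test))
```

Passing ratios=(0.8, 0.2) to the same helper would return two subsets instead of three. Shuffling before cutting matters because a dataset that is sorted on some column would otherwise put systematically different rows in each subset.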
Since Amazon ML does the job of splitting the dataset used for model training and model evaluation into training and validation subsets, we only need to split our initial dataset into two parts: the global training/evaluation subset (80%) for...