Converting CSV data into protobuf recordIO format
In this recipe, we will convert and serialize the synthetic data stored in CSV format into the protobuf recordIO format. With the data serialized in this format, we can take advantage of Pipe mode, in which training jobs start faster because they stream data directly from the S3 bucket source. In addition, the SageMaker algorithms may perform much better with this training file format.
Getting ready
This recipe continues from Generating a synthetic dataset for analysis and transformation.
How to do it…
In the first few steps of this recipe, we will focus on scaling and transforming the synthetic labeled dataset into a set of values between 0 and 1 using MinMaxScaler from sklearn:
- Navigate to the my-experiments/chapter04 directory inside your SageMaker notebook instance. Feel free to create this directory if it does not exist yet.
- Create a new notebook using the conda_python3...
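To make the scaling step concrete, the following is a minimal sketch of how MinMaxScaler maps each column to the 0 to 1 range. The small array used here is a hypothetical stand-in and not the synthetic dataset from the previous recipe, and the variable names are assumptions made for illustration only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: two columns on very different scales.
# This is NOT the synthetic dataset from the previous recipe.
data = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 800.0],
])

# With the default feature_range=(0, 1), each column is rescaled
# independently as (x - column_min) / (column_max - column_min).
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)

# The fitted scaler keeps the per-column minimum and maximum, so the
# original values can be recovered later if needed.
restored = scaler.inverse_transform(scaled)

print(scaled)
```

Keeping the fitted scaler object around is useful because the same transformation can then be applied to new records with scaler.transform() before they are passed to a trained model.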