Data preparation at scale with SageMaker Processing
Now let's turn our attention to preparing the entire dataset. At 500 GB, it is far too large to process with scikit-learn on a single EC2 instance, so we will write a SageMaker Processing job that uses Spark ML for data preparation. (Alternatively, you could use Dask, but at the time of writing, SageMaker Processing does not provide a Dask container out of the box.)
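To make the idea concrete before we launch anything, here is a minimal sketch of what such a Spark ML preprocessing script could look like. The column names (country, value), the pipeline stages, and the argument names are illustrative assumptions, not the exact code from the chapter's notebook.

```python
# preprocess.py - a hedged sketch of a Spark ML preprocessing script.
import argparse

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-uri", required=True)    # S3 URI of the source data (assumed argument name)
    parser.add_argument("--output-uri", required=True)   # S3 URI for the processed output (assumed argument name)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("openaq-preprocessing").getOrCreate()

    # Read the dataset directly from S3; the Spark executors parallelize the I/O.
    # Here we read the Parquet version of the data, discussed later in this section.
    df = spark.read.parquet(args.input_uri)

    # Encode a categorical column and assemble features into a vector
    # (illustrative columns only).
    indexer = StringIndexer(inputCol="country", outputCol="country_idx",
                            handleInvalid="keep")
    assembler = VectorAssembler(inputCols=["country_idx", "value"],
                                outputCol="features")
    pipeline = Pipeline(stages=[indexer, assembler])

    processed = pipeline.fit(df).transform(df)

    # Write the processed dataset back to S3 in Parquet format.
    processed.write.mode("overwrite").parquet(args.output_uri)
```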
The Processing Job part of this chapter's notebook walks you through launching the processing job. Note that we'll use a cluster of 15 EC2 instances to run the job (if you need your instance limits raised, you can contact AWS Support).
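For reference, here is a hedged sketch of how such a job can be launched with the SageMaker Python SDK's PySparkProcessor. The script name, S3 URIs, instance type, and timeout are placeholder assumptions; the chapter's notebook contains the actual values.

```python
import sagemaker
from sagemaker.spark.processing import PySparkProcessor

role = sagemaker.get_execution_role()

spark_processor = PySparkProcessor(
    base_job_name="openaq-preprocessing",
    framework_version="3.1",          # Spark version of the built-in container
    role=role,
    instance_count=15,                # the 15-instance cluster mentioned above
    instance_type="ml.m5.4xlarge",    # assumed instance type
    max_runtime_in_seconds=7200,      # assumed timeout
)

spark_processor.run(
    submit_app="preprocess.py",       # the Spark ML script sketched earlier
    arguments=[
        "--input-uri", "s3://my-bucket/openaq/parquet/",      # placeholder URIs
        "--output-uri", "s3://my-bucket/openaq/processed/",
    ],
)
```

Once the job starts, SageMaker provisions the cluster, runs the Spark application across the instances, and tears everything down when the job completes, so you only pay for the duration of the run.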
Also note that, up until now, we've been working with the uncompressed JSON version of the data. This format, which consists of thousands of small JSON files, is not ideal for Spark processing, as the executors would spend most of their time doing I/O. Luckily, the OpenAQ dataset also includes a gzipped Parquet version of the data. Compression...