Running batch jobs with Amazon SageMaker Processing
As discussed in the previous section, datasets usually need quite a bit of work to be ready for training. Once training is complete, you may also want to run additional jobs to post-process the predicted data and to evaluate your model on different datasets.
Once the experimentation phase is complete, it's good practice to start automating all these jobs, so that you can run them on demand with little effort.
Discovering the Amazon SageMaker Processing API
The Amazon SageMaker Processing API is part of the SageMaker SDK, which we installed in Chapter 1, Introducing Amazon SageMaker.
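If you want to confirm that the SDK is available in your environment, a quick sanity check (a minimal sketch) is to import the package and print its version:

# Verify that the SageMaker SDK is installed.
import sagemaker
print(sagemaker.__version__)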
SageMaker Processing jobs run inside Docker containers, and you can choose from the following options (a minimal code sketch follows the list):
- A built-in container for scikit-learn (https://scikit-learn.org)
- A built-in container for PySpark (https://spark.apache.org/docs/latest/api/python/), which supports distributed processing jobs
- Your own custom container
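To give you a feel for the API, here is a minimal sketch that launches a job on the built-in scikit-learn container. The script name (preprocessing.py), the S3 locations, the instance type, and the framework version are placeholder assumptions that you would replace with your own values:

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# The execution role grants the job access to your data in S3.
# get_execution_role() works inside SageMaker; elsewhere, pass an IAM role ARN.
role = sagemaker.get_execution_role()

# Configure the built-in scikit-learn container.
sklearn_processor = SKLearnProcessor(
    framework_version='0.23-1',   # a version supported by the built-in container
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1)

# Run our processing script on the input dataset.
sklearn_processor.run(
    code='preprocessing.py',      # hypothetical processing script
    inputs=[ProcessingInput(
        source='s3://my-bucket/input/',              # hypothetical S3 input location
        destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(
        output_name='train_data',
        source='/opt/ml/processing/output/train')])  # copied back to S3 when the job ends

The built-in PySpark container exposes a similar interface through the PySparkProcessor class in sagemaker.spark.processing, and raising instance_count lets it distribute the work across a cluster of instances.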
Logs are available in Amazon CloudWatch Logs...