Large-Scale Training on SageMaker
In this chapter, we cover the key Amazon SageMaker features for running highly optimized distributed training. You’ll learn how to optimize your script for SageMaker training and how to use the service’s key usability features. You’ll also learn about SageMaker’s backend optimizations for distributed training, such as GPU health checks, resilient training, checkpointing, and script mode.
This chapter covers the following topics:
- Optimizing your script for SageMaker training
- Top usability features for SageMaker training