Chapter 9: Scaling Your Training Jobs
In the four previous chapters, you learned how to train models with built-in algorithms, frameworks, or your own code.
In this chapter, you'll learn how to scale training jobs, allowing them to train on larger datasets while keeping training time and cost under control. We'll start by discussing when and how to make scaling decisions, based on monitoring information and simple guidelines. You'll also see how to collect profiling information with Amazon SageMaker Debugger in order to understand how efficient your training jobs are. Then, we'll look at several key scaling techniques: pipe mode, distributed training, data parallelism, and model parallelism. After that, we'll launch a training job on the large ImageNet dataset and see how to scale it. Finally, we'll discuss storage alternatives to S3 for large-scale training, namely Amazon EFS and Amazon FSx for Lustre.
We'll cover the following...