Chapter 9: Scaling Your Training Jobs
In the four previous chapters, you learned how to train models with built-in algorithms, frameworks, or your own code.
In this chapter, you'll learn how to scale training jobs so that they can train on larger datasets while keeping training time and cost under control. We'll start by discussing when and how to make scaling decisions, using monitoring information and simple guidelines. Then, we'll look at pipe mode and distributed training, two key techniques for scaling. We'll also discuss storage alternatives to S3 for large-scale training. Finally, we'll launch a large training job on the ImageNet dataset.
We'll cover the following topics:
- Understanding when and how to scale
- Streaming datasets with pipe mode
- Distributing training jobs
- Using other storage services
- Training an image classification model on ImageNet