Summary
In this chapter, you learned how and when to scale training jobs. You saw that finding the best setup takes careful analysis and experimentation: scaling up versus scaling out, CPU versus GPU versus multi-GPU, and so on. This should help you make the right decisions for your own workloads and avoid costly mistakes.
You also learned how to achieve significant speedups with techniques such as distributed training, RecordIO, sharding, and pipe mode. Finally, you learned how to set up Amazon EFS and Amazon FSx for Lustre for large-scale training jobs, as recapped in the sketch below.
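As a quick recap, the following minimal sketch (using the SageMaker Python SDK) shows how sharding, pipe mode, and an FSx for Lustre file system can be passed as training inputs. The bucket path, file system ID, and directory path are placeholders, and `estimator` stands for an estimator you have already configured (for FSx, it must also be attached to the appropriate VPC subnets and security groups).

```python
from sagemaker.inputs import TrainingInput, FileSystemInput

# S3 input: shard objects across training instances and stream them
# with pipe mode instead of copying the full dataset to each instance.
train_input = TrainingInput(
    "s3://my-bucket/training/",          # placeholder S3 prefix
    distribution="ShardedByS3Key",       # each instance receives a shard
    input_mode="Pipe",                   # stream data as it is read
)

# FSx for Lustre input: mount a high-throughput file system
# instead of downloading data from S3 at the start of the job.
fsx_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder FSx ID
    file_system_type="FSxLustre",
    directory_path="/fsx/dataset",           # path under the mount name
    file_system_access_mode="ro",
)

# Pass whichever input channel matches your setup to fit().
estimator.fit({"training": train_input})
# estimator.fit({"training": fsx_input})
```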
In the next chapter, we'll cover advanced features for hyperparameter optimization, cost optimization, model debugging, and more.