Distributed training on Amazon SageMaker
In the last chapter, we learned about SageMaker generally. Now, I'd like to dive into its distributed training capabilities. We can break these into four categories: containers, orchestration, usability, and performance at scale.
As we learned in an earlier chapter, AWS offers deep learning (DL) containers that you can easily point your own scripts and code at. These are strongly recommended as the starting point for your project, because all of the frameworks, versions, and libraries have been tested and integrated for you. This means you can simply pick a container based on whichever DL framework you are using (for example, PyTorch or TensorFlow), knowing it has already been validated on AWS and SageMaker. You can also select the GPU version of the container, which ships with the NVIDIA libraries already compiled and installed to run nicely on your GPUs. If you have your own container...
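The selection logic described above can be sketched in a few lines of Python. The image URIs and the registry dictionary below are hypothetical, made up purely for illustration; in practice, the SageMaker Python SDK's `sagemaker.image_uris.retrieve()` function resolves the real, tested container URI for a given framework, version, and instance type.

```python
# Illustrative sketch: choosing a DL container image by framework and
# device type. All image URIs below are hypothetical placeholders.

# Hypothetical registry mapping framework -> device -> image URI.
DL_CONTAINERS = {
    "pytorch": {
        "cpu": "example.registry/pytorch-training:2.0-cpu",
        "gpu": "example.registry/pytorch-training:2.0-gpu",
    },
    "tensorflow": {
        "cpu": "example.registry/tensorflow-training:2.13-cpu",
        "gpu": "example.registry/tensorflow-training:2.13-gpu",
    },
}

# A subset of SageMaker instance families that carry NVIDIA GPUs.
GPU_FAMILIES = ("ml.p3", "ml.p4", "ml.g4", "ml.g5")


def pick_container(framework: str, instance_type: str) -> str:
    """Return the container URI matching the framework and the device
    implied by the instance type (GPU families get the GPU image)."""
    device = "gpu" if instance_type.startswith(GPU_FAMILIES) else "cpu"
    return DL_CONTAINERS[framework][device]


print(pick_container("pytorch", "ml.p4d.24xlarge"))  # GPU image
print(pick_container("tensorflow", "ml.c5.xlarge"))  # CPU image
```

The point of this sketch is simply that the choice collapses to two axes, framework and device, because AWS has already done the integration work inside each image.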