Summary
In this chapter, we walked through how to train deep learning models using SageMaker's distributed training libraries: data parallel and model parallel. We ran a TensorFlow example showing how to modify a training script to use SageMaker's distributed data parallel library with eight GPU devices instead of the single device we used previously. This allows us to increase the batch size and reduce the number of iterations needed to cover the entire dataset in each epoch, improving the model training runtime. We then showed how to apply SageMaker's distributed model parallel library to a model training script written in PyTorch. This allows us to train a much larger neural network by partitioning the model across all GPU devices. We also showed how to monitor compute resource utilization in a training job using SageMaker Debugger and visualize the metrics in the SageMaker Debugger insights dashboard. Lastly, we explained how to adapt your training script...
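For quick reference, the data parallel script changes discussed in this chapter typically follow a pattern like the sketch below. This is a minimal outline, not the chapter's actual example: it assumes TensorFlow 2.x and the smdistributed.dataparallel package available in SageMaker's TensorFlow training containers, and names such as model, optimizer, loss_fn, and train_step are placeholders.

```python
# Minimal sketch of the usual script changes for SageMaker's
# distributed data parallel library with TensorFlow. Placeholder
# names (model, loss_fn, optimizer) stand in for the chapter's code.
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()  # initialize the data parallel communication backend

# Pin each training process to its own GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Scale the learning rate by the number of workers (a common heuristic)
optimizer = tf.keras.optimizers.SGD(0.001 * sdp.size())

@tf.function
def train_step(model, loss_fn, images, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Wrap the tape so gradients are all-reduced across all GPUs
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Broadcast initial weights so every worker starts identically
        sdp.broadcast_variables(model.variables, root_rank=0)
        sdp.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss
```

Because each of the eight GPUs processes its own shard of every batch, the effective batch size grows with the number of workers, which is why the summary notes fewer iterations per epoch.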