Training with the SageMaker data and model parallel libraries
These two libraries were introduced in late 2020 and significantly improve the performance of large-scale training jobs.
The SageMaker Distributed Data Parallel (DDP) library implements a very efficient distribution of computation on GPU clusters. It optimizes network communication by offloading gradient synchronization away from the GPUs, maximizing the amount of time and resources that the GPUs spend on training. You can learn more at the following link:
DDP is available for TensorFlow, PyTorch, and Hugging Face. The first two require minor modifications to the training code, but the last one doesn't. As DDP only makes sense for large, long-running training jobs, the available instance sizes are ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge.
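To make this concrete, here is a minimal sketch of launching a data-parallel PyTorch job with the SageMaker Python SDK, where DDP is enabled through the estimator's distribution parameter. The script name, IAM role, S3 path, and framework versions are placeholders, not values from this chapter:

from sagemaker.pytorch import PyTorch

# Hypothetical training script and IAM role, shown for illustration only
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="1.10",
    py_version="py38",
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    # Enable the SageMaker data parallel library for this job
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# Placeholder dataset location in S3
estimator.fit("s3://my-bucket/my-dataset")

The only change compared to a single-GPU job is the distribution argument; the training script itself needs at most the minor modifications mentioned above.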
The SageMaker Distributed Model Parallel (DMP) library solves a different problem: training models that are too large to fit in the memory of a single GPU.