ML training at scale with SageMaker distributed libraries
Two common scale challenges in ML projects are scaling training data and scaling model size. While increased training data volume, model size, and complexity can result in a more accurate model, there is a limit to the data volume and model size that a single compute node, CPU, or GPU can handle. Larger training data volumes and model sizes typically mean more computation, so training jobs take longer to finish, even on powerful compute instances such as Amazon Elastic Compute Cloud (EC2) P3 and P4 instances.
Distributed training is a commonly used technique to speed up training when dealing with scale challenges. The training load can be distributed either across multiple compute instances (nodes) or across multiple CPUs and GPUs (devices) on a single compute instance. There are two strategies for distributed training: data parallelism and model parallelism.
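As a minimal sketch of the data parallelism strategy, the snippet below launches a training job across two multi-GPU nodes using the SageMaker data parallel library via the PyTorch estimator's distribution parameter. The entry point script, IAM role, S3 path, and hyperparameter choices are placeholders, and the exact framework versions supported by the library may differ from those shown.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    framework_version="1.13",
    py_version="py39",
    instance_count=2,                 # distribute the training load across 2 nodes
    instance_type="ml.p4d.24xlarge",  # 8 GPUs per node
    # Enable the SageMaker distributed data parallel library: each GPU holds a
    # model replica and processes a different shard of every mini-batch.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    sagemaker_session=session,
)

estimator.fit({"training": "s3://my-bucket/training-data"})  # placeholder S3 input
```

With this configuration, SageMaker provisions the cluster, runs one process per GPU, and averages gradients across all devices so the replicas stay in sync, which is the essence of the data parallelism strategy described above.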