Distributing training jobs
Distributed training lets you scale training jobs by running them on a cluster of CPU or GPU instances. Each instance trains either on the full dataset or on a fraction of it, depending on the data distribution policy that we configure: FullyReplicated copies the full dataset to every instance, while ShardedByS3Key distributes an equal share of the input files to each instance, which is where splitting your dataset into many files comes in handy.
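For instance, here's a minimal sketch with the SageMaker Python SDK (the bucket and prefix are placeholders) showing how each policy is selected through the distribution parameter of a TrainingInput object:

from sagemaker.inputs import TrainingInput

# Default policy: every training instance receives a full copy of the dataset.
train_full = TrainingInput(
    's3://my-bucket/training/',          # placeholder S3 prefix
    distribution='FullyReplicated',
    content_type='text/csv')

# Sharded policy: each instance receives an equal share of the input files,
# so the dataset should be split into at least as many files as instances.
train_sharded = TrainingInput(
    's3://my-bucket/training/',
    distribution='ShardedByS3Key',
    content_type='text/csv')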
Distributing training for built-in algorithms
Distributed training is available for almost all built-in algorithms. Semantic Segmentation and LDA are notable exceptions.
As built-in algorithms are implemented with Apache MXNet, training instances use its Key-Value Store to exchange results. It's set up automatically by SageMaker on one of the training instances. Curious minds can learn more at https://mxnet.apache.org/api/faq/distributed_training.
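From our perspective, all it takes is asking for more than one instance. Here's a minimal sketch (the bucket, role, and hyperparameters are placeholders) that trains the Linear Learner built-in algorithm on two instances; SageMaker takes care of setting up the key-value store behind the scenes:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Container image for the Linear Learner built-in algorithm
image_uri = sagemaker.image_uris.retrieve('linear-learner', session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role=sagemaker.get_execution_role(),   # assumes a SageMaker notebook environment
    instance_count=2,                      # more than one instance: distributed training
    instance_type='ml.m5.xlarge',
    output_path='s3://my-bucket/output',   # placeholder bucket
    sagemaker_session=session)

estimator.set_hyperparameters(predictor_type='regressor', mini_batch_size=100)

# The training channel uses FullyReplicated by default, so both instances
# see the full dataset and exchange their results through the key-value store.
estimator.fit({'train': 's3://my-bucket/training/'})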
Distributing training for built-in frameworks
You can use distributed...