Distributing training jobs
Distributed training lets you scale training jobs by running them on a cluster of CPU or GPU instances. It addresses two different problems: datasets that are too large, and models that are too large.
Understanding data parallelism and model parallelism
Some datasets are too large to train on in a reasonable amount of time with a single CPU or GPU. Using a technique called data parallelism, we can distribute the data across the training cluster. The full model is still loaded on each CPU or GPU, but each one receives only an equal share of the dataset, not the whole dataset. In theory, this should speed up training linearly with the number of CPUs or GPUs involved; as you can guess, reality is often different.
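As an illustration, here is a minimal sketch of data parallelism, assuming PyTorch and its DistributedDataParallel wrapper (the framework choice, the toy model, and the synthetic dataset are all assumptions, not something this section prescribes). Each process holds a full replica of the model, reads only its own shard of the data through DistributedSampler, and gradients are averaged across processes during the backward pass.

# A minimal sketch of data parallelism with PyTorch DistributedDataParallel.
# The model and dataset are placeholders; launch with one process per GPU, e.g.:
#   torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # The full model is replicated on every GPU...
    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ...but each process only ever sees its own shard of the dataset.
    dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()       # gradients are averaged across all processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()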
Believe it or not, some state-of-the-art deep learning models are too large to fit on a single GPU. Using a technique called model parallelism, we can split the model and distribute its layers across a cluster of GPUs. During training, batches then flow from GPU to GPU as they pass through successive groups of layers.
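To make the idea concrete, here is a minimal sketch of model parallelism, again assuming PyTorch; TwoStageNet is a hypothetical two-stage example on two GPUs, not a real library class. The first group of layers is placed on one GPU and the second group on another, so each batch has to move from device to device as it crosses the split point.

# A minimal sketch of model parallelism: the layers of a single model are
# split across two GPUs, and activations move from one device to the next.
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        # The first group of layers lives on GPU 0...
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # ...and the second group lives on GPU 1.
        self.stage2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        # The batch flows across GPUs as it crosses the split point.
        x = self.stage1(x.to("cuda:0"))
        x = self.stage2(x.to("cuda:1"))
        return x

model = TwoStageNet()
x = torch.randn(64, 1024)
y = torch.randint(0, 10, (64,)).to("cuda:1")  # labels live where the output is
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()  # autograd handles the cross-device backward pass

Note that a naive split like this keeps only one GPU busy at a time; real-world setups usually add pipelining so that different GPUs work on different micro-batches simultaneously.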