Multi-GPU
CIFAR and MNIST images are small, below 35x35 pixels. Training on natural images requires preserving detail in the images, so a typical input size is 224x224, which is roughly 40 times as many pixels. When an image classification network with such an input size has a few hundred layers, GPU memory limits the batch size to about a dozen images, and training on a batch takes a long time.
To work in multi-GPU mode:
The model parameters live in a shared variable, meaning they are shared between the CPU and GPUs 1, 2, 3, and 4, as in single-GPU mode.
The batch is divided into four splits, and each split is sent to a different GPU for computation. Each GPU computes the network output on its split and back-propagates the error to obtain the gradient with respect to each weight, then returns these gradient values.
The gradients for each weight are fetched back from the multiple GPUs to the CPU and combined. The combined gradients represent the gradient of the full initial batch.
The update rule is then applied with this combined gradient, exactly as if the full batch had been processed on a single GPU.
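The steps above can be sketched in plain Python, simulating the four GPUs on the CPU. The model (a 1-D linear regression), the learning rate, and the worker count are illustrative assumptions, not part of the original text; the point is only that averaging the per-split gradients of equal-sized splits reproduces the full-batch gradient before the update is applied.

```python
# Simulated data-parallel training step for a toy model y = w * x.
# The data, model, and n_gpus below are illustrative assumptions.

def grad(xs, ys, w):
    """Gradient of the mean squared error over one split."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # generated with w = 2
w = 0.0           # shared parameter, as held on the CPU
n_gpus = 4

# 1. Divide the batch into equal splits, one per (simulated) GPU
size = len(xs) // n_gpus
splits = [(xs[i:i + size], ys[i:i + size]) for i in range(0, len(xs), size)]

# 2. Each "GPU" computes the gradient on its own split
split_grads = [grad(sx, sy, w) for sx, sy in splits]

# 3. Fetch the gradients back and combine them on the CPU; with
#    equal split sizes, averaging gives the full-batch gradient
g = sum(split_grads) / n_gpus

# Sanity check: identical to computing the gradient on the whole batch
assert abs(g - grad(xs, ys, w)) < 1e-12

# 4. Apply the update rule with the combined gradient
lr = 0.05
w -= lr * g
```

In a real setup, steps 2 and 3 run concurrently on the devices and the transfer back to the CPU (or an all-reduce between GPUs) is the main communication cost.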