Distributed computing
DL models have to be trained on large amounts of data to improve their performance. However, training a deep network with millions of parameters can take days or even weeks. In Large Scale Distributed Deep Networks, Dean et al. proposed two paradigms, namely model parallelism and data parallelism, which allow us to train and serve a network model across multiple physical machines. In the following sections, we introduce these paradigms with a focus on distributed TensorFlow capabilities.
Model parallelism
Model parallelism gives every processor the same data but applies a different part of the model to it. If the network model is too big to fit into one machine's memory, different parts of the model can be assigned to different machines. A possible model parallelism approach is to place the first layer on one machine (node 1), the second layer on a second machine (node 2), and so on. However, this is often not the optimal approach, because the last layer has to wait for the first layer's output during the forward pass, and the first layer in turn has to wait for the gradients propagated back from the last layer during backpropagation.
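To make the pipeline dependency concrete, here is a minimal pure-Python sketch (not distributed TensorFlow) that simulates the two machines as threads connected by queues; the layer weights, the `node1`/`node2` names, and the tiny list-based matrices are all hypothetical choices for illustration only.

```python
import queue
import threading

# Hypothetical sketch: each thread plays the role of one machine holding
# one layer of the model; queues stand in for the network links between them.

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -1.0], [0.5, 2.0]]   # first layer, held by "machine" 1
W2 = [[1.0, 1.0]]                # second layer, held by "machine" 2

link = queue.Queue()      # node 1 -> node 2
results = queue.Queue()   # node 2 -> driver

def node1(x):
    # Machine 1 computes the first layer and ships its activations to machine 2.
    link.put(relu(matvec(W1, x)))

def node2():
    # Machine 2 must wait for machine 1's output before it can start:
    # this blocking get() is exactly the pipeline dependency described above.
    h = link.get()
    results.put(matvec(W2, h))

t1 = threading.Thread(target=node1, args=([1.0, 1.0],))
t2 = threading.Thread(target=node2)
t2.start()
t1.start()
t1.join()
t2.join()

out = results.get()
print(out)  # the forward-pass result of the two-layer "model"
```

In real distributed TensorFlow, the same idea is expressed by pinning each layer's operations to a different device or worker, and the framework handles the inter-machine communication that the queues stand in for here.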