As stated earlier, we will work through a systematic example of classifying a large collection of video clips from the UCF101 dataset using a convolutional-LSTM network. First, however, we need to know how to distribute training across multiple GPUs. In previous chapters, we discussed several advanced techniques, such as network weight initialization, batch normalization, faster optimizers, and proper activation functions, all of which help the network converge faster. Even so, training a large neural network on a single machine can take days or even weeks, which makes it impractical for large-scale datasets.
Theoretically, there are two main methods for the distributed training of neural networks: data parallelism and model parallelism. In data parallelism, each worker holds a complete copy of the model and trains on a different subset of the data, with the parameters periodically synchronized; in model parallelism, the model itself is split across devices. DL4J relies on data parallelism, called distributed deep learning...
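To make the idea concrete, the following is a minimal sketch of data-parallel training on a single multi-GPU machine with DL4J's ParallelWrapper. The tiny network, the MNIST stand-in dataset, and the worker/averaging settings are illustrative placeholders only, not the chapter's actual convolutional-LSTM configuration for UCF101.

```java
import org.deeplearning4j.datasets.iterator.impl.MnistDataSetIterator;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.parallelism.ParallelWrapper;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class DataParallelTrainingSketch {
    public static void main(String[] args) throws Exception {
        // A small placeholder network; the real convolutional-LSTM for UCF101
        // is configured later in the chapter.
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(123)
                .list()
                .layer(0, new DenseLayer.Builder().nIn(784).nOut(256)
                        .activation(Activation.RELU).build())
                .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .nIn(256).nOut(10)
                        .activation(Activation.SOFTMAX).build())
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();

        // MNIST is used here purely as a readily available stand-in dataset.
        DataSetIterator trainData = new MnistDataSetIterator(64, true, 123);

        // ParallelWrapper clones the model onto each worker (typically one per GPU),
        // feeds each clone a different mini-batch, and periodically averages the
        // parameters back into the central model: this is data parallelism.
        ParallelWrapper wrapper = new ParallelWrapper.Builder<>(net)
                .prefetchBuffer(24)            // mini-batches pre-loaded per worker
                .workers(2)                    // number of model replicas (e.g. GPUs)
                .averagingFrequency(3)         // average parameters every 3 iterations
                .reportScoreAfterAveraging(true)
                .build();

        wrapper.fit(trainData);                // one epoch of data-parallel training
        wrapper.shutdown();
    }
}
```

Note that the central model is never trained directly; each worker trains its own replica and the averaged parameters are written back, which is why the averaging frequency trades off synchronization cost against how far the replicas are allowed to drift apart.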