Notes on intra-layer model parallelism
Here, we discuss intra-layer model parallelism in more detail.
Intra-layer model parallelism is a good way to split giant NLP models: it partitions the model within each layer without introducing significant communication overhead during forward and backward propagation. Each split typically adds only a single All-Reduce operation, in either the forward or the backward pass, which is an acceptable cost.
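To make the single-All-Reduce claim concrete, here is a minimal sketch of the row-parallel linear pattern, where partial results are summed with one All-Reduce in the forward pass (the column-parallel counterpart instead needs one in the backward pass). This is an illustration under assumptions, not any library's actual implementation; names such as `tp_group` and `RowParallelLinear` are hypothetical.

```python
import torch
import torch.distributed as dist

class _SumPartialOutputs(torch.autograd.Function):
    """All-reduce partial results in the forward pass; gradients pass through unchanged."""
    @staticmethod
    def forward(ctx, partial, group):
        dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=group)
        return partial

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None

class RowParallelLinear(torch.nn.Module):
    """Holds only a slice of the full weight along the input dimension."""
    def __init__(self, in_features, out_features, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(tp_group)
        # Each tensor-parallel rank stores in_features // tp_size input columns.
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features // tp_size))

    def forward(self, x_shard):
        # x_shard is this rank's slice of the input features.
        partial = torch.nn.functional.linear(x_shard, self.weight)
        # The only communication for this split: one All-Reduce in the forward pass.
        return _SumPartialOutputs.apply(partial, self.tp_group)
```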
In addition, intra-layer model parallelism can easily be combined with data-parallel training. On a multi-machine, multi-GPU system, we can apply intra-layer parallelism within each machine, because GPUs within a machine usually have high communication bandwidth, and apply data parallelism across machines, as in the sketch below.
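The following is a hedged sketch of how the two kinds of process groups could be laid out: tensor-parallel groups keep the intra-layer splits inside one machine (high-bandwidth links), while data-parallel groups span machines. The function name, `gpus_per_node` parameter, and rank layout are assumptions for illustration.

```python
import torch.distributed as dist

def build_groups(gpus_per_node: int):
    dist.init_process_group(backend="nccl")
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    tp_group, dp_group = None, None
    # Tensor-parallel groups: consecutive ranks on the same machine.
    for start in range(0, world_size, gpus_per_node):
        ranks = list(range(start, start + gpus_per_node))
        group = dist.new_group(ranks)
        if rank in ranks:
            tp_group = group
    # Data-parallel groups: ranks holding the same model shard, one per machine.
    for offset in range(gpus_per_node):
        ranks = list(range(offset, world_size, gpus_per_node))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return tp_group, dp_group
```

Intra-layer All-Reduce calls then use `tp_group`, while gradient synchronization for data parallelism uses `dp_group`.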
Finally, we generally believe intra-layer model parallelism is mostly applicable to NLP models. In other words, for convolutional neural network (CNN) models, whose individual layers are much smaller than the giant matrix multiplications in Transformer-based NLP models, splitting within a layer brings far less benefit.