Summary
In this chapter, we discussed two major bottlenecks in data parallel training: communication and on-device memory.
Communication becomes a bottleneck during model synchronization. To make matters worse, Ring All-Reduce leaves idle any network links that cannot be formed into a ring. We therefore introduced a tree-based All-Reduce solution, which uses these links more efficiently and achieves faster model synchronization than ring-based approaches.
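As a minimal sketch (not the chapter's own code), the snippet below spawns two worker processes with torch.distributed and synchronizes a gradient tensor with a single all_reduce call. The MASTER_ADDR/MASTER_PORT values and the gloo backend are assumptions chosen so it runs on a single CPU machine; with the NCCL backend, the underlying ring- or tree-based collective algorithm can typically be influenced via the NCCL_ALGO environment variable.

import os
import torch
import torch.distributed as dist

def run(rank: int, world_size: int):
    # Hypothetical single-machine rendezvous settings; adjust for a real cluster.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each worker holds its own local gradients; all_reduce sums them in place,
    # so every worker ends up with identical, synchronized gradients.
    grads = torch.ones(4) * (rank + 1)
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {grads.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)

Each rank prints the same summed tensor, which is exactly the property model synchronization relies on; only the algorithm that produces the sum (ring, tree, and so on) differs in cost.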
To mitigate the on-device memory bottleneck, we discussed two major methods: recomputation and quantization.
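The following is a rough sketch of both ideas under stated assumptions: torch.utils.checkpoint stands in for recomputation (activations inside the wrapped block are recomputed during the backward pass instead of being stored), and a simple float16 cast stands in for quantization. Neither is claimed to be the chapter's exact recipe.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
x = torch.randn(32, 1024, requires_grad=True)

# Recomputation: intermediate activations inside `block` are not kept;
# they are recomputed in backward, trading extra compute for lower memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

# Quantization (here, plain half-precision casting): each element shrinks
# from 4 bytes (float32) to 2 bytes (float16), halving tensor memory.
w = torch.randn(1024, 1024)
w_fp16 = w.half()
print(w.element_size(), w_fp16.element_size())  # 4 2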
In the next chapter, we will explore model parallelism, another popular paradigm for parallel model training and inference. Instead of splitting the input data, model parallelism partitions the model itself.