All-Reduce architecture
So far, we have discussed the parameter server architecture, its implementation, and its shortcomings.
Next, we will look at the All-Reduce architecture for the data parallel training process.
In the All-Reduce architecture, we abandon the parameter server role entirely. Every node is equivalent, and all of them are worker nodes.
This all-worker methodology directly addresses the two main shortcomings of the parameter server architecture:
- First, since we only have workers, given N nodes we do not need to determine the ratio between parameter servers and workers. We simply treat all N nodes as workers.
- Second, we only need to define worker objects. Furthermore, we leave the burden of implementing communication protocols to standard collective communication libraries such as NCCL and Blink, as sketched below.
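
As a rough illustration of what these libraries take care of, the sketch below uses PyTorch's torch.distributed package (here assumed to be backed by NCCL) to average gradients across workers after a backward pass. The function name average_gradients and the assumption that the process group has already been initialized are illustrative choices, not part of the architecture itself.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers with one all-reduce per tensor.

    Assumes dist.init_process_group(backend="nccl", ...) has already been
    called on every worker, so every node runs this same code path.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # All-reduce sums the gradient tensor across all workers in place...
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        # ...and dividing by the number of workers turns the sum into an average.
        param.grad /= world_size
```

Note that every worker calls the same collective; there is no dedicated parameter server process. How the tensors are actually exchanged, for example via a ring or tree schedule, is left to the underlying collective communication library.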
The All-Reduce paradigm is borrowed from the traditional Message Passing Interface (MPI) domain. Before we talk...