Issues with the parameter server
In recent years, fewer and fewer machine learning practitioners have been using the parameter server paradigm for their data parallel training jobs. The reasons for this decline in the popularity of the parameter server architecture are twofold.
Given N nodes, it is unclear what the best ratio is between parameter servers and workers.
As we've mentioned previously, in the parameter server architecture, we have two roles:
- Parameter servers:
  - Do no training themselves, so they contribute zero training throughput.
  - Adding more parameter servers increases the aggregate communication bandwidth and reduces model synchronization latency.
- Workers:
  - Adding more workers increases training throughput.
  - More workers also means more data to transfer, which increases model synchronization overhead.
We need to balance training throughput and communication latency. We will discuss this trade-off in the following two cases.
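To make this trade-off concrete, here is a minimal sketch of a hypothetical cost model. The function names, constants (model size, link bandwidth, compute time per step), and the linear communication model are all assumptions made purely for illustration, not taken from any real system or from the parameter server implementations discussed elsewhere.

```python
# A minimal, hypothetical cost model for splitting N nodes between parameter
# servers and workers. Constants and the linear communication model are
# illustrative assumptions, not measurements from a real system.

def per_step_time(num_nodes: int, num_ps: int,
                  model_size_gb: float = 1.0,
                  link_bw_gb_per_s: float = 10.0,
                  compute_time_s: float = 0.5) -> float:
    """Rough wall-clock time of one training step.

    Assumes the model is sharded evenly across parameter servers and every
    worker pushes/pulls its gradients to/from each server, so synchronization
    time grows with (model size / number of servers) * number of workers.
    """
    num_workers = num_nodes - num_ps
    if num_ps < 1 or num_workers < 1:
        raise ValueError("need at least one parameter server and one worker")
    shard_gb = model_size_gb / num_ps                       # shard held per server
    sync_time = shard_gb * num_workers / link_bw_gb_per_s   # simplified push + pull
    return compute_time_s + sync_time


def samples_per_second(num_nodes: int, num_ps: int,
                       batch_per_worker: int = 32, **kwargs) -> float:
    """Training throughput: only workers contribute training samples."""
    num_workers = num_nodes - num_ps
    return num_workers * batch_per_worker / per_step_time(num_nodes, num_ps, **kwargs)


if __name__ == "__main__":
    N = 16
    for ps in range(1, N):
        print(f"{ps:2d} servers / {N - ps:2d} workers -> "
              f"{samples_per_second(N, ps):7.1f} samples/s")
```

With these made-up constants, throughput first rises and then falls as more of the 16 nodes become parameter servers: too few servers and synchronization dominates each step, too many and there are not enough workers left to generate training throughput. This is exactly the balancing problem described above.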
Case 1 – more parameter servers
If we assign more nodes as parameter servers, we have less data to communicate since we have fewer...