Chapter 4: Bottlenecks and Solutions
Using the code we designed in Chapter 3, Building a Data Parallel Training and Serving Pipeline, we can build data parallel training and serving pipelines with either the parameter server or the All-Reduce paradigm. As in that chapter, we will focus here on the more widely used All-Reduce paradigm.
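As a quick refresher on the All-Reduce paradigm, the following is a minimal sketch of how data parallel workers average their gradients after each backward pass. It assumes PyTorch's torch.distributed package; the function name sync_gradients is illustrative and not taken from Chapter 3's code.

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data parallel workers via All-Reduce."""
    # Assumes dist.init_process_group(backend="nccl") was already called,
    # for example by launching each worker process with torchrun.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum each gradient tensor across workers, then divide by the
            # worker count so every replica holds the mean gradient.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical per-iteration flow on each worker (illustrative):
#   loss = criterion(model(inputs), targets)
#   loss.backward()          # compute local gradients
#   sync_gradients(model)    # All-Reduce makes gradients identical everywhere
#   optimizer.step()         # all workers apply the same parameter update
```

Because every worker ends up with identical gradients, all model replicas stay in sync without a central parameter server; the cost is the collective communication itself, which is exactly where the bottlenecks discussed in this chapter arise.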
In this chapter, we will discuss the shortcomings of the current data parallel training and serving pipelines. To keep the discussion of system bottlenecks practical, we will make the following assumptions:
- We use homogeneous accelerators for all our model training nodes.
- Compared to CPU memory (that is, main memory), the on-device memory of each accelerator is limited (see the sketch after this list).
- In multi-GPU, multi-machine cases, the cross-machine network bandwidth is significantly lower than the communication bandwidth among GPUs within a single machine.
- The training job is exclusively...
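To make the second assumption concrete, the short sketch below compares the memory available on one GPU with the host's main memory. It assumes a Linux machine with a CUDA-capable GPU; the os.sysconf calls for host memory are Linux-specific.

```python
import os
import torch

# Host (CPU) memory in bytes; os.sysconf works on Linux.
host_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

# Total on-device memory of the first accelerator, assuming a CUDA GPU.
gpu_bytes = torch.cuda.get_device_properties(0).total_memory

print(f"Host memory:  {host_bytes / 1e9:.1f} GB")
print(f"GPU 0 memory: {gpu_bytes / 1e9:.1f} GB")
```

On a typical training node, host memory runs to hundreds of gigabytes while a single accelerator holds only tens of gigabytes; this gap, together with the bandwidth gap between intra-machine and cross-machine links, drives the bottlenecks we analyze in the rest of this chapter.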