Chapter 3: Building a Data Parallel Training and Serving Pipeline
In the previous chapter, we discussed the two mainstream data parallel training paradigms: parameter server and All-Reduce. Due to the shortcomings of the parameter server paradigm, All-Reduce has become the dominant architecture for data parallel training, so we will use it for the implementations in this chapter.
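As a brief reminder of what All-Reduce achieves, here is a minimal single-process sketch. It is only a simulation under stated assumptions: the function name `all_reduce_mean` and the toy gradient values are hypothetical, and no real collective communication library is involved. The point is just the end state: every worker holds the same averaged gradient.

```python
from typing import List


def all_reduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate All-Reduce averaging (hypothetical helper, not a library API).

    Each inner list is one worker's local gradient. After the call, every
    worker holds an identical copy of the element-wise mean gradient.
    """
    num_workers = len(worker_grads)
    # Reduce step: sum the gradients element-wise across all workers.
    summed = [sum(values) for values in zip(*worker_grads)]
    # Average so the result does not scale with the number of workers.
    averaged = [s / num_workers for s in summed]
    # Broadcast step: every worker receives its own copy of the result.
    return [list(averaged) for _ in range(num_workers)]


# Toy example: three workers, each with a two-element local gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = all_reduce_mean(grads)
```

In a real multi-node job, the reduce and broadcast steps are carried out over the network by a collective communication backend instead of a Python loop.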
In this chapter, we will focus mainly on the coding side of data parallelism. Before we dive into the details, here are the assumptions behind the implementations in this chapter:
- We will use homogeneous hardware for all our training nodes.
- All our training nodes will be exclusively used for a single job, which means no resource sharing in multi-tenant clusters.
- The number of accelerators will always be sufficient for our needs.
First, we will describe the entire training pipeline and highlight the major components, which include data preprocessing, data...