To distribute the training of a single model across multiple devices or nodes, the following strategies are available:
- Model Parallel: Divide the model into multiple subgraphs and place each subgraph on a separate node or device. The subgraphs perform their own computation and exchange variables as required (see the device-placement sketch after this list).
- Data Parallel: Divide the data into batches and run the same model on multiple nodes or devices, combining the parameters on a master node. The worker nodes thus train the model on their batches of data and send the parameter updates to the master node, also known as the parameter server.
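As an illustration of the model parallel strategy, the following is a minimal sketch that pins two subgraphs to different devices with explicit device placement. The device strings, layer sizes, and the use of the v1-style graph API are assumptions made for this example; on a real cluster the device strings would name remote jobs and tasks (for example, `/job:worker/task:1`).

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Subgraph 1 on the first device: the input and the first dense layer.
with tf.device("/device:GPU:0"):
    x = tf.random_normal([32, 128])
    hidden = tf.layers.dense(x, 256, activation=tf.nn.relu)

# Subgraph 2 on the second device: the output layer.
# TensorFlow inserts the send/recv ops that exchange the 'hidden' tensor.
with tf.device("/device:GPU:1"):
    output = tf.layers.dense(hidden, 10)

# allow_soft_placement lets the example fall back to available devices
# on machines that do not have two GPUs.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(output).shape)  # (32, 10)
```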
The preceding diagram shows the data parallel approach, where the model replicas read partitions of the data in batches and send parameter updates to the parameter servers, and the parameter servers send the updated parameters back to the model replicas.
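To make the roles concrete, here is a hedged sketch of how such a cluster could be described with the classic (v1-style) parameter server API; the localhost addresses, task counts, and variable shapes are placeholders chosen for the example rather than values from the text.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# The cluster definition: one parameter server and two workers.
# Addresses are placeholders; a real deployment would use the actual hosts.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process starts a server for its own role; shown here for worker 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places the variables on the ps tasks and the ops on
# this worker, so parameter updates are sent to the parameter server and the
# updated parameters are read back automatically.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.placeholder(tf.float32, shape=[None, 784])
    w = tf.get_variable("w", shape=[784, 10])
    b = tf.get_variable("b", shape=[10])
    logits = tf.matmul(x, w) + b
```

In this placement scheme each worker holds a model replica and runs the training ops, while the variables live on the ps job; synchronous and asynchronous training differ only in how the updates applied to those shared variables are combined.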