In the previous chapter, we learned how to run TensorFlow models at scale in production using Kubernetes, Docker, and TensorFlow Serving. TensorFlow Serving, however, is not the only way to run TensorFlow models at scale. TensorFlow also provides a mechanism to not only run but also train models across multiple devices, whether those devices sit on the same node or on different nodes. In Chapter 1, TensorFlow 101, we learned how to place variables and operations on different devices. In this chapter, we shall learn how to distribute TensorFlow models so that they run on multiple devices across multiple nodes.
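As a quick reminder of the device placement mechanism from Chapter 1, the following minimal sketch pins variables to the CPU and an operation to a GPU. The device strings and the sample tensors are purely illustrative, and `allow_soft_placement` lets TensorFlow fall back to the CPU when no GPU is actually present on the node:

```python
import tensorflow as tf

# Pin the variables to the CPU of the local node
with tf.device('/cpu:0'):
    w = tf.Variable([[1.0, 2.0]], name='w')
    b = tf.Variable([[3.0]], name='b')

# Pin the compute-heavy operation to the first GPU (illustrative;
# falls back to CPU via allow_soft_placement if no GPU exists)
with tf.device('/gpu:0'):
    y = tf.matmul(w, tf.transpose(w)) + b

config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))
```

Setting `log_device_placement=True` prints the device assigned to each operation, which is a handy way to verify placements before scaling out to multiple nodes.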
In this chapter, we shall cover the following topics:
- Strategies for distributed execution
- TensorFlow clusters
- Data parallel models
- Asynchronous and synchronous updates to distributed models