We trained a simple model on a very small dataset. Larger models and datasets require more computing power, which often means multiple servers. The tf.distribute.Strategy API defines how multiple machines communicate to train a model efficiently.
Some of the strategies defined by TensorFlow are as follows:
- MirroredStrategy: For training on multiple GPUs on a single machine. Model weights are kept in sync between each device.
- MultiWorkerMirroredStrategy: Similar to MirroredStrategy, but for training on multiple machines (see the configuration sketch after this list).
- ParameterServerStrategy: For training on multiple machines. Instead of syncing the weights on each device, they are kept on a parameter server.
- TPUStrategy: For training on Google's Tensor Processing Unit (TPU) chip.
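For the multi-machine strategies, each worker typically discovers the cluster through the TF_CONFIG environment variable, set before the strategy is created. The following is a minimal sketch for MultiWorkerMirroredStrategy; the hostnames, ports, and worker index are placeholders that would differ on each machine:

```python
import json
import os

import tensorflow as tf

# Each machine describes the whole cluster and its own role in it.
# The hostnames and ports below are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:12345", "worker1.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0}  # this machine is worker 0
})

# The strategy reads TF_CONFIG to find and connect to the other workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
```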
The TPU is a custom chip made by Google, similar to a GPU, designed specifically to run neural network computations. It is available through Google Cloud.
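As a sketch, connecting to a Cloud TPU usually goes through a cluster resolver before the strategy is created; the API names below are those of recent TensorFlow 2 releases:

```python
import tensorflow as tf

# Locate the TPU; an empty tpu argument works when the TPU address
# is supplied by the environment (for example, on a Cloud TPU VM).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.TPUStrategy(resolver)
```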
To use a distribution strategy, you instantiate it and build your model within its scope.
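A minimal sketch of that pattern, assuming a small Keras classifier similar to the one trained earlier (the layer sizes and training data names are illustrative):

```python
import tensorflow as tf

# Create the strategy; MirroredStrategy uses all visible GPUs by default.
strategy = tf.distribute.MirroredStrategy()

# Variables created inside the scope are mirrored across the devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Training code is unchanged; each batch is split across the devices.
# model.fit(x_train, y_train, epochs=5)
```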