Scaling up and out – Multi-machine, multi-GPU training
To reach the largest scale in distributed training of deep learning models, we need the ability to leverage compute resources across multiple GPUs and multiple machines. This can significantly reduce the time it takes to iterate on and develop new models and architectures for the problem you are trying to solve. With easy access to cloud computing services such as Microsoft Azure, Amazon AWS, and Google’s GCP, renting multiple GPU-equipped machines at an hourly rate has become easier and much more common, and it is often more economical than setting up and maintaining your own multi-GPU, multi-machine cluster. This recipe provides a quick walkthrough of training deep models using TensorFlow 2.x’s multi-worker mirrored distributed execution strategy, based on the official documentation, which you can easily adapt to your own use cases. For the multi-machine, multi-GPU distributed training example...
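Before the full example, the following is a minimal sketch of how multi-worker training is typically wired up with tf.distribute.MultiWorkerMirroredStrategy. The worker addresses, port, and the small MNIST model are placeholder assumptions for illustration; in practice, each machine sets its own TF_CONFIG (with its own "index") before the strategy is created, and the same script is launched on every worker.

```python
# Minimal sketch: multi-worker mirrored training (assumes two worker machines).
import json
import os

import tensorflow as tf

# Each worker describes the cluster and its own role via TF_CONFIG.
# On the second machine, "index" would be 1 instead of 0.
# The host addresses and port below are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# The strategy reads TF_CONFIG and synchronizes variables and gradients
# across all GPUs on all workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def build_and_compile_model():
    # Any Keras model works; a small MLP keeps the sketch short.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model

# Model creation and compilation must happen inside the strategy's scope
# so that variables are created as replicated (mirrored) variables.
with strategy.scope():
    model = build_and_compile_model()

# The global batch size is split across the participating workers.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, epochs=3, batch_size=64)
```

Note that this sketch will block at strategy creation until every worker listed in TF_CONFIG has started, which is the expected behavior for synchronous multi-worker training.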