Using the Horovod distributed learning library in Azure Databricks
horovod is a library for distributed deep learning training. It supports commonly used frameworks such as TensorFlow, Keras, and PyTorch. As mentioned before, it is based on the tensorflow-allreduce library and implements the ring allreduce algorithm to ease the migration from training on a single graphics processing unit (GPU) to distributed training on multiple GPUs in parallel.
To do this, we adapt the single-GPU training script of a deep learning model to use the horovod library during training. Once adapted, the script can run on a single GPU or on multiple GPUs without further changes to the code.
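As a minimal sketch of what this adaptation can look like, the following Keras training function adds the usual Horovod steps: initialize the library, pin each process to one GPU, wrap the optimizer, and broadcast the initial weights from the first worker. The function name train_hvd, the helper scaled_learning_rate, the MNIST dataset, and the small model are assumptions chosen for illustration and are not taken from the text. Scaling the learning rate by the number of workers is a common convention, not a requirement.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Hypothetical helper: scale the learning rate linearly with the worker count."""
    return base_lr * num_workers


def train_hvd(base_lr=0.001, epochs=1):
    """Illustrative training function; the model and dataset are placeholders."""
    # Imports live inside the function so it can be shipped to worker processes.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # 1. Initialize Horovod (one process per GPU).
    hvd.init()

    # 2. Pin this process to a single GPU, chosen by its local rank.
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # 3. Give each worker its own shard of the training data.
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.astype("float32") / 255.0
    x, y = x[hvd.rank()::hvd.size()], y[hvd.rank()::hvd.size()]

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # 4. Wrap the optimizer so gradients are averaged across workers.
    optimizer = tf.keras.optimizers.Adam(scaled_learning_rate(base_lr, hvd.size()))
    optimizer = hvd.DistributedOptimizer(optimizer)
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # 5. Start every worker from the same weights (those of rank 0).
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    model.fit(x, y, batch_size=64, epochs=epochs, callbacks=callbacks,
              verbose=2 if hvd.rank() == 0 else 0)
    return model
```

With one process, hvd.size() is 1 and the function behaves like an ordinary single-GPU script. With several processes, the same code trains in parallel. On Azure Databricks, a function of this shape is typically passed to HorovodRunner from the sparkdl package, for example HorovodRunner(np=2).run(train_hvd), although the exact launch setup depends on the cluster and runtime version.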
The horovod library uses a data-parallel strategy: training is distributed efficiently across multiple GPUs working in parallel, and the ring allreduce algorithm is used to overcome the communication bottleneck of exchanging gradients between them.
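To make the idea behind ring allreduce more concrete, here is a small pure-Python simulation. It is only a sketch of the textbook algorithm, not Horovod's actual implementation, which runs on top of communication backends rather than Python lists. The function name ring_allreduce and the list-based "workers" are assumptions made for illustration. Each worker talks only to its neighbor in the ring and sends one chunk per step, which is why no single node has to receive all the gradients at once.

```python
def ring_allreduce(worker_grads):
    """Simulate ring allreduce (sum) over equal-length gradient lists, one per worker."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")

    # Split every gradient vector into n contiguous chunks.
    bounds = [(c * length) // n for c in range(n + 1)]
    bufs = [list(g) for g in worker_grads]  # per-worker working copies

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        msgs = []
        for i in range(n):
            c = (i - step) % n
            msgs.append((c, [bufs[i][k] for k in chunk(c)]))
        for i, (c, payload) in enumerate(msgs):
            dst = (i + 1) % n  # right-hand neighbor in the ring
            for k, v in zip(chunk(c), payload):
                bufs[dst][k] += v

    # Phase 2, allgather: circulate the finished chunks around the ring
    # until every worker has all of them.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i + 1 - step) % n
            msgs.append((c, [bufs[i][k] for k in chunk(c)]))
        for i, (c, payload) in enumerate(msgs):
            dst = (i + 1) % n
            for k, v in zip(chunk(c), payload):
                bufs[dst][k] = v

    return bufs


# Three simulated workers, each with its own local gradient vector.
grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(grads)
print(reduced[0])  # every worker ends with [111, 222, 333, 444]
```

Dividing each summed value by the number of workers would give the averaged gradient that each worker then applies in a data-parallel training step.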
The library is implemented so that each GPU gets a mini-batch of data...