Training a model using Horovod
Although we first introduced Horovod in the context of SageMaker, Horovod is a standalone framework for distributed training (https://horovod.ai/). It aims to provide a simple way to train models in a distributed fashion by offering clean integrations with popular DL frameworks, including TensorFlow and PyTorch.
As mentioned previously in the SageMaker with Horovod section, the core principles of Horovod are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall (https://horovod.readthedocs.io/en/stable/concepts.html).
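These MPI terms can be made concrete with a small simulation. The sketch below models a hypothetical cluster of two hosts with two processes each (the host names and layout are illustrative, not from the Horovod documentation) and shows what size, rank, local rank, and an allreduce mean; in a real Horovod job, each entry would be a separate training process and the allreduce would average gradients across them:

```python
# Conceptual illustration of Horovod's MPI-style terms, simulated in
# plain Python (no Horovod or MPI installation required).

# Hypothetical cluster layout: 2 hosts with 2 processes (e.g., GPUs) each.
hosts = {"host-0": 2, "host-1": 2}  # host name -> processes on that host

# Enumerate the processes the way MPI would identify them.
processes = []
rank = 0  # globally unique process ID across the whole cluster
for host, num_procs in hosts.items():
    for local_rank in range(num_procs):  # process ID within one host
        processes.append({"host": host, "rank": rank, "local_rank": local_rank})
        rank += 1

size = len(processes)  # total number of processes, i.e., hvd.size()

# allreduce: every process contributes a value, and every process
# receives the same aggregate. Horovod averages gradients this way
# after each backward pass.
gradients = [float(p["rank"] + 1) for p in processes]  # one value per rank
allreduced = sum(gradients) / size  # identical result on every rank

print(size)        # total processes in the cluster
print(allreduced)  # averaged value shared by all ranks
```

Note that rank is unique across the cluster, while local rank restarts at 0 on each host; Horovod typically uses the local rank to pin a process to one GPU on its host.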
In this section, we will learn how to set up a Horovod cluster using EC2 instances. Then, we will describe the modifications you need to make to your TensorFlow and PyTorch scripts to train your model on the Horovod cluster.
Setting up a Horovod cluster
To set up a Horovod cluster using EC2 instances, you must follow these steps:
- Go to the EC2 instance console: https://console...