Chapter 12: Distributed Deep Learning in Azure Databricks
In the previous chapter, we learned how to serialize machine learning pipelines effectively and manage the full development life cycle of machine learning models in Azure Databricks. This chapter focuses on how to apply distributed training in Azure Databricks.
Distributed training of deep learning models is a technique in which the training process is spread across the workers of a cluster of computers. This is not trivial: the implementation requires us to fine-tune how the workers communicate and exchange data, because otherwise the communication overhead can make distributed training slower than training on a single machine. Azure Databricks Runtime for Machine Learning includes Horovod, a library that addresses most of the issues that arise when distributing the training of deep learning algorithms. We will also show how we can leverage the native Spark support of the TensorFlow machine learning framework...
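To make the idea of workers exchanging data more concrete, the following is a minimal sketch of synchronous data-parallel training in plain Python. It does not use the Horovod or Spark APIs. The "workers" are simply slices of a list, and the names local_gradient, allreduce_mean, and train are hypothetical helpers introduced only for this illustration. The averaging step stands in for the allreduce operation that libraries such as Horovod perform over the network.

```python
# Pure-Python simulation of synchronous data-parallel training.
# Assumption: no real cluster is involved; each "worker" is just a data shard.

def local_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker receives the same average."""
    return sum(values) / len(values)

def train(shards, steps=200, lr=0.01):
    w = 0.0  # every worker starts from the same weight
    for _ in range(steps):
        # On a real cluster, each worker would compute its gradient in parallel.
        grads = [local_gradient(w, shard) for shard in shards]
        # All workers apply the same averaged gradient, so replicas stay in sync.
        w -= lr * allreduce_mean(grads)
    return w

# Toy data generated from y = 3x, split evenly across two simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]

w_distributed = train(shards)   # two simulated workers
w_single = train([data])        # single-machine baseline
print(w_distributed, w_single)
```

With equally sized shards, the average of the per-worker gradients equals the full-batch gradient, so the two results agree up to floating-point rounding. The only extra cost in the distributed case is the averaging step, which is the part that becomes expensive when workers communicate over a network.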