Loading data for deep learning
In this chapter, we will learn how to prepare data for distributed training. Specifically, we will learn how to efficiently load data into deep learning applications that can leverage the distributed computing capabilities of Azure Databricks while handling large amounts of data. We will describe two methods at our disposal for feeding large datasets into distributed training: Petastorm, an open-source data access library, and TFRecord, TensorFlow's binary record format. Both simplify loading large and complex datasets into our deep learning algorithms in Azure Databricks.
At a quick glance, the main characteristics of Petastorm and TFRecord are as follows:
- Petastorm: An open-source library, originally developed by Uber, that lets us load data stored in the Apache Parquet format directly into our deep learning training loops. This pairs well with Azure Databricks because Parquet is a widely used format when working with large amounts of data; a short usage sketch follows this list.
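To make this concrete, here is a minimal sketch of Petastorm's Spark converter API feeding a Parquet-backed DataFrame to TensorFlow. It assumes a Databricks notebook where a `spark` session already exists and `petastorm` and `tensorflow` are installed on the cluster; the cache directory and Parquet path are hypothetical placeholders:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame to a cache directory before training,
# so a parent cache location must be configured first (path is a placeholder).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

# Read a Parquet dataset into a Spark DataFrame (path is hypothetical).
df = spark.read.parquet("/tmp/example/training_data.parquet")

# Wrap the DataFrame in a Petastorm converter.
converter = make_spark_converter(df)

# Expose the data as a tf.data.Dataset and pull one batch to inspect it.
with converter.make_tf_dataset(batch_size=32) as dataset:
    for batch in dataset.take(1):
        print(batch)
```

The converter writes the DataFrame to the cache once and then streams batches into the training loop, which is what makes this approach practical for datasets that do not fit in memory on a single node.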