Using Petastorm for distributed learning
Petastorm is an open source library for single-node or distributed training of machine learning and deep learning models on datasets stored as Apache Parquet files. It supports popular frameworks such as PyTorch, TensorFlow, and PySpark, and it can also be used from other Python applications. Petastorm provides a simple function that augments a Parquet dataset with Petastorm-specific metadata so that the data can be used in model training. To read the data, we create a reader object that points to a location in the Databricks File System and iterate over it. Under the hood, Petastorm uses the PyArrow library to read the Parquet files.
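The reader pattern described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names dbfs_to_file_url and iterate_parquet are hypothetical, and the sketch assumes that the Parquet files are reachable through the local /dbfs mount on the cluster and that Petastorm is installed there.

```python
def dbfs_to_file_url(path: str) -> str:
    """Hypothetical helper: map a DBFS path to the file:// URL of its local mount."""
    if path.startswith("dbfs:/"):
        # Assumption: dbfs:/x is exposed on the node as /dbfs/x.
        path = "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    if not path.startswith("/dbfs/"):
        raise ValueError(f"expected a DBFS path, got {path!r}")
    return "file://" + path


def iterate_parquet(path: str, num_epochs: int = 1):
    """Hypothetical helper: yield batches from a plain Parquet dataset."""
    # Imported lazily so that dbfs_to_file_url works without Petastorm installed.
    from petastorm import make_batch_reader

    # make_batch_reader reads standard Parquet stores; the reader is a
    # context manager and an iterator over batches of rows.
    with make_batch_reader(dbfs_to_file_url(path), num_epochs=num_epochs) as reader:
        for batch in reader:
            yield batch
```

Each batch yielded by the reader is a named tuple whose fields correspond to the columns of the Parquet dataset, so a training loop can pick out the feature and label columns by name.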
In this section, we will discuss how to use Petastorm to improve the performance of our machine learning and deep learning training pipelines in Azure Databricks.
Introducing Petastorm
As mentioned before, Petastorm is an open source library that enables a single machine...