Chapter 7: Using the Arrow Datasets API
In the current ecosystem of data lakes and lakehouses, many datasets are no longer a single file but huge collections of files arranged in partitioned directory structures. To facilitate this workflow, the Arrow libraries provide an API for interacting with these kinds of structured and unstructured data. This is called the Datasets API, and it is designed to do much of the heavy lifting of querying such datasets for you.
The Datasets API provides a series of utilities for working with large, distributed, and possibly partitioned datasets that are spread across multiple files. It also integrates with the Compute APIs we covered in Chapter 6, Leveraging the Arrow Compute APIs.
In this chapter, we will learn how to use the Arrow Datasets API for efficient querying of multifile, tabular datasets regardless of their location or format. We will also understand how to use the dataset classes and...