Using the Arrow Datasets API
In the current ecosystem of data lakes and lakehouses, many datasets are huge collections of files in partitioned directory structures rather than a single file. To support this workflow, the Arrow libraries provide an API for interacting with this kind of multifile data, whether structured or unstructured. It is called the Datasets API, and it is designed to do much of the heavy lifting of querying such datasets for you.
The Datasets API provides a set of utilities for interacting with large, distributed, and possibly partitioned datasets that are spread across multiple files. It also builds on the Compute APIs and integrates with Acero, which we covered in Chapter 5, Acero: A Streaming Arrow Execution Engine.
In this chapter, we will learn how to use the Arrow Datasets API for efficient querying of multifile, tabular datasets regardless of their location or format. We will also learn how to use the dataset...