Querying multifile datasets
To make queries fast, modern datasets are often partitioned into multiple files spread across multiple directories. Many engines and utilities read and write data in this layout or take advantage of it, such as Apache Hive, Dremio Sonar, Presto, and many AWS services. The Arrow Datasets library provides functionality for working with these sorts of tabular datasets, such as the following:
- Providing a single, unified interface that supports different data formats and filesystems. As of version 17.0.0 of Arrow, this includes Parquet, ORC, Feather (or Arrow IPC), JSON, and CSV files that are either local or stored in the cloud, such as S3 or HDFS.
- Discovering sources by crawling partitioned directories and providing some simple normalizing of schemas between different files.
- Filtering rows efficiently through predicate pushdown, along with optimized column projection and parallel reading.
Using the trusty NYC...