In-Memory Analytics with Apache Arrow: Accelerate data analytics for efficient processing of flat and hierarchical data structures

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781835461228

Length 406 pages

Edition 2nd Edition

Languages

Python

Tools

Apache Arrow

Concepts

Data Engineering

Author (1):

Matthew Topol

Table of Contents (18) Chapters

Preface

1. Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals

2. Chapter 1: Getting Started with Apache Arrow FREE CHAPTER

3. Chapter 2: Working with Key Arrow Specifications

4. Chapter 3: Format and Memory Handling

5. Part 2: Interoperability with Arrow: The Power of Open Standards

6. Chapter 4: Crossing the Language Barrier with the Arrow C Data API

7. Chapter 5: Acero: A Streaming Arrow Execution Engine

8. Chapter 6: Using the Arrow Datasets API

9. Chapter 7: Exploring Apache Arrow Flight RPC

10. Chapter 8: Understanding Arrow Database Connectivity (ADBC)

11. Chapter 9: Using Arrow with Machine Learning Workflows

12. Part 3: Real-World Examples, Use Cases, and Future Development

13. Chapter 10: Powered by Apache Arrow

14. Chapter 11: How to Leave Your Mark on Arrow

15. Chapter 12: Future Development and Plans

16. Index

17. Other Books You May Enjoy

Using the Arrow Datasets API

In the current ecosystem of data lakes and lakehouses, many datasets are no longer a single file but huge collections of files in partitioned directory structures. To support this workflow, the Arrow libraries provide an API for interacting with data stored this way, whether structured or unstructured. This is the Datasets API, and it is designed to do much of the heavy lifting of querying these datasets for you.

The Datasets API provides a series of utilities for easily interacting with large, distributed, and possibly partitioned datasets that are spread across multiple files. It also leverages the Compute APIs and integrates very easily with Acero, which we covered previously in Chapter 5, Acero: A Streaming Arrow Execution Engine.

In this chapter, we will learn how to use the Arrow Datasets API for efficient querying of multifile, tabular datasets regardless of their location or format. We will also learn how to use the dataset...
