What this book covers
Chapter 1, Getting Started with Apache Arrow, introduces you to the basic concepts underpinning Apache Arrow. It explains the Arrow format and the data types it supports, along with how they are represented in memory. Afterward, you’ll set up your development environment and run some simple code examples demonstrating the basic operation of the Arrow libraries.
Chapter 2, Working with Key Arrow Specifications, continues your introduction to Apache Arrow by explaining how to read both local and remote data files in different formats. You’ll learn how to integrate Arrow with the Python pandas and Polars libraries and how to utilize the zero-copy aspects of Arrow to share memory for performance.
Chapter 3, Format and Memory Handling, discusses the relationships between Apache Arrow and Apache Parquet, Feather, Protocol Buffers, JSON, and CSV data, along with when and why to use these different formats. Following this, the Arrow IPC format is introduced and described, along with an explanation of using memory mapping to further improve performance. Finally, we wrap up with the basics of leveraging Arrow on a GPU.
Chapter 4, Crossing the Language Barrier with the Arrow C Data API, introduces the titular C Data API for efficiently passing Apache Arrow data between different language runtimes and devices. This chapter covers the struct definitions the interface uses and describes the use cases where it is beneficial.
Chapter 5, Acero: A Streaming Arrow Execution Engine, describes how to utilize Acero, the reference implementation of an Arrow compute engine. You’ll learn when and why you should use a compute engine to perform analytics rather than implementing something yourself, and why Arrow is showing up in so many popular execution engines.
Chapter 6, Using the Arrow Datasets API, demonstrates querying, filtering, and otherwise interacting with multi-file datasets that can potentially span multiple sources. Partitioned datasets are also covered, along with utilizing Acero to perform streaming filtering and other operations on the data.
Chapter 7, Exploring Apache Arrow Flight RPC, examines the Flight RPC protocol and its benefits. You will be walked through building a simple Flight server and client in multiple languages to produce and consume tabular data.
Chapter 8, Understanding Arrow Database Connectivity (ADBC), introduces and explains an Apache Arrow-based alternative to ODBC/JDBC and why it matters for the ecosystem. You will be walked through several code examples that interact with database systems such as DuckDB and PostgreSQL.
Chapter 9, Using Arrow with Machine Learning Workflows, brings together concepts covered in earlier chapters to explain the various ways that Apache Arrow can improve data pipelines and the performance of machine learning model training. It describes how Arrow’s interoperability and defined standards make it ideal for use with Spark, GPU compute, and many other tools.
Chapter 10, Powered by Apache Arrow, provides a few examples of current real-world usage of Apache Arrow, such as Dremio, Spice.AI, and InfluxDB.
Chapter 11, How to Leave Your Mark on Arrow, provides a brief introduction to contributing to open source projects in general, and to the Arrow project in particular. You will be walked through finding starter issues, submitting your first pull request, and what to expect when doing so. To that end, this chapter also includes instructions for building the Arrow C++, Python, and Go libraries from source so that you can test your contributions locally.
Chapter 12, Future Development and Plans, wraps up the book by examining features that are still in development at the time of writing. These include geospatial integrations with GeoArrow and GeoParquet, along with expanding ADBC adoption. Finally, there are some parting words and a challenge from me to you.