In-Memory Analytics with Apache Arrow: Accelerate data analytics for efficient processing of flat and hierarchical data structures , Second Edition

Matthew Topol

$35.98 ~~$39.99~~

5 (6 Ratings)

eBook Sep 2024 406 pages 2nd Edition

Matthew Topol

$35.98 ~~$39.99~~

5 (6 Ratings)

eBook Sep 2024 406 pages 2nd Edition

What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

AI Assistant (beta) to help accelerate your learning

View table of contents

Preview Book

Download Code

Key benefits

Explore Apache Arrow's data types and integration with pandas, Polars, and Parquet
Work with Arrow libraries such as Flight SQL, Acero compute engine, and Dataset APIs for tabular data
Enhance and accelerate machine learning data pipelines using Apache Arrow and its subprojects
Purchase of the print or Kindle book includes a free PDF eBook

Description

Apache Arrow is an open source, columnar in-memory data format designed for efficient data processing and analytics. This book harnesses the author’s 15 years of experience to show you a standardized way to work with tabular data across various programming languages and environments, enabling high-performance data processing and exchange. This updated second edition gives you an overview of the Arrow format, highlighting its versatility and benefits through real-world use cases. It guides you through enhancing data science workflows, optimizing performance with Apache Parquet and Spark, and ensuring seamless data translation. You’ll explore data interchange and storage formats, and Arrow's relationships with Parquet, Protocol Buffers, FlatBuffers, JSON, and CSV. You’ll also discover Apache Arrow subprojects, including Flight, SQL, Database Connectivity, and nanoarrow. You’ll learn to streamline machine learning workflows, use Arrow Dataset APIs, and integrate with popular analytical data systems such as Snowflake, Dremio, and DuckDB. The latter chapters provide real-world examples and case studies of products powered by Apache Arrow, providing practical insights into its applications. By the end of this book, you’ll have all the building blocks to create efficient and powerful analytical services and utilities with Apache Arrow.

Who is this book for?

This book is for developers, data engineers, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. Whether you’re building utilities for data analytics and query engines, or building full pipelines with tabular data, this book can help you out regardless of your preferred programming language. A basic understanding of data analysis concepts is needed, but not necessary. Code examples are provided using C++, Python, and Go throughout the book.

What you will learn

Use Apache Arrow libraries to access data files, both locally and in the cloud
Understand the zero-copy elements of the Apache Arrow format
Improve the read performance of data pipelines by memory-mapping Arrow files
Produce and consume Apache Arrow data efficiently by sharing memory with the C API
Leverage the Arrow compute engine, Acero, to perform complex operations
Create Arrow Flight servers and clients for transferring data quickly
Build the Arrow libraries locally and contribute to the community

What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

AI Assistant (beta) to help accelerate your learning

Frequently bought together

Database Design and Modeling with PostgreSQL and MySQL

$34.99

$39.99

$49.99

Total $ 124.97

Filter reviews by

All

Amazon verified reviews

Will Ayd Oct 15, 2024

I've worked in the open source analytics space for a while, and have always had a vague understanding of what the Apache Arrow project is. However, trying to piece the bigger picture together from the official documentation is challenging and leaves a lot to be desired.This book helped me understand core Arrow technologies like Acero, Flight, Flight SQL, and ADBC at a much deeper level. The detailed description of the Arrow array format is a great resource for developers, and the fact that examples are provided in C++, Python, and Go makes it easy to put the theory of Arrow into action into a language of your choice.I highly recommend this book to data engineers. Whether they are looking to build Arrow-based systems or just want to understand the technology better, this book is must-read.

Amazon Verified review

Martin Kysel Oct 03, 2024

I'm someone who prefers structured learning over trial and error. When exploring new technology, I always seek out resources like this book that not only cover what I might discover on my own, but also guide me through the unknowns I wouldn't have thought to search for. I had heard of the Apache Arrow ecosystem but knew very little about it. This book has been incredibly helpful in navigating its complexities, including ADBC, Parquet, Flight, and much more. I am likely to come back to this book and use it as a reference in my future dealings with Apache Arrow.

Nic Sep 30, 2024

Matt does a great job of defining complex terminology in a way that makes the ideas accessible to readers. His tone is technical but conversational which makes this book enjoyable to read despite the highly technical nature of the content. I like the way the book covers both PyArrow and C++ so we can really see how Arrow is a standard with multiple implementations. I am a developer on the Apache Arrow project and have used a previous edition of this book to get to grips with some complex ideas when I found the project docs a bit too dense. Thanks Matt!

Nicolay Gerold Sep 30, 2024

I am atm in process of developing an open source tool on top of arrow. This book is what got me started and helped me understand how Arrow works under the hood and how to work with it. Now also with a section on ML!!!!

JT Oct 12, 2024

As Arrow moves from up-and-coming to the de facto standard in data formats for in-memory and over-the-wire tabular data, the second, revised edition of Matt's book is a necessary read for anybody in data engineering, analytics and data science.The book serves both as a technical, hands-on introduction to Arrow and its rich ecosystem - PyArrow, Flight, Acero etc. - and as a reference for intermediate-to-advanced use cases - FFI, Arrow IPC, etc. The book guides the reader through the motivations behind the design choices, and explain how to effectively solve common data management problems leveraging Arrow strengths (code snippets are also available on GitHub!).As an entrepreneur building in the Arrow ecosystem, this is the book I give to every new hire to quickly build an accurate mental model of data best practices.

In-Memory Analytics with Apache Arrow: Accelerate data analytics for efficient processing of flat and hierarchical data structures , Second Edition

What do you get with eBook?