You're reading from In-Memory Analytics with Apache Arrow: Accelerate data analytics for efficient processing of flat and hierarchical data structures

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781835461228

Length 406 pages

Edition 2nd Edition

Languages

Python

Tools

Apache Arrow

Concepts

Data Engineering

Author (1):

Matthew Topol

Table of Contents (18) Chapters

Preface

1. Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals

2. Chapter 1: Getting Started with Apache Arrow

3. Chapter 2: Working with Key Arrow Specifications

4. Chapter 3: Format and Memory Handling

5. Part 2: Interoperability with Arrow: The Power of Open Standards

6. Chapter 4: Crossing the Language Barrier with the Arrow C Data API

7. Chapter 5: Acero: A Streaming Arrow Execution Engine

8. Chapter 6: Using the Arrow Datasets API

9. Chapter 7: Exploring Apache Arrow Flight RPC

10. Chapter 8: Understanding Arrow Database Connectivity (ADBC)

11. Chapter 9: Using Arrow with Machine Learning Workflows

12. Part 3: Real-World Examples, Use Cases, and Future Development

13. Chapter 10: Powered by Apache Arrow

14. Chapter 11: How to Leave Your Mark on Arrow

15. Chapter 12: Future Development and Plans

16. Index

17. Other Books You May Enjoy

What about non-CPU device data?

Toward the end of Chapter 3, Format and Memory Handling, I brought up the topic of using Arrow with GPUs and other non-CPU devices. This topic is increasingly important as analytical pre-processing workflows try to keep up with the demand for data from machine learning models. Data scientists commonly use several libraries for GPU-based analytics. The following are just a few examples:

Numba: An open source Just-In-Time (JIT) compiler to translate a subset of Python and NumPy into low-level machine code with options to parallelize Python code on CPUs and GPUs.
XGBoost: An open source library providing optimized distributed gradient boosting algorithms that also run on GPUs.
PyTorch: An open source machine learning library typically used for computer vision and natural language processing, which also supports running on NVIDIA GPUs for performance enhancement...
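To make the Numba description above a little more concrete, here is a minimal sketch of JIT-compiling a plain Python loop. The function name sum_of_squares is hypothetical and chosen only for illustration. The sketch assumes CPU compilation through Numba's njit decorator, and it falls back to a no-op decorator when Numba is not installed so that the snippet still runs. It does not touch a GPU.

```python
try:
    # Numba's njit decorator compiles the function to machine code
    # on its first call (nopython mode).
    from numba import njit
except ImportError:
    # Assumption for illustration: if Numba is unavailable, use a no-op
    # stand-in so the example still runs as ordinary Python.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # A simple numeric loop: the kind of code Numba can compile.
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(4))
```

The first call pays the compilation cost; later calls with the same argument types reuse the compiled code.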