Understanding the Arrow format and specifications
According to the Apache Arrow documentation [1]:
Well, that's a lot of technical jargon! Let's start from the top. Apache Arrow (just Arrow for brevity) is an open source project from the Apache Software Foundation that is released under the Apache License, Version 2.0 [2]. It was co-created by Dremio and Wes McKinney, the creator of pandas, and first released in 2016. To simplify, Arrow is a collection of libraries and specifications that make it easy to build high-performance software utilities for processing and transporting large datasets. It consists of a collection of libraries related to in-memory data processing, including specifications for memory layouts and protocols for sharing and efficiently transporting data between systems and processes. When we're talking about in-memory data processing, we are talking exclusively about the processing of data in RAM and eliminating slow data accesses wherever possible to improve performance. This is where Arrow excels and provides libraries to support this with utilities for streaming and transportation in order to speed up data access.
When working with data, there are two primary situations to consider, and each has different needs: the in-memory format and the on-disk format. When data is stored on disk, the biggest concerns are the size of the data and the input/output (I/O) cost to read it into the main memory before you can operate on it. As a result, formats for data on disk tend to be focused much more on increasing I/O throughput, such as compressing the data to make it smaller and faster to read into memory. One example of this might be the Apache Parquet format, which is a columnar on-disk file format. Instead of being an on-disk format, Arrow's focus is the in-memory format case, which targets CPU efficiency as the goal, with numerous tactics such as cache locality and vectorization of computation.
The primary goal of Arrow is to essentially become the lingua franca of data analytics and processing, the One Framework to Rule Them All, so to speak. Different databases, programming languages, and libraries tend to implement and use their own separate internal formats for managing data, which means that any time you are moving data between these components for different uses, you're paying a cost to serialize and deserialize that data every time. Not only that, but lots of time and resources get spent reimplementing common algorithms and processing in those different data formats over and over. If instead, we can standardize on an efficient, feature-rich internal data format that can be widely adopted and used, this excess computation and development time is no longer necessary. Figure 1.1 shows a simplified diagram of multiple systems, each with its own data format, having to be copied and/or converted in order for the different components to work with each other:
In many cases, the serialization and deserialization can end up taking nearly 90% of the processing time in such a system rather than being able to spend that CPU on the analytics. Alternatively, if every component is using Arrow's in-memory format, you end up with a system as in Figure 1.2, where the data can be transferred between components at little-to-no cost. All of the components can either share memory directly or send the data as-is without having to convert between different formats.
At this point, there's no need for the different components and systems to implement custom connectors or re-implement common algorithms and utilities. The same libraries and connectors can be utilized, even across programming languages and process barriers, by sharing memory directly to refer to the same data rather than copying multiple times between them.
Most data processing systems now use distributed processing by breaking the data up into chunks and sending those chunks across the network to various workers, so even if we can share memory across processes on a box, there's still the cost to send it across the network. This brings us to the final piece of the puzzle: the format of raw Arrow data on the wire is the same as it is in memory. You can directly reference the memory buffers used for the network protocols without having to deserialize that data before you can use it, or reference the memory buffers you were operating on to send it across the network without having to serialize it first. Just a bit of metadata sent along with the raw data buffers and interfaces that perform zero-copies can be created in order to achieve performance benefits, by reducing memory usage and improving CPU throughput.
Let's quickly recap the features of the Arrow format we were just describing before moving on:
- Using the same high-performance internal format across components allows much more code reuse in libraries instead of reimplementing common workflows.
- The Arrow libraries provide mechanisms to directly share memory buffers to reduce copying between processes by using the same internal representation regardless of the language. This is what is being referred to whenever you see the term zero-copy.
- The wire format is the same as the in-memory format to eliminate serialization and deserialization costs when sending data across networks between components of a system.
Now, you might be thinking well this sounds too good to be true! and of course, being skeptical of promises like this is always a good idea. The community around Arrow has done a ton of work over the years to bring these ideas and concepts to fruition. The project itself provides and distributes libraries in a variety of different programming languages so that projects that want to incorporate and/or support the Arrow format don't need to implement it themselves. Above and beyond the interaction with Arrow-formatted data, the libraries provide a significant amount of utility in assisting with common processes such as data access and I/O-related optimizations. As a result, the Arrow libraries can be useful for projects, even if they don't actually utilize the Arrow format themselves.
Here's just a quick sample of use cases where using Arrow as the internal/intermediate data format can be very beneficial:
- SQL execution engines (such as Dremio Sonar, Apache Drill, or Impala)
- Data analysis utilities and pipelines (such as pandas or Spark)
- Streaming and message queue systems (such as Apache Kafka or Storm)
- Storage systems and formats (such as Apache Parquet, Cassandra, and Kudu)
As for how Arrow can help you, it depends on which piece of the data puzzle you personally work with. The following are a few different roles that work with data and show how using Arrow could potentially be beneficial; it's by no means a complete list though:
- If you're a data scientist:
- You can utilize Arrow via pandas and NumPy integration to significantly improve the performance of your data manipulations.
- If the tools you use integrate Arrow support, you can gain significant speed-ups to your queries and computations by using Arrow directly yourself to reduce copies and/or serialization costs.
- If you are a data engineer specializing in extract, transform, and load (ETL):
- The higher adoption of Arrow as an internal and externally-facing format can make it easier to integrate with many different utilities.
- By using Arrow, data can be shared between processes and tools with shared memory increasing the tools available to you for building pipelines, regardless of the language you're operating in. You could take data from Python and use it in Spark and then pass it directly to the Java Virtual Machine (JVM) without paying the cost of copying between them.
- If you are a software engineer or ML specialist building computation tools and utilities for data analysis:
- Arrow as an internal format can be used to improve your memory usage and performance by reducing serialization and deserialization between components.
- Understanding how to best utilize the data transfer protocols can improve the ability to parallelize queries and access your data, wherever it might be.
- Because Arrow can be used for any sort of tabular data, it can be integrated into many different areas of data analysis and computation pipelines, and is versatile enough to be beneficial as an internal and data transfer format, regardless of the shape of your data.
Now that you know what Arrow is, let's dig into its design and how it delivers on the aforementioned promises of high-performance analytics, zero-copy sharing, and network communication without serialization costs. First, you'll see why a column-oriented memory representation was chosen for Arrow's internal format. Afterward, in later chapters, we'll cover specific integration points, explicit examples, and transfer protocols.