Why does Arrow use a columnar in-memory format?
Most traditional data processing of tabular data will have its own custom data structures for representing and managing those datasets in memory while processing them, such as query engines and data services, for example. Of course, if there are custom data structures, this means it requires developing custom serialization protocols between file formats, network protocols, libraries, and any other interface you could think of. I can vouch from experience that the result is a huge amount of developer time and CPU cycles being wasted dealing with these various serialization schemes, rather than being able to spend it all on the analytical workloads. One goal of the Arrow project is for fewer systems to have to create their own data structures and utilize Arrow as their internal format. Doing so would allow those components to expose Arrow directly as a wire format and benefit from not having to pay a serialization or deserialization cost to pass the data around.
There is often a lot of debate surrounding whether a database should be row-oriented or column-oriented, but this primarily refers to the on-disk format of the underlying storage files. Arrow's data format is different from most cases discussed so far since it uses a columnar organization of data structures in memory directly. If you're not familiar with columnar as a term, let's take a look at what exactly it means. First, imagine the following table of data:
Traditionally, if you were to read this table into memory, you'd likely have some structure to represent a row and then read the data in one row at a time. Maybe something like struct { string archer; string location; int year }
. The result is that you have the memory grouped closely together for each row, which is great if you always want to read all the columns for every row. But, if this were a much bigger table, and you just wanted to find out the minimum and maximum years or any other column-wise analytics such as the unique locations, you would have to read the whole table into memory and then jump around in memory, skipping the fields you didn't care about so that you could read the value for each row of one column.
Most operating systems, while reading data into main memory and CPU caches, will attempt to make predictions about what memory it is going to need next. In our example table of archers, consider how many memory pages of data would have to be accessible and traversed to get a list of unique locations if the data were organized in row or column orientations:
A columnar format keeps the data organized by column instead of by row, as shown in the preceding figure. As a result, operations such as grouping, filtering, or aggregations based on column values become much more efficient to perform since the entire column is already contiguous in memory. Considering memory pages again, it's plain to see that for a large table, there would be significantly more pages that need to be traversed to get a list of unique locations from a row-oriented buffer than a columnar one. Fewer page faults and more cache hits mean increased performance and a happier CPU. Computational routines and query engines tend to operate on subsets of the columns for a dataset rather than needing every column for a given computation, making it significantly more efficient to operate on columnar data.
If you look closely at the construction of the column-oriented data buffer on the right side of Figure 1.4, you can see how it benefits the queries I mentioned earlier. If we wanted all the archers that are in Europe, we can easily scan through just the location column and discover which rows are the ones we want, and then spin through just the archer block and grab only the rows that correspond to the row indexes we found. This will come into play again when we start looking at the physical memory layout of Arrow arrays; since the data is column-oriented, it makes it easier for the CPU to predict instructions to execute and maintains this memory locality between instructions.
By keeping the column data contiguous in memory, it enables vectorization of the computations. Most modern processors have single instruction, multiple data (SIMD) instructions available that can be taken advantage of for speeding up computations and require having the data in a contiguous block of memory to operate on it. This concept can be found heavily utilized by graphics cards, and in fact, Arrow provides libraries to take advantage of Graphics Processing Units (GPUs) precisely because of this. Consider the example where you might want to multiply every element of a list by a static value, such as performing a currency conversion on a column of prices with an exchange rate:
From the figure, you can see the following:
- The left side of the figure shows that an ordinary CPU performing the computation in a non-vectorized fashion requires loading each value into a register, multiplying it with the exchange rate, and then saving the result back into RAM.
- On the right side of the figure, we see that vectorized computation, such as using SIMD, performs the same operation on multiple different inputs at the same time, enabling a single load to multiply and save to get the result for the entire group of prices. Being able to vectorize a computation has various constraints; often, one of those constraints is requiring the data being operated on to be in a contiguous chunk of memory, which is why columnar data is much easier to do this with.
SIMD versus Multithreading
If you're not familiar with SIMD, you may wonder how it differs from another parallelization technique: multithreading. Multithreading operates at a higher conceptual level than SIMD. Each thread has its own set of registers and memory space representing its execution context. These contexts could be spread across separate CPU cores or possibly interleaved by a single CPU core switching whenever it needs to wait for I/O. SIMD is a processor-level concept that refers to the specific instructions being executed. Put simply, multithreading is multitasking and SIMD is doing less work to achieve the same result.
Another benefit of utilizing column-oriented data comes into play when considering compression techniques. At some point, your data will become large enough that sending it across the network could become a bottleneck, purely due to size and bandwidth. With the data being grouped together in columns that are all the same type as contiguous memory, we end up with significantly better compression ratios than we would get with the same data in a row-oriented configuration, simply because data of the same type is easier to compress together than data of different types.