Accelerating pandas processing with Apache Arrow
When dealing with large amounts of data, such as in whole-genome sequencing, pandas can be both slow and memory-hungry. Apache Arrow provides faster and more memory-efficient implementations of several pandas operations and can interoperate with it.
Apache Arrow is a project co-founded by Wes McKinney, the creator of pandas. It has several objectives: it defines a language-agnostic format for working with tabular data, which allows for interoperability between languages, and it provides a memory- and computation-efficient implementation of that format. Here, we will only be concerned with the second part: getting more efficiency for large-data processing. We will do this in an integrated way with pandas.
Here, we will once again use VAERS data and show how Apache Arrow can be used to accelerate pandas data loading and reduce memory consumption.
Getting ready
Again, we will be using data from the first recipe. Be sure you download and prepare it, as explained in the Getting ready section of the Using pandas to process vaccine-adverse events recipe. The code is available in Chapter02/Arrow.py.
How to do it...
Follow these steps:
- Let’s start by loading the data using both pandas and Arrow:
```python
import gzip

import pandas as pd
from pyarrow import csv
import pyarrow.compute as pc

vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
columns = list(vdata_pd.columns)
vdata_pd.info(memory_usage="deep")

vdata_arrow = csv.read_csv("2021VAERSDATA.csv.gz")
tot_bytes = sum([
    vdata_arrow[name].nbytes
    for name in vdata_arrow.column_names])
print(f"Total {tot_bytes // (1024 ** 2)} MB")
```
pandas requires 1.3 GB, whereas Arrow requires 614 MB: less than half the memory. For a file this large, that may mean the difference between being able to process the data in memory and needing another solution, such as Dask. While some Arrow functions have names similar to their pandas counterparts (for example, read_csv), that is not the most common case. For example, note how we compute the total size of the Arrow table: by getting the size of each column and summing them, which is a different approach from pandas.
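For reference, pandas can report the same kind of total via memory_usage, and recent pyarrow versions also expose an nbytes property directly on the Table, so the per-column sum above is mostly illustrative. A minimal sketch, assuming the variables from the previous step are still in scope:

```python
# pandas: per-column bytes (plus the index), summed into a single total
pd_total = vdata_pd.memory_usage(index=True, deep=True).sum()

# Arrow: recent pyarrow versions expose nbytes on the whole Table,
# equivalent to summing the per-column sizes as we did above
arrow_total = vdata_arrow.nbytes

print(f"pandas: {pd_total // (1024 ** 2)} MB, Arrow: {arrow_total // (1024 ** 2)} MB")
```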
- Let’s do a side-by-side comparison of the inferred types:
```python
for name in vdata_arrow.column_names:
    arr_bytes = vdata_arrow[name].nbytes
    arr_type = vdata_arrow[name].type
    pd_bytes = vdata_pd[name].memory_usage(index=False, deep=True)
    pd_type = vdata_pd[name].dtype
    print(
        name,
        arr_type, arr_bytes // (1024 ** 2),
        pd_type, pd_bytes // (1024 ** 2))
```
Here is an abridged version of the output:
```
VAERS_ID int64 4 int64 4
RECVDATE string 8 object 41
STATE string 3 object 34
CAGE_YR int64 5 float64 4
SEX string 3 object 36
RPT_DATE string 2 object 20
DIED string 2 object 20
L_THREAT string 2 object 20
ER_VISIT string 2 object 19
HOSPITAL string 2 object 20
HOSPDAYS int64 5 float64 4
```
As you can see, Arrow's type inference is generally more specific: note, for instance, how text columns are stored as Arrow strings rather than generic Python objects. This is one of the main reasons why its memory usage is substantially lower.
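If you want to take control instead of relying on inference, pyarrow's CSV reader lets you override column types and dictionary-encode repetitive string columns (similar in spirit to pandas categoricals). The following is an illustrative sketch rather than part of the recipe; the choice of int16 for CAGE_YR is an assumption about the value range:

```python
import pyarrow as pa
from pyarrow import csv

convert_options = csv.ConvertOptions(
    # override type inference for a specific column (illustrative choice)
    column_types={"CAGE_YR": pa.int16()},
    # dictionary-encode low-cardinality string columns such as STATE or SEX
    auto_dict_encode=True,
)
vdata_arrow_typed = csv.read_csv(
    "2021VAERSDATA.csv.gz", convert_options=convert_options)
print(vdata_arrow_typed.schema)
```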
- Now, let’s do a time performance comparison:
```
%timeit pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
%timeit csv.read_csv("2021VAERSDATA.csv.gz")
```
On my computer, the results are as follows:
```
7.36 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.28 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Arrow's implementation is roughly three times faster. The results on your computer will vary, as they depend on your hardware.
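Note that %timeit is an IPython/Jupyter magic. If you are running the code as a plain script (for example, Chapter02/Arrow.py), a rough equivalent sketch using the standard timeit module, with a single run per reader to keep it short, would be:

```python
import timeit

pd_secs = timeit.timeit(
    lambda: pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1"),
    number=1)
arrow_secs = timeit.timeit(
    lambda: csv.read_csv("2021VAERSDATA.csv.gz"),
    number=1)
print(f"pandas: {pd_secs:.2f}s  Arrow: {arrow_secs:.2f}s")
```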
- Let's repeat the memory usage comparison while not loading the SYMPTOM_TEXT column. This is a fairer comparison, as most numerical datasets do not tend to have a very large text column:

```python
vdata_pd = pd.read_csv(
    "2021VAERSDATA.csv.gz", encoding="iso-8859-1",
    usecols=lambda x: x != "SYMPTOM_TEXT")
vdata_pd.info(memory_usage="deep")

columns.remove("SYMPTOM_TEXT")
vdata_arrow = csv.read_csv(
    "2021VAERSDATA.csv.gz",
    convert_options=csv.ConvertOptions(include_columns=columns))
vdata_arrow.nbytes
```
pandas requires 847 MB, whereas Arrow requires 205 MB: four times less.
- Our objective is to use Arrow to load data into pandas. For that, we need to convert the data structure:
```python
vdata = vdata_arrow.to_pandas()
vdata.info(memory_usage="deep")
```
There are two very important points to be made here. The first is that the pandas representation created by Arrow uses only 1 GB, whereas the representation produced by pandas' native read_csv is 1.3 GB. This means that even if you use pandas to process your data, Arrow can create a more compact representation to start with.
The second point concerns memory consumption during the conversion itself: while the converter is running, it needs memory to hold both the pandas and the Arrow representations, which defeats the purpose of using less memory. Arrow can self-destruct its representation while creating the pandas version, resolving the problem. The line for this is vdata = vdata_arrow.to_pandas(self_destruct=True).
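Putting the pieces together, a memory-conscious loading pattern could look like the sketch below. split_blocks=True is another to_pandas option that, together with self_destruct, helps reduce peak memory during conversion; the Arrow table must not be used afterwards, so we delete the reference:

```python
from pyarrow import csv

vdata_arrow = csv.read_csv(
    "2021VAERSDATA.csv.gz",
    convert_options=csv.ConvertOptions(include_columns=columns))
# convert to pandas, releasing Arrow buffers as the conversion proceeds
vdata = vdata_arrow.to_pandas(self_destruct=True, split_blocks=True)
del vdata_arrow  # the self-destructed table is no longer usable
vdata.info(memory_usage="deep")
```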
There’s more...
If you have a very large DataFrame that cannot be processed by pandas, even after it has been loaded by Arrow, then Arrow may be able to do all the processing, as it also has a compute engine. That being said, at the time of writing, Arrow's engine is substantially less complete in terms of functionality than pandas. Remember that Arrow has many other features, such as language interoperability, but we will not be making use of those in this book.
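As a small taste of that compute engine (and to finally use the pyarrow.compute import from the first step), here is a sketch that counts death reports without converting to pandas. It assumes the DIED column contains "Y" for deaths and that vdata_arrow still holds the loaded table (that is, it has not been self-destructed):

```python
import pyarrow.compute as pc

# boolean mask computed entirely in Arrow
died_mask = pc.equal(vdata_arrow["DIED"], "Y")
print(vdata_arrow.filter(died_mask).num_rows)

# frequency of each value in the column
print(pc.value_counts(vdata_arrow["DIED"]))
```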