Reducing the memory usage of pandas DataFrames
When you are dealing with lots of information – for example, when analyzing whole genome sequencing data – memory usage may become a limitation for your analysis. It turns out that naïve pandas is not very efficient from a memory perspective, and we can substantially reduce its consumption.
In this recipe, we are going to revisit our VAERS data and look at several ways to reduce pandas memory usage. The impact of these changes can be massive: in many cases, reducing memory consumption may mean the difference between being able to use pandas and requiring a more complex alternative, such as Dask or Spark.
Getting ready
We will be using the data from the first recipe. If you have run it, you are all set; if not, please follow the steps discussed there. You can find this code in Chapter02/Pandas_Memory.py.
How to do it…
Follow these steps:
- First, let’s load the data and inspect the size of the DataFrame:
```python
import numpy as np
import pandas as pd

vdata = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
vdata.info(memory_usage="deep")
```
Here is an abridged version of the output:
```
RangeIndex: 654986 entries, 0 to 654985
Data columns (total 35 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   VAERS_ID      654986 non-null  int64
 2   STATE         572236 non-null  object
 3   AGE_YRS       583424 non-null  float64
 6   SEX           654986 non-null  object
 8   SYMPTOM_TEXT  654828 non-null  object
 9   DIED          8536 non-null    object
 31  BIRTH_DEFECT  383 non-null     object
 34  ALLERGIES     330630 non-null  object
dtypes: float64(5), int64(2), object(28)
memory usage: 1.3 GB
```
Here, we have information about the number of rows, along with the type and non-null count of each column. Finally, we can see that the DataFrame requires a whopping 1.3 GB.
- We can also inspect the size of each column:
```python
for name in vdata.columns:
    col_bytes = vdata[name].memory_usage(index=False, deep=True)
    col_type = vdata[name].dtype
    print(name, col_type, col_bytes // (1024 ** 2))
```
Here is an abridged version of the output:
```
VAERS_ID int64 4
STATE object 34
AGE_YRS float64 4
SEX object 36
RPT_DATE object 20
SYMPTOM_TEXT object 442
DIED object 20
ALLERGIES object 34
```
SYMPTOM_TEXT occupies 442 MB, which is roughly one-third of our entire table.
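If you want to rank the columns by footprint instead of scanning the list by eye, a small variation of the loop above (not part of the original recipe) prints the largest offenders first:

```python
# Rank columns by memory footprint to spot the biggest offenders.
col_sizes = {
    name: vdata[name].memory_usage(index=False, deep=True)
    for name in vdata.columns
}
for name, size in sorted(col_sizes.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{name}: {size // (1024 ** 2)} MB")
```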
- Now, let’s look at the DIED column. Can we find a more efficient representation?

```python
vdata.DIED.memory_usage(index=False, deep=True)
vdata.DIED.fillna(False).astype(bool).memory_usage(index=False, deep=True)
```
The original column takes 21,181,488 bytes, whereas our compact representation takes 656,986 bytes. That’s 32 times less!
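The two calls above only measure the alternative representation; to actually keep it, assign it back to the DataFrame. Here is a minimal sketch, assuming a missing value really means the person did not die:

```python
# Replace the sparse "Y"/NaN text column with a dense boolean column
# (assumption: NaN is treated as "did not die").
vdata["DIED"] = vdata.DIED.fillna(False).astype(bool)
print(vdata.DIED.memory_usage(index=False, deep=True))  # roughly 1 byte per row
```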
- What about the STATE column? Can we do better?

```python
vdata["STATE"] = vdata.STATE.str.upper()
states = list(vdata["STATE"].unique())
vdata["encoded_state"] = vdata.STATE.apply(lambda state: states.index(state))
vdata["encoded_state"] = vdata["encoded_state"].astype(np.uint8)
vdata["STATE"].memory_usage(index=False, deep=True)
vdata["encoded_state"].memory_usage(index=False, deep=True)
```
Here, we convert the STATE column, which is text, into encoded_state, which is a number: the position of each state’s name in the states list. We can use that number to look the state up again in the list. The original column takes around 36 MB, whereas the encoded column takes 0.6 MB.
As an alternative to this approach, you can look at categorical variables in pandas, sketched below; I prefer to use them as they have wider applications.
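If you go the categorical route, the conversion is a one-liner. Here is a minimal sketch (the state_cat column name is just for illustration); the savings are in the same ballpark as the manual encoding, since pandas stores each distinct state once and keeps small integer codes per row:

```python
# Categorical alternative: pandas stores the distinct states once and
# keeps compact integer codes for each row.
vdata["state_cat"] = vdata["STATE"].astype("category")
print(vdata["state_cat"].memory_usage(index=False, deep=True))
print(vdata["state_cat"].cat.codes.head())  # the underlying integer codes
```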
- We can apply most of these optimizations when we load the data, so let’s prepare for that. But now we have a chicken-and-egg problem: to encode the STATE column at load time, we first need to know all the possible state names, so we do a preliminary pass just to collect them, like so:
```python
states = list(pd.read_csv(
    "vdata_sample.csv.gz",
    converters={
        "STATE": lambda state: state.upper()
    },
    usecols=["STATE"]
)["STATE"].unique())
```
We have a converter that simply returns the uppercase version of the state. We only load the STATE column, to save memory and processing time. Finally, we get the STATE column from the resulting single-column DataFrame and take its unique values.
- The ultimate optimization is not to load data that we don’t need. Imagine that we don’t need SYMPTOM_TEXT – that is around 1/3 of the data. In that case, we can just skip it. Here is the final version:

```python
vdata = pd.read_csv(
    "vdata_sample.csv.gz",
    index_col="VAERS_ID",
    converters={
        "DIED": lambda died: died == "Y",
        "STATE": lambda state: states.index(state.upper())
    },
    usecols=lambda name: name != "SYMPTOM_TEXT"
)
vdata["STATE"] = vdata["STATE"].astype(np.uint8)
vdata.info(memory_usage="deep")
```
We are now at 714 MB, which is a bit over half of the original. This could still be substantially reduced by applying the methods we used for STATE and DIED to all other columns, as sketched below.
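As a rough sketch of what that could look like (the extra column choices here are illustrative, not part of the original recipe), you can add more converters and dtype hints to the same read_csv call:

```python
# Illustrative extension: also encode another low-cardinality text column
# (SEX) as a category directly at load time.
vdata = pd.read_csv(
    "vdata_sample.csv.gz",
    index_col="VAERS_ID",
    converters={
        "DIED": lambda died: died == "Y",
        "STATE": lambda state: states.index(state.upper()),
    },
    dtype={"SEX": "category"},
    usecols=lambda name: name != "SYMPTOM_TEXT",
)
vdata["STATE"] = vdata["STATE"].astype(np.uint8)
vdata.info(memory_usage="deep")
```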
See also
The following is some extra information that may be useful:
- If you are willing to use a support library to help with Python processing, check out the next recipe on Apache Arrow, which will allow you to achieve additional memory savings.
- If you end up with DataFrames that take more memory than you have available on a single machine, then you must step up your game and use chunking (sketched briefly below, though we will not cover it further in the pandas context) or something that can deal with large data automatically. Dask, which we’ll cover in Chapter 11, Parallel Processing with Dask and Zarr, allows you to work with larger-than-memory datasets via, among other things, a pandas-like interface.
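Although the recipes in this chapter do not go into chunking, the basic pandas pattern is short. Here is a minimal sketch that counts deaths without ever holding the whole table in memory (the chunk size is arbitrary):

```python
# Chunked reading: process the file in pieces of 100,000 rows,
# keeping only a running aggregate in memory.
total_died = 0
for chunk in pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1",
                         usecols=["DIED"], chunksize=100_000):
    total_died += (chunk["DIED"] == "Y").sum()
print(total_died)
```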