Be cognizant of data sizes
As your datasets grow larger, you may find that you have to pick more appropriate data types to ensure your pd.DataFrame can still fit into memory.
Back in Chapter 3, Data Types, we discussed the different integral types and how they are a trade-off between memory usage and capacity. When dealing with untyped data sources like CSV and Excel files, pandas will err on the side of using too much memory as opposed to picking the wrong capacity. This conservative approach can lead to inefficient usage of your system's memory, so knowing how to optimize your data types can make the difference between loading a file and receiving an OutOfMemory error.
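If you already know the right types, one way to sidestep the problem entirely is to declare them when reading the file. The following is only a sketch, with a hypothetical file name and placeholder column names, showing how pd.read_csv accepts both a dtype mapping and the dtype_backend argument:
import pandas as pd

# "data.csv", "small_col", and "big_col" are placeholders for illustration;
# the dtype mapping tells pandas how much capacity each column needs, so it
# never allocates 64-bit integers where a smaller type would do.
df = pd.read_csv(
    "data.csv",
    dtype={
        "small_col": pd.Int8Dtype(),
        "big_col": pd.Int64Dtype(),
    },
    dtype_backend="numpy_nullable",
)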
How to do it
To illustrate the impact of picking proper data types, let's start with a relatively large pd.DataFrame composed of Python integers:
df = pd.DataFrame({
    "a": [0] * 100_000,
    "b": [2 ** 8] * 100_000,
    "c": [2 ** 16] * 100_000,
    "d": [2 ** 32] * 100_000,
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df.head()
   a    b      c           d
0  0  256  65536  4294967296
1  0  256  65536  4294967296
2  0  256  65536  4294967296
3  0  256  65536  4294967296
4  0  256  65536  4294967296
With the integral types, determining how much memory each pd.Series requires is a rather simple exercise. With a pd.Int64Dtype, each record is a 64-bit integer that requires 8 bytes of memory. Alongside each record, the pd.Series stores a single byte that is either 0 or 1, telling us whether the record is missing. Thus, in total, we need 9 bytes for each record, and with 100,000 records per pd.Series, our memory usage should come out to 900,000 bytes. pd.DataFrame.memory_usage confirms that this math is correct:
df.memory_usage()
Index       128
a        900000
b        900000
c        900000
d        900000
dtype: int64
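If you want to double-check that arithmetic yourself, a quick sketch lines up with the reported numbers (the 8 + 1 bytes per record and the 100,000 row count come straight from the discussion above):
bytes_per_record = 8 + 1   # 8 bytes for the Int64 value, 1 byte for the missing-value mask
expected = bytes_per_record * 100_000
expected == df.memory_usage()["a"]   # True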
If you know what the types should be, you could explicitly pick better sizes for the pd.DataFrame columns using .astype. Column a holds only 0, so it fits in a pd.Int8Dtype; b (256) just exceeds the pd.Int8Dtype maximum of 127, and c (65,536) exceeds the pd.Int16Dtype maximum of 32,767, so each needs the next size up:
df.assign(
    a=lambda x: x["a"].astype(pd.Int8Dtype()),
    b=lambda x: x["b"].astype(pd.Int16Dtype()),
    c=lambda x: x["c"].astype(pd.Int32Dtype()),
).memory_usage()
Index       128
a        200000
b        300000
c        500000
d        900000
dtype: int64
As a convenience, pandas can try to infer better sizes for you with a call to pd.to_numeric. Passing the downcast="signed" argument ensures that we continue to work with signed integers, and we will continue to pass dtype_backend="numpy_nullable" to ensure we get proper missing value support:
df.select_dtypes("number").assign(
    **{x: pd.to_numeric(
        y, downcast="signed", dtype_backend="numpy_nullable"
    ) for x, y in df.items()}
).memory_usage()
Index       128
a        200000
b        300000
c        500000
d        900000
dtype: int64
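Downcasting only changes how the values are stored, not the values themselves. As a quick sanity check (a sketch only; df_small is just a throwaway name for the downcast result), you can confirm both objects hold identical data even though the dtypes now differ:
df_small = df.assign(
    **{x: pd.to_numeric(
        y, downcast="signed", dtype_backend="numpy_nullable"
    ) for x, y in df.items()}
)
# element-wise comparison works across the differing integer widths
(df == df_small).all().all()   # True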