Dictionary-encode low-cardinality data
Back in Chapter 3, Data Types, we talked about the categorical data type, which can help to reduce memory usage by replacing occurrences of strings (or any data type, really) with much smaller integral codes. While Chapter 3, Data Types, provides a good technical deep dive, it is worth restating this as a best practice here, given how significant the savings can be when working with low-cardinality data, i.e., data where the ratio of unique values to the overall record count is relatively low.
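As a quick refresher on how the encoding works, the sketch below builds a tiny categorical pd.Series and inspects it through the standard .cat accessor; the integer codes shown are what pandas actually stores in place of the repeated strings:
import pandas as pd

ser = pd.Series(["foo", "bar", "foo", "baz"], dtype="category")
ser.cat.categories      # Index(['bar', 'baz', 'foo'], dtype='object')
ser.cat.codes.dtype     # int8 -- one byte per row instead of a full string
ser.cat.codes.tolist()  # [2, 0, 2, 1]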
How to do it
Just to drive home the point about memory savings, let’s create a low-cardinality pd.Series. Our pd.Series is going to have 300,000 rows of data, but only three unique values of "foo", "bar", and "baz":
import pandas as pd

values = ["foo", "bar", "baz"]
# 300,000 rows built by repeating the three unique values
ser = pd.Series(values * 100_000, dtype=pd.StringDtype())
ser.memory_usage()
2400128
Simply changing this to a categorical data type shrinks the memory footprint dramatically.
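A minimal sketch of that conversion, continuing from the ser created above; astype with pd.CategoricalDtype is standard pandas, though the exact byte count you see will vary slightly by pandas version and platform:
ser_cat = ser.astype(pd.CategoricalDtype())
ser_cat.memory_usage()  # roughly 300,000 bytes -- one int8 code per row plus small overhead
With only three unique values, each row now stores a one-byte integer code rather than an eight-byte pointer to a string object, which is where the roughly eightfold reduction comes from.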