Be cognizant of data sizes
As your datasets grow larger, you may find that you need to choose more efficient data types to ensure your pd.DataFrame still fits into memory.
Back in Chapter 3, Data Types, we discussed the different integral types and how they trade off memory usage against capacity. When dealing with untyped data sources like CSV and Excel files, pandas errs on the side of using too much memory rather than risking a type with too little capacity. This conservative approach can lead to inefficient use of your system's memory, so knowing how to pick tighter types can make the difference between loading a file successfully and receiving an OutOfMemory error.
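For example, here is a minimal sketch of supplying explicit types at read time so pandas does not default every integer column to int64; the file name ratings.csv and its column names are hypothetical stand-ins for your own data:

import pandas as pd

# Hypothetical file and column names, used only for illustration.
# Without the dtype argument, pandas would infer int64 for these columns.
ratings = pd.read_csv(
    "ratings.csv",
    dtype={"user_id": "int32", "rating": "int8"},
)
print(ratings.dtypes)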
How to do it
To illustrate the impact of picking proper data types, let’s start with a relatively large pd.DataFrame
composed of Python integers:
import pandas as pd

df = pd.DataFrame({
    "a": [0] * 100_000,
    "b": [2 ** 8] * 100_000,
    "c": [2 ** 16] * 100_000,
    # assumed completion: the source is truncated here; 2 ** 32
    # follows the doubling pattern of the preceding columns
    "d": [2 ** 32] * 100_000,
})