Be cognizant of data sizes
As your datasets grow larger, you may find that you have to pick more appropriate data types to ensure your pd.DataFrame can still fit into memory.
Back in Chapter 3, Data Types, we discussed the different integral types and how they are a trade-off between memory usage and capacity. When dealing with untyped data sources like CSV and Excel files, pandas will err on the side of using too much memory as opposed to picking the wrong capacity. This conservative approach can lead to inefficient usage of your system's memory, so knowing how to optimize your data types can make the difference between loading a file and receiving an OutOfMemory error.
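If you already know the right types, one way to sidestep the problem entirely is to declare them when reading the file. The following is only a sketch, with a hypothetical file name and placeholder column names, showing how pd.read_csv accepts both a dtype mapping and the dtype_backend argument:
import pandas as pd

# "data.csv", "small_col", and "big_col" are placeholders for illustration;
# the dtype mapping tells pandas how much capacity each column needs, so it
# never allocates 64-bit integers where a smaller type would do.
df = pd.read_csv(
    "data.csv",
    dtype={
        "small_col": pd.Int8Dtype(),
        "big_col": pd.Int64Dtype(),
    },
    dtype_backend="numpy_nullable",
)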
How to do it
To illustrate the impact of picking proper data types, let's start with a relatively large pd.DataFrame composed of Python integers:
df = pd.DataFrame({
    "a": [0] * 100_000,
    "b": [2 ** 8] * 100_000,
    "c": [2 ** 16] * 100_000,
    "d": [2 ** 32] * 100_000,
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df.head()
   a    b      c           d
0  0  256  65536  4294967296
1  0  256  65536  4294967296
2  0  256  65536  4294967296
3  0  256  65536  4294967296
4  0  256  65536  4294967296
With the integral types, determining how much memory each pd.Series requires is a rather simple exercise. With a pd.Int64Dtype, each record is a 64-bit integer that requires 8 bytes of memory. Alongside each record, the pd.Series stores a single byte that is either 0 or 1, telling us whether the record is missing. Thus, in total, we need 9 bytes for each record, and with 100,000 records per pd.Series, our memory usage should come out to 900,000 bytes. pd.DataFrame.memory_usage confirms that this math is correct:
df.memory_usage()
Index       128
a        900000
b        900000
c        900000
d        900000
dtype: int64
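If you want to double-check that arithmetic yourself, a quick sketch lines up with the reported numbers (the 8 + 1 bytes per record and the 100,000 row count come straight from the discussion above):
bytes_per_record = 8 + 1   # 8 bytes for the Int64 value, 1 byte for the missing-value mask
expected = bytes_per_record * 100_000
expected == df.memory_usage()["a"]   # True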
If you know what the types should be, you could explicitly pick better sizes for the pd.DataFrame columns using .astype. Column a holds only 0, so it fits in a pd.Int8Dtype; b (256) just exceeds the pd.Int8Dtype maximum of 127, and c (65,536) exceeds the pd.Int16Dtype maximum of 32,767, so each needs the next size up:
df.assign(
    a=lambda x: x["a"].astype(pd.Int8Dtype()),
    b=lambda x: x["b"].astype(pd.Int16Dtype()),
    c=lambda x: x["c"].astype(pd.Int32Dtype()),
).memory_usage()
Index       128
a        200000
b        300000
c        500000
d        900000
dtype: int64
As a convenience, pandas can try to infer better sizes for you with a call to pd.to_numeric. Passing the downcast="signed" argument ensures that we continue to work with signed integers, and we will continue to pass dtype_backend="numpy_nullable" to ensure we get proper missing value support:
df.select_dtypes("number").assign(
    **{x: pd.to_numeric(
        y, downcast="signed", dtype_backend="numpy_nullable"
    ) for x, y in df.items()}
).memory_usage()
Index       128
a        200000
b        300000
c        500000
d        900000
dtype: int64
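Downcasting only changes how the values are stored, not the values themselves. As a quick sanity check (a sketch only; df_small is just a throwaway name for the downcast result), you can confirm both objects hold identical data even though the dtypes now differ:
df_small = df.assign(
    **{x: pd.to_numeric(
        y, downcast="signed", dtype_backend="numpy_nullable"
    ) for x, y in df.items()}
)
# element-wise comparison works across the differing integer widths
(df == df_small).all().all()   # True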