Pandas Cookbook

You're reading from Pandas Cookbook: Practical recipes for scientific computing, time series, and exploratory data analysis using Python

Product type Paperback
Published in Oct 2024
Publisher Packt
ISBN-13 9781836205876
Length 404 pages
Edition 3rd Edition
Authors (2): William Ayd, Matthew Harrison

Be cognizant of data sizes

As your datasets grow larger, you may find that you have to pick more memory-efficient data types to ensure your pd.DataFrame can still fit into memory.

Back in Chapter 3, Data Types, we discussed the different integral types and how they are a trade-off between memory usage and capacity. When dealing with untyped data sources like CSV and Excel files, pandas errs on the side of using too much memory rather than picking a type with too little capacity. This conservative approach can lead to inefficient usage of your system’s memory, so knowing how to choose smaller types can make the difference between loading a file and running out of memory (which Python reports as a MemoryError).
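
As a rough illustration of that conservative default, the sketch below reads a small in-memory CSV twice: once with inferred types and once with explicit, narrower types. The CSV contents and the names csv_text, default_df, and typed_df are made up for this example, and the narrower types are only safe because we know the value ranges in advance:

```python
import io

import pandas as pd

# Hypothetical CSV data: column "a" stays below 100, column "b" below 1,000.
csv_text = "a,b\n" + "\n".join(f"{i % 100},{i}" for i in range(1_000))

# With no type information, pandas infers 64-bit integers for both columns.
default_df = pd.read_csv(io.StringIO(csv_text))

# If the value ranges are known up front, narrower types can be requested.
typed_df = pd.read_csv(
    io.StringIO(csv_text), dtype={"a": "int8", "b": "int16"}
)

default_bytes = int(default_df.memory_usage(index=False).sum())
typed_bytes = int(typed_df.memory_usage(index=False).sum())
print(default_df.dtypes.astype(str).tolist(), default_bytes)
print(typed_df.dtypes.astype(str).tolist(), typed_bytes)
```

The same values occupy a fraction of the memory once the types match the data, which is the effect the rest of this recipe explores with the nullable integer types.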

How to do it

To illustrate the impact of picking proper data types, let’s start with a relatively large pd.DataFrame composed of Python integers:

import pandas as pd

df = pd.DataFrame({
    "a": [0] * 100_000,
    "b": [2 ** 8] * 100_000,
    "c": [2 ** 16] * 100_000,
    "d": [2 ** 32] * 100_000,
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")
df.head()
    a    b       c          d
0   0  256  65536  4294967296
1   0  256  65536  4294967296
2   0  256  65536  4294967296
3   0  256  65536  4294967296
4   0  256  65536  4294967296

With the integral types, determining how much memory each pd.Series requires is a rather simple exercise. With a pd.Int64Dtype, each record is a 64-bit integer that requires 8 bytes of memory. Alongside each record, the pd.Series associates a single byte that is either 0 or 1, telling us if the record is missing or not. Thus, in total, we need 9 bytes for each record, and with 100,000 records per pd.Series, our memory usage should come out to 900,000 bytes. pd.DataFrame.memory_usage confirms that this math is correct:

df.memory_usage()
Index       128
a        900000
b        900000
c        900000
d        900000
dtype: int64
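
The arithmetic above can also be checked programmatically. This is a minimal sketch that rebuilds two of the columns and compares the hand-computed size with what pandas reports; the names n_rows and results are introduced only for this example:

```python
import pandas as pd

n_rows = 100_000
df = pd.DataFrame({
    "a": [0] * n_rows,
    "d": [2 ** 32] * n_rows,
}).convert_dtypes(dtype_backend="numpy_nullable")

results = {}
for name, column in df.items():
    # Bytes for the value itself (8 for a 64-bit integer).
    value_bytes = column.dtype.numpy_dtype.itemsize
    # One extra byte per record for the missing-value mask.
    mask_bytes = 1
    expected = n_rows * (value_bytes + mask_bytes)
    actual = int(column.memory_usage(index=False))
    results[name] = (expected, actual)
    print(name, column.dtype, expected, actual)
```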

If you know what the types should be, you could explicitly pick better sizes for the pd.DataFrame columns using .astype:

df.assign(
    a=lambda x: x["a"].astype(pd.Int8Dtype()),
    b=lambda x: x["b"].astype(pd.Int16Dtype()),
    c=lambda x: x["c"].astype(pd.Int32Dtype()),
).memory_usage()
Index       128
a        200000
b        300000
c        500000
d        900000
dtype: int64
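
The sizes chosen above follow from the range each signed integer width can represent. As a sketch of that reasoning, the hypothetical helper smallest_signed_type below (not part of pandas or NumPy) uses np.iinfo to find the narrowest signed type that can hold a given range of values:

```python
import numpy as np

# Candidate signed integer types, from narrowest to widest.
SIGNED_TYPES = (np.int8, np.int16, np.int32, np.int64)


def smallest_signed_type(lowest, highest):
    """Return the name of the narrowest signed type covering [lowest, highest]."""
    for candidate in SIGNED_TYPES:
        info = np.iinfo(candidate)
        if info.min <= lowest and highest <= info.max:
            return np.dtype(candidate).name
    raise OverflowError("range does not fit in a 64-bit signed integer")


for value in (0, 2 ** 8, 2 ** 16, 2 ** 32):
    print(value, smallest_signed_type(value, value))
```

A value of 256 is one past the 8-bit signed maximum of 127, so column b needs 16 bits, and the same logic pushes c to 32 bits and leaves d at 64 bits.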

As a convenience, pandas can try to infer better sizes for you with a call to pd.to_numeric. Passing the downcast="signed" argument will ensure that we continue to work with signed integers, and we will continue to pass dtype_backend="numpy_nullable" to ensure we get proper missing value support:

df.select_dtypes("number").assign(
    **{x: pd.to_numeric(
         y, downcast="signed", dtype_backend="numpy_nullable"
    ) for x, y in df.items()}
).memory_usage()
Index       128
a        200000
b        300000
c        500000
d        900000
dtype: int64
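
To see which types pd.to_numeric actually selected, and what the change adds up to, you can inspect the result directly. The following is a self-contained sketch that repeats the downcast and compares total column memory before and after; the names downcast_df, before, and after are introduced only for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0] * 100_000,
    "b": [2 ** 8] * 100_000,
    "c": [2 ** 16] * 100_000,
    "d": [2 ** 32] * 100_000,
}).convert_dtypes(dtype_backend="numpy_nullable")

# Downcast every column to the narrowest signed nullable integer type.
downcast_df = df.assign(
    **{
        name: pd.to_numeric(
            column, downcast="signed", dtype_backend="numpy_nullable"
        )
        for name, column in df.items()
    }
)

# Total bytes used by the columns, excluding the index.
before = int(df.memory_usage(index=False).sum())
after = int(downcast_df.memory_usage(index=False).sum())
print(downcast_df.dtypes)
print(before, after)
```

For this data, the four columns shrink from 3,600,000 bytes to 1,900,000 bytes without changing any values.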