Pandas Cookbook: Practical recipes for scientific computing, time series, and exploratory data analysis using Python, Third Edition

pandas Foundations

The pandas library is useful for dealing with structured data. What is structured data? Data that is stored in tables, such as CSV files, Excel spreadsheets, or database tables, is all structured. Unstructured data consists of free-form text, images, sound, or video. If you find yourself dealing with structured data, pandas will be of great utility to you.

pd.Series is a one-dimensional collection of data. If you are coming from Excel, you can think of this as a column. The main difference is that, like a column in a database, all of the values within a pd.Series must have a single, homogeneous type.

pd.DataFrame is a two-dimensional object. Much like an Excel sheet or database table can be thought of as a collection of columns, pd.DataFrame can be thought of as a collection of pd.Series objects. Each pd.Series has a homogeneous data type, but the pd.DataFrame is allowed to be heterogeneous and store a variety of pd.Series objects with different data types.

pd.Index does not have a direct analogy with other tools. Excel may offer the closest with auto-numbered rows on the left-hand side of a worksheet, but those numbers tend to be for display purposes only. pd.Index, as you will find over the course of this book, can be used for selecting values, joining tables, and much more.

The recipes in this chapter will show you how to manually construct pd.Series and pd.DataFrame objects, customize the pd.Index object(s) associated with each, and showcase common attributes of the pd.Series and pd.DataFrame that you may need to inspect during your analyses.

We are going to cover the following recipes in this chapter:

Importing pandas
Series
DataFrame
Index
Series attributes
DataFrame attributes

Importing pandas

Most users of the pandas library will use an import alias so they can refer to it as pd. In general, in this book, we will not show the pandas and NumPy imports, but they look like this:

import pandas as pd
import numpy as np

Although PyArrow is an optional dependency in the 2.x series of pandas, many examples in this book leverage it. We assume it has been imported as:

import pyarrow as pa
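Because PyArrow is optional, it can be worth confirming that it is installed before running examples that rely on it. The following is a minimal sketch using only the standard library; the variable name has_pyarrow is purely illustrative and not part of pandas:

```python
import importlib.util

import pandas as pd

# find_spec returns None when a package cannot be located,
# so this check does not raise if PyArrow is missing.
has_pyarrow = importlib.util.find_spec("pyarrow") is not None

print("pandas version:", pd.__version__)
print("PyArrow available:", has_pyarrow)
```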

Series

The basic building block in pandas is a pd.Series, which is a one-dimensional array of data paired with a pd.Index. The index labels can be used as a simplistic way to look up values in the pd.Series, much like the Python dictionary built into the language uses key/value pairs (we will expand on this and much more pd.Index functionality in Chapter 2, Selection and Assignment).
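To make the dictionary comparison concrete, here is a small sketch. The fruit counts are made-up sample data; it relies on the fact that the pd.Series constructor accepts a dictionary and uses its keys as index labels:

```python
import pandas as pd

# A plain dictionary maps keys to values...
stock_dict = {"apple": 3, "banana": 5, "orange": 2}

# ...and a pd.Series built from it uses the keys as index labels.
stock = pd.Series(stock_dict)

# Label-based lookup on the pd.Series reads much like a dictionary lookup.
print(stock_dict["banana"])
print(stock["banana"])
```

Both lookups return the value stored under the label banana.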

The following section demonstrates a few ways of creating a pd.Series directly.

How to do it

The easiest way to construct a pd.Series is to provide a sequence of values, like a list of integers:

pd.Series([0, 1, 2])

0    0
1    1
2    2
dtype: int64

A tuple is another type of sequence, making it valid as an argument to the pd.Series constructor:

pd.Series((12.34, 56.78, 91.01))

0    12.34
1    56.78
2    91.01
dtype: float64

When generating sample data, you may often reach for the Python range function:

pd.Series(range(0, 7, 2))

0    0
1    2
2    4
3    6
dtype: int64
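The dtype line at the bottom of each output is inferred from the values. As a rough illustration (the variable names are hypothetical), a sequence of integers is stored as 64-bit integers, while mixing integers with floating-point values results in a single floating-point type for the whole pd.Series:

```python
import pandas as pd

ints_only = pd.Series([1, 2, 3])
ints_and_floats = pd.Series([1, 2.5, 3])

# One dtype per pd.Series: the integer 1 and 3 are stored as floats
# in the second case so that every value shares the same type.
print(ints_only.dtype)
print(ints_and_floats.dtype)
```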

In all of the examples so far, pandas will try to infer a proper data type from its arguments for you. However, there are times when you will know more about the type and size of your data than can be inferred. Providing that information explicitly to pandas via the dtype= argument can be useful to save memory or ensure proper integration with other typed systems, like SQL databases.

To illustrate this, let’s use a simple range argument to fill a pd.Series with a sequence of integers. When we did this before, the inferred data type was a 64-bit integer, but we, as developers, may know that we never expect to store larger values in this pd.Series and would be fine with only 8 bits of storage (if you do not know the difference between an 8-bit and 64-bit integer, that topic will be covered in Chapter 3, Data Types). Passing dtype="int8" to the pd.Series constructor will let pandas know we want to use the smaller data type:

pd.Series(range(3), dtype="int8")

0    0
1    1
2    2
dtype: int8
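One way to see the memory effect is to compare the same values stored at two widths. This is a minimal sketch; wide and narrow are illustrative names, and the 64-bit dtype is spelled out explicitly so the comparison does not depend on inference:

```python
import pandas as pd

# Same values, two different integer widths.
wide = pd.Series(range(3), dtype="int64")
narrow = pd.Series(range(3), dtype="int8")

# .nbytes reports the bytes used by the underlying values
# (the index is not included).
print(wide.nbytes)    # 3 values * 8 bytes each
print(narrow.nbytes)  # 3 values * 1 byte each
```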

A pd.Series can also have a name attached to it, which can be specified via the name= argument (if not specified, the name defaults to None):

pd.Series(["apple", "banana", "orange"], name="fruit")

0     apple
1    banana
2    orange
Name: fruit, dtype: object
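The default of None can be checked directly, and a name can also be attached after construction. The sketch below uses pd.Series.rename with a scalar argument, which returns a new pd.Series carrying that name; the variable names are illustrative:

```python
import pandas as pd

unnamed = pd.Series(["apple", "banana", "orange"])
print(unnamed.name is None)

# rename with a scalar returns a new pd.Series with that name;
# the original object is left unchanged.
named = unnamed.rename("fruit")
print(named.name)
```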

DataFrame

While pd.Series is the building block, pd.DataFrame is the primary and most commonly used object in pandas; when people think of pandas, they typically envision working with a pd.DataFrame.

In most analysis workflows, you will be importing your data from another source, but for now, we will show you how to construct a pd.DataFrame directly (input/output will be covered in Chapter 4, The pandas I/O System).

How to do it

The most basic construction of a pd.DataFrame happens with a two-dimensional sequence, like a list of lists:

pd.DataFrame([
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
])

   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8

With a list of lists, pandas will automatically number the row and column labels for you. Typically, users of pandas will at least provide labels for columns, as it makes indexing and selecting from a pd.DataFrame much more intuitive (see Chapter 2, Selection and Assignment, for an introduction to indexing and selecting). To label your columns when constructing a pd.DataFrame from a list of lists, you can provide a columns= argument to the constructor:

pd.DataFrame([
    [1, 2],
    [4, 8],
], columns=["col_a", "col_b"])

   col_a  col_b
0      1      2
1      4      8

Instead of using a list of lists, you could also provide a dictionary. The keys of the dictionary will be used as column labels, and the values of the dictionary will represent the values placed in that column of the pd.DataFrame:

pd.DataFrame({
    "first_name": ["Jane", "John"],
    "last_name": ["Doe", "Smith"],
})

  first_name last_name
0       Jane       Doe
1       John     Smith

In the above example, our dictionary values were lists of strings, but the pd.DataFrame does not strictly require lists. Any sequence will work, including a pd.Series:

ser1 = pd.Series(range(3), dtype="int8", name="int8_col")
ser2 = pd.Series(range(3), dtype="int16", name="int16_col")
pd.DataFrame({ser1.name: ser1, ser2.name: ser2})

   int8_col  int16_col
0         0          0
1         1          1
2         2          2
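The rendered table does not show it, but each column keeps the data type of the pd.Series it was built from. A short sketch that repeats the construction above and then inspects the result:

```python
import pandas as pd

ser1 = pd.Series(range(3), dtype="int8", name="int8_col")
ser2 = pd.Series(range(3), dtype="int16", name="int16_col")
df = pd.DataFrame({ser1.name: ser1, ser2.name: ser2})

# Each column keeps the dtype of the pd.Series it came from.
print(df.dtypes)

# Selecting a single column hands back a pd.Series.
print(type(df["int8_col"]))
```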

Index

When constructing both the pd.Series and pd.DataFrame objects in the previous sections, you likely noticed the values to the left of these objects starting at 0 and incrementing by 1 for each new row of data. The object responsible for those values is the pd.Index, highlighted in the following image:

Figure 1.1: Default pd.Index, highlighted in red

In the case of a pd.DataFrame, you have a pd.Index not only to the left of the object (often referred to as the row index or even just index) but also above (often referred to as the column index or columns):

Figure 1.2: A pd.DataFrame with a row and column index

Unless explicitly provided, pandas will create an auto-numbered pd.Index for you (technically, this is a pd.RangeIndex, a subclass of the pd.Index class). However, it is very rare to use pd.RangeIndex for your columns, as referring to a column named City or Date is more expressive than referring to a column in the nth position. The pd.RangeIndex appears more commonly in the row index, although you may still want custom labels to appear there as well. More advanced selection operations with the default pd.RangeIndex and custom pd.Index values will be covered in Chapter 2, Selection and Assignment, to help you understand different use cases. For now, let’s just look at how you would override the construction of the row and column pd.Index objects during pd.Series and pd.DataFrame construction.
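Before overriding the default, it can help to confirm what pandas creates on its own. A minimal sketch with arbitrary sample values:

```python
import pandas as pd

ser = pd.Series([10, 20, 30])

# With no index= argument, pandas attaches a pd.RangeIndex...
print(ser.index)
print(isinstance(ser.index, pd.RangeIndex))

# ...which is also a pd.Index, since pd.RangeIndex is a subclass.
print(isinstance(ser.index, pd.Index))
```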

How to do it

When constructing a pd.Series, the easiest way to change the row index is by providing a sequence of labels to the index= argument. In this example, the labels dog, cat, and human will be used instead of the default pd.RangeIndex numbered from 0 to 2:

pd.Series([4, 4, 2], index=["dog", "cat", "human"])

dog      4
cat      4
human    2
dtype: int64

If you want finer control, you may want to construct the pd.Index yourself before passing it as an argument to index=. In the following example, the pd.Index is given the name animal, and the pd.Series itself is named num_legs, providing more context to the data:

index = pd.Index(["dog", "cat", "human"], name="animal")
pd.Series([4, 4, 2], name="num_legs", index=index)

animal
dog      4
cat      4
human    2
Name: num_legs, dtype: int64

A pd.DataFrame uses a pd.Index for both dimensions. Much like with the pd.Series constructor, the index= argument can be used to specify the row labels, but you now also have the columns= argument to control the column labels:

pd.DataFrame([
    [24, 180],
    [42, 166],
], columns=["age", "height_cm"], index=["Jack", "Jill"])

      age  height_cm
Jack   24        180
Jill   42        166
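As a brief preview of why custom labels are useful (selection is covered properly in Chapter 2), the sketch below uses .loc to pull a single value out by its row and column labels:

```python
import pandas as pd

df = pd.DataFrame([
    [24, 180],
    [42, 166],
], columns=["age", "height_cm"], index=["Jack", "Jill"])

# .loc selects by label: row label first, then column label.
print(df.loc["Jill", "height_cm"])
```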

Series attributes

Once you have a pd.Series, there are quite a few attributes you may want to inspect. The most basic attributes can tell you the type and size of your data, which is often the first thing you will inspect when reading in data from a data source.

How to do it

Let’s start by creating a pd.Series that has a name, alongside a custom pd.Index, which itself has a name. Although not all of these elements are required, having them will help us more clearly understand what the attributes we access through this recipe are actually showing us:

index = pd.Index(["dog", "cat", "human"], name="animal")
ser = pd.Series([4, 4, 2], name="num_legs", index=index)
ser

animal
dog      4
cat      4
human    2
Name: num_legs, dtype: int64

The first thing users typically want to know about their data is the data type of the pd.Series. This can be inspected via the pd.Series.dtype attribute:

ser.dtype

dtype('int64')

The name may be inspected via the pd.Series.name attribute. The data we constructed in this recipe was created with the name="num_legs" argument, which is what you will see when accessing this attribute (if not provided, this will return None):

ser.name

num_legs

The associated pd.Index can be accessed via pd.Series.index:

ser.index

Index(['dog', 'cat', 'human'], dtype='object', name='animal')

The name of the associated pd.Index can be accessed via pd.Series.index.name:

ser.index.name

animal

The shape can be accessed via pd.Series.shape. For a one-dimensional pd.Series, the shape is returned as a one-tuple where the first element represents the number of rows:

ser.shape

(3,)

The size (number of elements) can be accessed via pd.Series.size:

ser.size

3

The Python built-in function len can show you the length (number of rows):

len(ser)

3
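For a one-dimensional pd.Series, these three attributes all describe the same quantity. The following sketch rebuilds the pd.Series from this recipe and checks that they agree:

```python
import pandas as pd

index = pd.Index(["dog", "cat", "human"], name="animal")
ser = pd.Series([4, 4, 2], name="num_legs", index=index)

# The only entry in the shape tuple is the number of rows.
n_rows = ser.shape[0]

# With a single dimension, size and len report that same number.
print(n_rows == ser.size == len(ser))
```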

DataFrame attributes

The pd.DataFrame shares many of the attributes of the pd.Series, with some slight differences. Generally, pandas tries to share as many attributes as possible between the pd.Series and pd.DataFrame, but the two-dimensional nature of the pd.DataFrame makes it more natural to express some things in plural form (for example, the .dtype attribute becomes .dtypes) and gives us a few more attributes to inspect (for example, .columns exists for a pd.DataFrame but not for a pd.Series).

How to do it

Much like we did in the previous section, we are going to construct a pd.DataFrame with a custom pd.Index in the rows, while also using custom labels in the columns. This will be more helpful when inspecting the various attributes:

index = pd.Index(["Jack", "Jill"], name="person")
df = pd.DataFrame([
    [24, 180, "red"],
    [42, 166, "blue"],
], columns=["age", "height_cm", "favorite_color"], index=index)
df

        age  height_cm favorite_color
person
Jack     24        180            red
Jill     42        166           blue

The types of each column can be inspected via the pd.DataFrame.dtypes attribute. This attribute returns a pd.Series where each row shows the data type corresponding to each column in our pd.DataFrame:

df.dtypes

age                int64
height_cm          int64
favorite_color    object
dtype: object
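Since the result is an ordinary pd.Series labeled by column name, a single column's type can be looked up by label. A minimal sketch, building the same data from a dictionary for brevity:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [24, 42],
    "height_cm": [180, 166],
    "favorite_color": ["red", "blue"],
}, index=pd.Index(["Jack", "Jill"], name="person"))

dtypes = df.dtypes

# The result is itself a pd.Series whose index matches the columns.
print(type(dtypes))
print(dtypes.index.equals(df.columns))

# So one column's dtype can be looked up by its label.
print(dtypes["age"])
```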

The row index can be accessed via pd.DataFrame.index:

df.index

Index(['Jack', 'Jill'], dtype='object', name='person')

The column index can be accessed via pd.DataFrame.columns:

df.columns

Index(['age', 'height_cm', 'favorite_color'], dtype='object')

The shape can be accessed via pd.DataFrame.shape. For a two-dimensional pd.DataFrame, the shape is returned as a two-tuple where the first element represents the number of rows and the second element represents the number of columns:

df.shape

(2, 3)

The size (number of elements) can be accessed via pd.DataFrame.size:

df.size

6

The Python built-in function len can show you the length (number of rows):

len(df)

2
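Unlike with a pd.Series, these values differ for a pd.DataFrame, but they are related in a simple way. The sketch below rebuilds the pd.DataFrame from this recipe and checks those relationships:

```python
import pandas as pd

index = pd.Index(["Jack", "Jill"], name="person")
df = pd.DataFrame([
    [24, 180, "red"],
    [42, 166, "blue"],
], columns=["age", "height_cm", "favorite_color"], index=index)

n_rows, n_cols = df.shape

# size counts every cell, so it equals rows times columns.
print(df.size == n_rows * n_cols)

# len counts rows only; the column count comes from .columns.
print(len(df) == n_rows)
print(len(df.columns) == n_cols)
```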

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers:

https://packt.link/pandas

Key benefits

This book targets features in pandas 2.x and beyond

Practical, easy to implement recipes for quick solutions to common problems in data using pandas

Master the fundamentals of pandas to quickly begin exploring any dataset

Description

Unlock the full power of pandas 2.x with this hands-on cookbook, designed for Python developers, data analysts, and data scientists who need fast, efficient solutions for real-world data challenges. This book provides practical, ready-to-use recipes to streamline your workflow. With step-by-step guidance, you'll master data wrangling, visualization, performance optimization, and scalable data analysis using pandas’ most powerful features. From importing and merging large datasets to advanced time series analysis and SQL-like operations, this cookbook equips you with the tools to analyze, manipulate, and visualize data like a pro. Learn how to boost efficiency, optimize memory usage, and seamlessly integrate pandas with NumPy, PyArrow, and databases. This book will help you transform raw data into actionable insights with ease.

Who is this book for?

This book is for Python developers, data scientists, engineers, and analysts. pandas is the ideal tool for manipulating structured data with Python, and this book provides ample instruction and examples. Not only does it cover the basics required to be proficient, but it also goes into the details of idiomatic pandas.

What you will learn

The pandas type system and how to best navigate it

Import/export DataFrames to/from common data formats

Data exploration in pandas through dozens of practice problems

Grouping, aggregation, transformation, reshaping, and filtering data

Merge data from different sources through pandas SQL-like operations

Leverage the robust pandas time series functionality in advanced analyses

Scale pandas operations to get the most out of your system

The large ecosystem that pandas can coordinate with and supplement

JW Nov 02, 2024

Pandas Cookbook third edition is an excellent reference for pandas users. I have been writing Python code and using pandas for more than 10 years, and still managed to learn something every chapter reading this book. I found it to be written in such a way that it allowed for reading cover to cover, but would also be useful jumping straight to the sections you need when trying to use as a reference. The examples in the book remained short enough for readers to use themselves but still clear enough to demonstrate both the "how" and "why." This book I found approachable enough that I would even recommend it to people just getting introduced to pandas. Experienced pandas will likely appreciate this book explains not just the "how" to accomplish a task but also the "why." Covering topics like reading data in from different sources, various ways to select data, how to perform aggregations and transformations well, working with complex types (such as datetimes), performance tuning, and visualizations, this is a book that I will find myself reaching for regularly.

Amazon Verified review

Robert Nov 03, 2024

The third edition of Cooking with Pandas was a welcome resource for learning about Pandas in Python. The book starts with the foundations and continues to build throughout. Like any good cookbook, there is a quick explanation of the material and a section on how to perform the task. The book is for anyone interested in Pandas, from beginners to well-seasoned developers. You can't go wrong by picking up this book. You will learn a lot!

Souvik Roy Nov 02, 2024

The book is perfect for beginners as there are tons of resources online, and this book tries to bring it all up in one place. I am new to Time Series, so this book will personally help me to know more about it and make me efficient in using the knowledge in the book in my real-world projects.

Fernando Villanueva Dec 14, 2024

Feefo Verified review

Rajavel S Feb 09, 2025

Covers below topics with good examples and key information that can help speed up your work
- Pandas Foundation (series, data frames, index, etc.)
- Selection and Assignments
- Data Types
- Pandas I/O Systems
- Algorithms & How to apply them
- Visualizations
- Reshaping Dataframes
- Group By
- Temporal Data Types and Algorithms
- General Usage and Performance Tips
- Pandas ecosystems
It also gives you with some other books recommendations for learning Panda at Ease. Hope you enjoy learning and utilizing this book!
