Pandas Cookbook: Practical recipes for scientific computing, time series, and exploratory data analysis using Python, Third Edition

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!

Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!

50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

Thousands of reference materials covering every tech concept you need to stay up to date.

Pandas Cookbook

Selection and Assignment

In the previous chapter, we looked at how to create a pd.Series and pd.DataFrame, and we also looked at their relationship to the pd.Index. With a foundation in constructors, we now shift focus to the crucial processes of selection and assignment. Selection, also referred to as indexing, is considered a getter; i.e., it is used to retrieve values from a pandas object. Assignment, by contrast, is a setter that is used to update values.

The recipes in this chapter start out by showing you how to retrieve values from pd.Series and pd.DataFrame objects, with ever-increasing complexity. We will eventually introduce the pd.MultiIndex, which can be used to select data hierarchically, before finally ending with an introduction to the assignment operators. The pandas API takes great care to reuse many of the same methods for selection and assignment, which ultimately allows you to be very expressive in how you would like to interact with your data.

By the end...

Basic selection from a Series

Selection from a pd.Series involves accessing elements either by their position or by their label. This is akin to accessing elements in a list by their index or in a dictionary by their key, respectively. The versatility of the pd.Series object allows intuitive and straightforward data retrieval, making it an essential tool for data manipulation.

The pd.Series is considered a container in Python, much like the built-in list, tuple, and dict objects. As such, for simple selection operations, the first place users turn to is the Python index operator, using the [] syntax.

How to do it

To introduce the basics of selection, let’s start with a very simple pd.Series:

ser = pd.Series(list("abc") * 3)
ser

0    a
1    b
2    c
3    a
4    b
5    c
6    a
7    b
8    c
dtype: object

In Python, you’ve already discovered that the [] operator can be used to select elements from a container; i.e., some_dictionary[0...
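As a minimal sketch of this idea, assuming the default auto-numbered index shown above: an integer passed to [] is matched against the index labels, which here happen to coincide with the row positions. The variable name fourth is illustrative only.

```python
import pandas as pd

ser = pd.Series(list("abc") * 3)

# The default index labels run from 0 to 8, so the label 3 is also the fourth row.
fourth = ser[3]
print(fourth)
```

Because the labels and positions line up in this case, the distinction between the two is easy to miss; it only becomes visible once the index is customized.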

Basic selection from a DataFrame

When using the [] operator with a pd.DataFrame, simple selection typically involves selecting data from the column index rather than the row index. This distinction is crucial for effective data manipulation and analysis. Columns in a pd.DataFrame can be accessed by their labels, making it easy to work with named data from a pd.Series within the larger pd.DataFrame structure.

Understanding this fundamental difference in selection behavior is key to utilizing the full power of a pd.DataFrame in pandas. By leveraging the [] operator, you can efficiently access and manipulate specific columns of data, setting the stage for more advanced operations and analyses.

How to do it

Let’s start by creating a simple 3x3 pd.DataFrame. The values of the pd.DataFrame are not important, but we are intentionally going to provide our own column labels instead of having pandas create an auto-numbered column index for us:

df = pd.DataFrame(np.arange...
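The column-oriented behavior described above can be sketched as follows. The frame below is a hypothetical stand-in (the column labels "a", "b", and "c" are placeholders, not necessarily the ones used in the recipe), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame; the column labels are placeholders.
df = pd.DataFrame(np.arange(9).reshape(3, -1), columns=["a", "b", "c"])

one_col = df["a"]          # a single label returns a pd.Series
two_cols = df[["a", "c"]]  # a list of labels returns a pd.DataFrame
```

Passing a list, even a list with one element, keeps the result two-dimensional, which is often what downstream code expects.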

Position-based selection of a Series

As discussed back in the Basic selection from a DataFrame section, using [] as a selection mechanism does not signal the clearest intent and can sometimes be downright confusing. New users are often surprised that ser[42] selects the label matching the number 42 rather than the 42nd row of a pd.Series, and the ambiguity grows even more complex as you start trying to select in two dimensions with the [] operator on a pd.DataFrame.

To clearly signal that you are trying to select by position instead of by label, you should use pd.Series.iloc.

How to do it

Let’s create a pd.Series where we have an index using integral labels that are also non-unique:

ser = pd.Series(["apple", "banana", "orange"], index=[0, 1, 1])
ser

0     apple
1    banana
1    orange
dtype: object

To select a scalar, you can use pd.Series.iloc with an integer argument:

ser.iloc[1]

banana

...
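A short sketch of the three argument styles that pd.Series.iloc accepts, reusing the pd.Series above. The variable names are illustrative.

```python
import pandas as pd

ser = pd.Series(["apple", "banana", "orange"], index=[0, 1, 1])

second = ser.iloc[1]      # scalar by position, regardless of the duplicate labels
ends = ser.iloc[[0, 2]]   # list of positions
first_two = ser.iloc[:2]  # slice; the stop position is excluded
```

Note that none of these calls consult the index labels, so the duplicated label 1 causes no ambiguity.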

Position-based selection of a DataFrame

Much like with a pd.Series, integers, lists of integers, and slice objects are all valid arguments to pd.DataFrame.iloc. However, with a pd.DataFrame, you will typically supply two arguments. The first argument handles selecting from the rows, and the second is responsible for the columns.

In most use cases, users reach for position-based selection when retrieving rows and label-based selection when retrieving columns. We will cover the latter in the Label-based selection from a DataFrame section and will show you how to combine both in the Mixing position-based and label-based selection section. However, when your row index uses the default pd.RangeIndex and the order of columns is significant, the techniques shown in this section will be of immense value.

How to do it

Let’s create a pd.DataFrame with five rows and four columns:

df = pd.DataFrame(np.arange(20).reshape(5, -1), columns=list("abcd"))
df

     a     b     c...
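The row-then-column argument order can be sketched with the same 5x4 pd.DataFrame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, -1), columns=list("abcd"))

cell = df.iloc[2, 1]              # third row, second column
corner = df.iloc[[0, 1], [0, 3]]  # first two rows, first and last columns
head = df.iloc[:2, :]             # first two rows, every column
```

An empty slice (:) in either slot means "everything along this axis", which is a convenient way to restrict only one dimension.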

Label-based selection from a Series

In pandas, pd.Series.loc is used to perform selection by label instead of by position. This method is particularly useful when you consider the pd.Index of your pd.Series to contain lookup values, much like the key in a Python dictionary, rather than giving importance to the order or position of data in your pd.Series.

How to do it

Let’s create a pd.Series where we have a row index using integral labels that are also non-unique:

ser = pd.Series(["apple", "banana", "orange"], index=[0, 1, 1])
ser

0     apple
1    banana
1    orange
dtype: object

pd.Series.loc will select all rows where the index has a label of 1:

ser.loc[1]

1    banana
1    orange
dtype: object

Of course, you are not limited to integral labels in pandas. Let’s see what this looks like with a pd.Index composed of string values:

ser = pd.Series([2, 2, 4], index=["dog", "cat", "...
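One practical difference between label-based and position-based selection is how slices treat their end point. The sketch below uses a hypothetical pd.Series with placeholder labels "x", "y", and "z" (not the labels from the recipe) to contrast the two.

```python
import pandas as pd

# Hypothetical labels chosen for illustration only.
ser = pd.Series([10, 20, 30], index=["x", "y", "z"])

by_label = ser.loc["x":"y"]  # label slice: both endpoints are included
by_position = ser.iloc[0:1]  # position slice: the stop is excluded
```

The inclusive end point of a label slice makes sense when labels are treated as lookup keys, since there is no natural "one past the end" label.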

Mixing position-based and label-based selection

Since pd.DataFrame.iloc is used for position-based selection and pd.DataFrame.loc is for label-based selection, users must take an extra step if attempting to select by label in one dimension and by position in another. As mentioned in previous sections, the majority of pd.DataFrame objects constructed will place heavy significance on the labels used for the columns, with little care for how those columns are ordered. The inverse is true for the rows, so being able to effectively mix and match both styles is of immense value.

How to do it

Let’s start with a pd.DataFrame that uses the default auto-numbered pd.RangeIndex in the rows but has custom string labels for the columns:

df = pd.DataFrame([
    [24, 180, "blue"],
    [42, 166, "brown"],
    [22, 160, "green"],
], columns=["age", "height_cm", "eye_color"])
df

     age   height_cm    eye_color
0  ...
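Two possible ways to combine the styles are sketched below; neither is presented as the recipe's own solution. The first chains pd.DataFrame.iloc and pd.DataFrame.loc, and the second translates column labels to positions with pd.Index.get_indexer. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    [24, 180, "blue"],
    [42, 166, "brown"],
    [22, 160, "green"],
], columns=["age", "height_cm", "eye_color"])

# Option 1: chain the two indexers, positions first and labels second.
chained = df.iloc[[0, 1]].loc[:, ["age", "eye_color"]]

# Option 2: translate the column labels to positions, then call .iloc once.
col_positions = df.columns.get_indexer(["age", "eye_color"])
single_call = df.iloc[[0, 1], col_positions]
```

Both options return the same rows and columns; the single-call form avoids creating an intermediate object.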

DataFrame.filter

pd.DataFrame.filter is a specialized method that allows you to select from either the rows or columns of a pd.DataFrame.

How to do it

Let’s create a pd.DataFrame where we have indices composed of strings in both the rows and columns:

df = pd.DataFrame([
    [24, 180, "blue"],
    [42, 166, "brown"],
    [22, 160, "green"],
], columns=[
    "age",
    "height_cm",
    "eye_color"
], index=["Jack", "Jill", "Jayne"])
df

        age   height_cm   eye_color
Jack    24    180         blue
Jill    42    166         brown
Jayne   22    160         green

By default, pd.DataFrame.filter will select columns matching the label argument(s), similar to pd.DataFrame[]:

df.filter(["age", "eye_color"])

       age   eye_color
Jack   24    blue
Jill   42    brown
Jayne  22    green

However, pd.DataFrame.filter also accepts an axis=...
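A brief sketch of a few other ways to call pd.DataFrame.filter on the same frame, using its items, like, regex, and axis parameters. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    [24, 180, "blue"],
    [42, 166, "brown"],
    [22, 160, "green"],
], columns=["age", "height_cm", "eye_color"], index=["Jack", "Jill", "Jayne"])

rows = df.filter(["Jack", "Jill"], axis=0)  # exact row labels
underscored = df.filter(like="_")           # column labels containing "_"
ja_rows = df.filter(regex=r"^Ja", axis=0)   # row labels matching a regular expression
```

In every case the method matches against labels only; it never inspects the values stored in the pd.DataFrame.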

Selection by data type

So far in this cookbook, we have seen data types, but we have not talked too much in depth about what they are. We still aren’t quite there; a deep dive into the type system of pandas is reserved for Chapter 3, Data Types. However, for now, you should be aware that the column type provides metadata that pd.DataFrame.select_dtypes can use for selection.

How to do it

Let’s start with a pd.DataFrame that uses integral, floating point, and string columns:

df = pd.DataFrame([
    [0, 1.0, "2"],
    [4, 8.0, "16"],
], columns=["int_col", "float_col", "string_col"])
df

    int_col   float_col   string_col
0   0         1.0         2
1   4         8.0         16

Use pd.DataFrame.select_dtypes to select only integral columns:

df.select_dtypes("int")

    int_col
0   0
1   4

Multiple types can be selected if you pass a list argument:

df.select_dtypes(include...
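As a hedged sketch of the include and exclude parameters, the example below uses the generic "number" selector, which covers both the integral and floating point columns of the frame above. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    [0, 1.0, "2"],
    [4, 8.0, "16"],
], columns=["int_col", "float_col", "string_col"])

numeric = df.select_dtypes(include="number")      # integral and floating point columns
non_numeric = df.select_dtypes(exclude="number")  # everything else
```

Using exclude is handy when it is easier to describe what you do not want than to list every type you do.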

Selection/filtering via Boolean arrays

Using Boolean lists/arrays (also referred to as masks) is a very common method to select a subset of rows.

How to do it

Let’s create a mask of True/False values alongside a simple pd.Series:

mask = [True, False, True]
ser = pd.Series(range(3))
ser

0    0
1    1
2    2
dtype: int64

Using the mask as an argument to pd.Series[] will return each row where the corresponding mask entry is True:

ser[mask]

0    0
2    2
dtype: int64

pd.Series.loc will match the exact same behavior as pd.Series[] in this particular case:

ser.loc[mask]

0    0
2    2
dtype: int64

Interestingly, whereas pd.DataFrame[] usually tries to select from the columns when provided a list argument, its behavior with a sequence of Boolean values is different. Using the mask we have already created, df[mask] will actually match along the rows rather than the columns:

df = pd.DataFrame(np.arange(6).reshape(3, -1))
df[mask...
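The row-wise behavior of a Boolean mask on a pd.DataFrame can be sketched as follows. The second half shows a mask computed from a comparison rather than written by hand, which is an addition for illustration; the variable names are likewise illustrative.

```python
import numpy as np
import pandas as pd

mask = [True, False, True]
df = pd.DataFrame(np.arange(6).reshape(3, -1))
kept_rows = df[mask]  # the mask is matched against the rows, not the columns

# Masks are more commonly computed from the data itself.
ser = pd.Series(range(3))
computed = ser[ser > 0]
```

A comparison such as ser > 0 produces a Boolean pd.Series with the same index as ser, so it can be passed straight back into the selection.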

Selection with a MultiIndex – A single level

A pd.MultiIndex is a subclass of a pd.Index that supports hierarchical labels. Depending on who you ask, this can be one of the best or one of the worst features of pandas. After reading this cookbook, I hope you consider it one of the best.

Much of the derision toward the pd.MultiIndex comes from the fact that the syntax used to select from it can easily become ambiguous, especially when using pd.DataFrame[]. The examples below exclusively use the pd.DataFrame.loc method and avoid pd.DataFrame[] to mitigate confusion.

How to do it

pd.MultiIndex.from_tuples can be used to construct a pd.MultiIndex from a list of tuples. In the following example, we create a pd.MultiIndex with two levels – first_name and last_name, sequentially. We will pair this alongside a very simple pd.Series:

index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("...
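A minimal sketch of selecting on the first level only, using the pd.Series that the next section recreates. The variable names johns and johns_full are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])
ser = pd.Series(range(4), index=index)

johns = ser.loc["John"]         # scalar label: matched against the first level
johns_full = ser.loc[["John"]]  # list of labels: both index levels are kept
```

Wrapping the label in a list is a small change, but it preserves the full pd.MultiIndex on the result, which can matter if you plan to combine the selection with other data later.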

Selection with a MultiIndex – Multiple levels

Things would not be that interesting if you could only select from the first level of a pd.MultiIndex. Fortunately, pd.DataFrame.loc will scale out to more than just the first level through the creative use of tuple arguments.

How to do it

Let’s recreate the pd.Series from the previous section:

index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])
ser = pd.Series(range(4), index=index)
ser

first_name  last_name
John        Smith        0
            Doe          1
Jane        Doe          2
Stephen     Smith        3
dtype: int64

To select all records where the first index level uses the label "Jane" and the second uses "Doe", pass the following tuple:

ser.loc[("Jane...
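The tuple-based selection described in this section can be sketched as follows. The use of slice(None) as a wildcard for the first level is one possible approach, and the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])
ser = pd.Series(range(4), index=index)

jane_doe = ser.loc[("Jane", "Doe")]       # one label per level returns a scalar
all_does = ser.loc[(slice(None), "Doe")]  # any first name, last name "Doe"
```

Each position in the tuple corresponds to one level of the pd.MultiIndex, so slice(None) reads as "do not restrict this level".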

Selection with a MultiIndex – a DataFrame

A pd.MultiIndex can be used both as a row index and a column index, and selection via pd.DataFrame.loc works with both.

How to do it

Let’s create a pd.DataFrame that uses a pd.MultiIndex in both the rows and columns:

row_index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])
col_index = pd.MultiIndex.from_tuples([
    ("music", "favorite"),
    ("music", "last_seen_live"),
    ("art", "favorite"),
], names=["art_type", "category"])
df = pd.DataFrame([
   ["Swift", "Swift", "Matisse"],
   ["Mozart", "T. Swift", "Van Gogh"],
   ["Beatles", "Wonder", "...
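Since the full frame is not shown above, the sketch below builds a smaller, hypothetical pd.DataFrame from only the first two rows, and assumes the two indices are passed as index=row_index and columns=col_index. It illustrates that pd.DataFrame.loc accepts one tuple for the rows and another for the columns. The variable names are illustrative.

```python
import pandas as pd

# Only the first two rows are used; the rest of the recipe's frame is not reproduced.
row_index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
], names=["first_name", "last_name"])
col_index = pd.MultiIndex.from_tuples([
    ("music", "favorite"),
    ("music", "last_seen_live"),
    ("art", "favorite"),
], names=["art_type", "category"])
df = pd.DataFrame([
    ["Swift", "Swift", "Matisse"],
    ["Mozart", "T. Swift", "Van Gogh"],
], index=row_index, columns=col_index)

painter = df.loc[("John", "Doe"), ("art", "favorite")]  # full row key, full column key
music = df.loc[:, "music"]                              # every row, first column level only
```

The first argument always addresses the row index and the second the column index, exactly as with a single-level pd.DataFrame.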

Item assignment with .loc and .iloc

The pandas library is optimized for reading, exploring, and evaluating data. Operations that try to mutate or change data are far less efficient.

However, when you must mutate your data, you can use .loc and .iloc to do it.

How to do it

Let’s start with a very small pd.Series:

ser = pd.Series(range(3), index=list("abc"))

pd.Series.loc is useful when you want to assign a value by matching against the label of an index. For example, if we wanted to store the value 42 where our row index contained a value of "b", we would write:

ser.loc["b"] = 42
ser

a     0
b    42
c     2
dtype: int64

pd.Series.iloc is used when you want to assign a value positionally. To assign the value -42 to the third element in our pd.Series, we would write:

ser.iloc[2] = -42
ser

a     0
b    42
c   -42
dtype: int64

There’s more…

The cost of mutating data through pandas can...
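The same assignment pattern extends to a pd.DataFrame, where both a row and a column are specified. The frame below is hypothetical and exists only to illustrate the syntax.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({
    "age": [24, 42, 22],
    "eye_color": ["blue", "brown", "green"],
})

df.loc[1, "age"] = 43    # row label 1, column label "age"
df.iloc[0, 1] = "hazel"  # first row, second column, both by position
```

Assigning through a single .loc or .iloc call, rather than chaining two selections, keeps it unambiguous which object is being modified.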

DataFrame column assignment

While assigning to data can be a relatively expensive operation in pandas, assigning columns to a pd.DataFrame is a common operation.

How to do it

Let’s create a very simple pd.DataFrame:

df = pd.DataFrame({"col1": [1, 2, 3]})
df

New columns can be assigned using the pd.DataFrame[] operator. The simplest type of assignment can take a scalar value and broadcast it to every row of the pd.DataFrame:

df["new_column1"] = 42
df

    col1   new_column1
0   1      42
1   2      42
2   3      42

You can also assign a pd.Series or sequence as long as the number of elements matches the number of rows in the pd.DataFrame:

df["new_column2"] = list("abc")
df

    col1   new_column1   new_column2
0   1      42            a
1   2      42            b
2   3      42            c

df["new_column3"] = pd.Series(["dog", "cat"...
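One detail worth keeping in mind when assigning a pd.Series is that pandas aligns it to the pd.DataFrame by index label rather than by position. The sketch below uses a hypothetical column name and illustrative values, with a deliberately reversed index to make the alignment visible.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

# The index of the assigned Series is deliberately reversed (illustrative values).
df["aligned"] = pd.Series(["x", "y", "z"], index=[2, 1, 0])
print(df)
```

The value labeled 0 in the assigned pd.Series ends up in the row labeled 0, so the new column reads z, y, x from top to bottom. A plain list, by contrast, has no labels and is always assigned by position.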

Key benefits

This book targets features in pandas 2.x and beyond

Practical, easy to implement recipes for quick solutions to common problems in data using pandas

Master the fundamentals of pandas to quickly begin exploring any dataset

Description

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. With this latest edition, unlock the full potential of pandas 2.x onwards. Whether you're a beginner or an experienced data analyst, this book offers a wealth of practical recipes to help you excel in your data analysis projects. This cookbook covers everything from fundamental data manipulation tasks to advanced techniques for handling big data, visualization, and more. Each recipe is designed to address common real-world challenges, providing clear explanations and step-by-step instructions to guide you through the process. Explore cutting-edge topics such as idiomatic pandas coding, efficient handling of large datasets, and advanced data visualization techniques. Whether you're looking to sharpen or expand your skills, the "Pandas Cookbook" is your essential companion for mastering data analysis and manipulation with pandas 2.x and beyond.

Who is this book for?

This book is for Python developers, data scientists, engineers, and analysts. pandas is the ideal tool for manipulating structured data with Python, and this book provides ample instruction and examples. Not only does it cover the basics required to be proficient, but it also goes into the details of idiomatic pandas.

What you will learn

The pandas type system and how to best navigate it

Import/export DataFrames to/from common data formats

Data exploration in pandas through dozens of practice problems

Grouping, aggregation, transformation, reshaping, and filtering data

Merge data from different sources through pandas SQL-like operations

Leverage the robust pandas time series functionality in advanced analyses

Scale pandas operations to get the most out of your system

The large ecosystem that pandas can coordinate with and supplement

JW Nov 02, 2024

Pandas Cookbook third edition is an excellent reference for pandas users. I have been writing Python code and using pandas for more than 10 years, and still managed to learn something every chapter reading this book. I found it to be written in such a way that it allowed for reading cover to cover, but would also be useful jumping straight to the sections you need when trying to use as a reference. The examples in the book remained short enough for readers to use themselves but still clear enough to demonstrate both the "how" and "why." This book I found approachable enough that I would even recommend it to people just getting introduced to pandas. Experienced pandas will likely appreciate this book explains not just the "how" to accomplish a task but also the "why." Covering topics like reading data in from different sources, various ways to select data, how to perform aggregations and transformations well, working with complex types (such as datetimes), performance tuning, and visualizations, this is a book that I will find myself reaching for regularly.

Amazon Verified review

Robert Nov 03, 2024

The third edition of Cooking with Pandas was a welcome resource for learning about Pandas in Python. The book starts with the foundations and continues to build throughout. Like any good cookbook, there is a quick explanation of the material and a section on how to perform the task. The book is for anyone interested in Pandas, from beginners to well-seasoned developers. You can't go wrong by picking up this book. You will learn a lot!

Souvik Roy Nov 02, 2024

The book is perfect for beginners as there are tons of resources online, and this book tries to bring it all up in one place. I am new to Time Series, so this book will personally help me to know more about it and make me efficient in using the knowledge in the book in my real-world projects.

Fernando Villanueva Dec 14, 2024

Feefo Verified review

Rajavel S Feb 09, 2025

Covers below topics with good examples and key information that can help speed up your work:
- Pandas Foundation (series, data frames, index, etc.)
- Selection and Assignments
- Data Types
- Pandas I/O Systems
- Algorithms & How to apply them
- Visualizations
- Reshaping Dataframes
- Group By
- Temporal Data Types and Algorithms
- General Usage and Performance Tips
- Pandas ecosystems

It also gives you with some other books recommendations for learning Panda at Ease. Hope you enjoy learning and utilizing this book!
