Pandas 1.x Cookbook: Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Product type Paperback

Published in Feb 2020

Publisher Packt

ISBN-13 9781839213106

Length 626 pages

Edition 2nd Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Theodore Petrou

Matthew Harrison

Comparing missing values

pandas uses the NumPy NaN (np.nan) object to represent a missing value. np.nan is an unusual object with surprising mathematical properties. For instance, it is not equal to itself, whereas even Python's None object compares equal to itself:

>>> np.nan == np.nan
False
>>> None == None
True

All other comparisons against np.nan also return False, except not equal to (!=):

>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
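Because equality can never detect NaN, a dedicated check is needed even for a single scalar. The following is a minimal sketch using only the standard library and NumPy; the variable names are illustrative and not part of the recipe:

```python
import math

import numpy as np

value = np.nan

# Equality cannot detect NaN, because NaN is never equal to anything.
found_with_eq = value == np.nan

# Dedicated checks report the missing value correctly.
found_with_math = math.isnan(value)
found_with_numpy = bool(np.isnan(value))

# NaN is the only float that is not equal to itself,
# so a self-inequality test also identifies it.
found_with_self_ne = value != value

print(found_with_eq, found_with_math, found_with_numpy, found_with_self_ne)
```

Only the first check comes back False; the other three identify the missing value.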

Getting ready

Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.

As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

How to do it...

To get an idea of how the equals operator works, let's compare each element to a scalar value:

>>> college_ugds == 0.0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False
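The truncated display above makes it hard to see whether any element matched. As a rough sketch, summing the Boolean result counts the matches per column. The small `ugds_sample` DataFrame below is made-up illustration data, not the college dataset:

```python
import pandas as pd

# Hypothetical stand-in for college_ugds, with invented values.
ugds_sample = pd.DataFrame(
    {
        "UGDS_WHITE": [0.0333, 0.0019, 0.5],
        "UGDS_BLACK": [0.0019, 0.0019, 0.25],
    },
    index=["School A", "School B", "School C"],
)

# Element-by-element comparison against a scalar.
matches = ugds_sample == 0.0019

# True counts as 1 when summed, so this counts matches per column.
matches_per_column = matches.sum()

# Summing again gives the total number of matching cells.
total_matches = int(matches_per_column.sum())

print(matches_per_column)
print(total_matches)
```

The comparison result has the same shape and labels as the original DataFrame, which is what makes the column-wise sum meaningful.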

This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance, college_ugds compared against itself, as follows:

>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Amridge U...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Alabama S...        True        True  ...      True       True

At first glance, all the values appear to be equal, as you would expect. However, using the .all method to determine if each column contains only True values yields an unexpected result:

>>> college_self_compare.all()
UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool
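The same effect can be reproduced on a tiny DataFrame, which makes the cause easier to see. This is a sketch on invented data; `sample` and its column names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# One column is complete, the other has a single missing value.
sample = pd.DataFrame(
    {
        "complete": [0.1, 0.2, 0.3],
        "has_gap": [0.1, np.nan, 0.3],
    }
)

# Compare the DataFrame against itself, element by element.
self_compare = sample == sample

# The first row looks fine, just like the .head() output above.
first_row_all_true = bool(self_compare.iloc[0].all())

# Column-wise check: any NaN makes the whole column fail.
column_all_true = self_compare.all()

print(self_compare)
print(column_all_true)
```

Only the column containing NaN fails the `.all` check, because the NaN cell compares as False against itself.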

This happens because missing values do not compare as equal to one another. If you tried to count missing values by using the equals operator and summing the Boolean columns, you would get zero for each one:

>>> (college_ugds == np.nan).sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

Instead of using == to find missing numbers, use the .isna method:

>>> college_ugds.isna().sum()
UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64
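The contrast between the two counting approaches can be checked side by side. The sketch below uses a small invented DataFrame (`sample` is an assumption, not the college data) and also shows one way to flag rows that contain any missing value:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
    }
)

# Comparing with == never matches NaN, so every count is zero.
count_via_eq = (sample == np.nan).sum()

# .isna marks missing cells as True, so the sum is the real count.
count_via_isna = sample.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = sample.isna().any(axis="columns")

print(count_via_eq)
print(count_via_isna)
print(rows_with_missing)
```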

The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the equivalent of ==):
```
>>> college_ugds.equals(college_ugds)
True
```
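To make the difference concrete, here is a small sketch on invented data that compares the two approaches. It also illustrates a point worth knowing about `.equals`: the column dtypes must match as well as the values. The DataFrame names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan]})
right = left.copy()

# Element-wise comparison fails on the NaN cell.
elementwise_all = bool((left == right).all().all())

# .equals treats NaNs in the same location as equal.
whole_frame_equal = left.equals(right)

# .equals is also dtype-sensitive: int64 versus float64 columns differ.
int_frame = pd.DataFrame({"x": [1, 2]})
float_frame = pd.DataFrame({"x": [1.0, 2.0]})
same_values_different_dtype = int_frame.equals(float_frame)

print(elementwise_all, whole_frame_equal, same_values_different_dtype)
```

The dtype check means `.equals` can return False for DataFrames that look identical when printed.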

How it works...

Step 1 compares a DataFrame to a scalar value while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation first checks whether the DataFrames have identically labeled indexes and thus the same number of elements. If the labels differ, pandas raises a ValueError instead of returning a result.

Step 3 verifies that none of the columns in the DataFrames are equivalent to each other. Step 4 further shows the non-equivalence of np.nan and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a Boolean scalar value.
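The label check performed by the equals operator can be seen with two DataFrames whose indexes differ. This is a minimal sketch with invented data and names; it contrasts the operator, which raises an exception, with the .equals method, which simply reports False:

```python
import pandas as pd

frame_a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
frame_b = pd.DataFrame({"x": [1, 2]}, index=["r1", "r3"])

# == refuses to compare DataFrames that are not identically labeled.
try:
    frame_a == frame_b
    operator_raised = False
except ValueError:
    operator_raised = True

# .equals always returns a Boolean scalar, even when labels differ.
equals_result = frame_a.equals(frame_b)

print(operator_raised, equals_result)
```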

There's more...

All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equals (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:

>>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False
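One example of the extra functionality is the `axis` parameter, which controls how a Series is aligned against the DataFrame. With the operator, a Series is matched against the column labels; with `.eq(..., axis="index")`, it is matched against the row labels instead. The sketch below uses invented data and names:

```python
import pandas as pd

scores = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1, 5, 3]},
    index=["r1", "r2", "r3"],
)

# One target value per row, labeled like the DataFrame's index.
targets = pd.Series([1, 2, 3], index=["r1", "r2", "r3"])

# Compare every column against the per-row targets.
matches_by_row = scores.eq(targets, axis="index")

# .ne is the method counterpart of !=.
differs_by_row = scores.ne(targets, axis="index")

print(matches_by_row)
print(differs_by_row)
```

The other comparison operators have similar counterparts: `.ne`, `.lt`, `.le`, `.gt`, and `.ge`.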

The pandas.testing sub-package provides a function that developers should use when writing unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal and returns None if they are:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
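In a test, the function is usually called on its own, so that a mismatch surfaces as a failing assertion. The following sketch uses invented data and catches the error only to show both outcomes; a real test would let the AssertionError propagate to the test runner:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"x": [1.0, np.nan]})
result = expected.copy()

# Passes silently: NaNs in the same location are treated as equal.
assert_frame_equal(result, expected)

# A DataFrame with a different value triggers an AssertionError.
changed = pd.DataFrame({"x": [1.0, 2.0]})
try:
    assert_frame_equal(changed, expected)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(mismatch_detected)
```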

Unit tests are a very important part of software development and ensure that the code is running correctly. pandas contains many thousands of unit tests that help ensure that it is running properly. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).