Pandas 1.x Cookbook

You're reading from Pandas 1.x Cookbook: Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Product type Paperback
Published in Feb 2020
Publisher Packt
ISBN-13 9781839213106
Length 626 pages
Edition 2nd Edition
Authors (2):
Theodore Petrou
Matthew Harrison

Comparing missing values

pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object with interesting mathematical properties. For instance, it is not equal to itself. By contrast, Python's None object evaluates as True when compared to itself:

>>> import numpy as np
>>> np.nan == np.nan
False
>>> None == None
True

All other comparisons against np.nan also return False, except not equal to (!=):

>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
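
The comparison rules above come from IEEE 754 floating-point arithmetic rather than from pandas itself, and np.nan is an ordinary Python float. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the recipe:

```python
import math

nan = float("nan")  # a float NaN, behaving like np.nan in comparisons

# NaN is never equal to itself, so == is False and != is True.
nan_equals_itself = nan == nan
nan_differs_from_itself = nan != nan

# Every ordering comparison against NaN is False.
ordering_results = [nan > 5, nan < 5, nan >= 5, nan <= 5]

# A dedicated predicate is the reliable way to detect NaN.
detected = math.isnan(nan)

print(nan_equals_itself, nan_differs_from_itself, ordering_results, detected)
```

Because equality cannot find NaN, detection always goes through a predicate such as math.isnan, np.isnan, or the pandas .isna method used later in this recipe.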

Getting ready

Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.

As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:

>>> import pandas as pd
>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
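
If the college dataset is not at hand, the element-by-element behavior of == can be tried on a small stand-in. The sketch below uses a made-up DataFrame (the school names and values are hypothetical) and checks that the result has the same dimensions as the input:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for college_ugds; the values are invented.
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.5, 0.0019, np.nan],
        "UGDS_BLACK": [0.25, 0.5, np.nan],
    },
    index=["School A", "School B", "School C"],
)

# Compare every element against a scalar; the result is a Boolean DataFrame.
mask = toy == 0.0019

print(mask)
print(mask.shape == toy.shape)
```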

How to do it...

  1. To get an idea of how the equals operator works, let's compare each element to a scalar value:
    >>> college_ugds == 0.0019
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False       False  ...     False      False
    Universit...       False       False  ...     False      False
    Amridge U...       False       False  ...     False      False
    Universit...       False       False  ...     False      False
    Alabama S...       False       False  ...     False      False
    ...                  ...         ...  ...       ...        ...
    SAE Insti...       False       False  ...     False      False
    Rasmussen...       False       False  ...     False      False
    National ...       False       False  ...     False      False
    Bay Area ...       False       False  ...     False      False
    Excel Lea...       False       False  ...     False      False
    
  2. This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance, college_ugds compared against itself, as follows:
    >>> college_self_compare = college_ugds == college_ugds
    >>> college_self_compare.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Alabama A...        True        True  ...      True       True
    Universit...        True        True  ...      True       True
    Amridge U...        True        True  ...      True       True
    Universit...        True        True  ...      True       True
    Alabama S...        True        True  ...      True       True
    
  3. At first glance, all the values appear to be equal, as you would expect. However, using the .all method to determine if each column contains only True values yields an unexpected result:
    >>> college_self_compare.all()
    UGDS_WHITE    False
    UGDS_BLACK    False
    UGDS_HISP     False
    UGDS_ASIAN    False
    UGDS_AIAN     False
    UGDS_NHPI     False
    UGDS_2MOR     False
    UGDS_NRA      False
    UGDS_UNKN     False
    dtype: bool
    
  4. This happens because missing values do not compare as equal to one another. If you tried to count missing values by using the equals operator and summing the Boolean columns, you would get zero for each one:
    >>> (college_ugds == np.nan).sum()
    UGDS_WHITE    0
    UGDS_BLACK    0
    UGDS_HISP     0
    UGDS_ASIAN    0
    UGDS_AIAN     0
    UGDS_NHPI     0
    UGDS_2MOR     0
    UGDS_NRA      0
    UGDS_UNKN     0
    dtype: int64
    
  5. Instead of using == to find missing values, use the .isna method:
    >>> college_ugds.isna().sum()
    UGDS_WHITE    661
    UGDS_BLACK    661
    UGDS_HISP     661
    UGDS_ASIAN    661
    UGDS_AIAN     661
    UGDS_NHPI     661
    UGDS_2MOR     661
    UGDS_NRA      661
    UGDS_UNKN     661
    dtype: int64
    
  6. The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the equivalent of ==):
    >>> college_ugds.equals(college_ugds)
    True
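
Steps 2 to 6 can be reproduced without the college dataset. The following is a minimal sketch on a made-up DataFrame with one missing value per column; the names toy, x, and y are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column holds exactly one missing value.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, 6.0]})

# Steps 2 and 3: a self-comparison is False wherever a NaN sits,
# so no column is entirely True.
self_compare = toy == toy
all_equal_by_column = self_compare.all()

# Step 4: == never matches NaN, so this count is zero per column.
nan_count_with_eq = (toy == np.nan).sum()

# Step 5: .isna finds the missing values.
nan_count_with_isna = toy.isna().sum()

# Step 6: .equals treats NaNs in the same location as equal.
same_frame = toy.equals(toy.copy())

print(all_equal_by_column, nan_count_with_eq, nan_count_with_isna, same_frame, sep="\n")
```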
    

How it works...

Step 1 compares a DataFrame to a scalar value, while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation first checks whether the DataFrames have identically labeled indexes and columns, and thus the same number of elements. If they do not, the operation raises a ValueError.
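
To illustrate the label check, the sketch below compares two small, made-up DataFrames whose indexes differ. The == operator raises a ValueError, while the .equals method simply reports that the DataFrames are not the same. The exact wording of the error message varies between pandas versions, so it is printed rather than matched:

```python
import pandas as pd

# Hypothetical frames with the same values but different index labels.
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"a": [1, 2]}, index=["x", "z"])

try:
    left == right
    operator_raised = False
except ValueError as err:
    operator_raised = True
    print(f"ValueError: {err}")

# .equals does not raise; it returns False for mismatched labels.
method_result = left.equals(right)
print(method_result)
```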

Step 3 shows that no column compares as entirely equal to itself: each column contains at least one False because of its missing values. Step 4 further shows that np.nan is not equal to itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a single Boolean value.
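
The .equals method returns a single value, so it cannot say where two DataFrames differ. When an element-by-element result is still wanted, one possible approach, which is not part of the recipe, is to combine == with .isna so that NaNs in matching positions count as equal. The DataFrames a and b below are made up for illustration:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
b = pd.DataFrame({"x": [1.0, np.nan, 4.0]})

# Plain ==: the NaN row compares as False.
plain = a == b

# NaN-aware mask: equal values, or missing in both frames at that position.
nan_aware = (a == b) | (a.isna() & b.isna())

print(plain["x"].tolist())
print(nan_aware["x"].tolist())
```

The NaN-aware mask is True in the first two rows and False only in the last row, where the values genuinely differ.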

There's more...

All the comparison operators have method counterparts that accept extra parameters, such as axis, which controls how a Series is aligned with the DataFrame. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equals (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:

>>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False
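
As one example of what the method form can do beyond the operator, .eq accepts an axis parameter. The sketch below uses made-up data and compares every column against a Series whose labels match the DataFrame's index; df and row_targets are hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1, 3]}, index=["x", "y"])

# One target value per row, labeled like the DataFrame's index.
row_targets = pd.Series([1, 3], index=["x", "y"])

# axis="index" aligns the Series with the rows rather than the columns.
by_row = df.eq(row_targets, axis="index")

# With a scalar, .eq and == give the same result.
scalar_matches = df.eq(1).equals(df == 1)

print(by_row)
print(scalar_matches)
```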

The pandas.testing module contains a function that developers should use when writing unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal and returns None if they are:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
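
The sketch below shows both outcomes of assert_frame_equal on small, made-up DataFrames. Like the .equals method, it treats NaNs in the same location as equal. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"x": [1.0, np.nan]})
same = pd.DataFrame({"x": [1.0, np.nan]})
different = pd.DataFrame({"x": [1.0, 2.0]})

# Passes silently: the NaNs sit in the same position.
assert_frame_equal(expected, same)

# Raises AssertionError because the second row differs.
try:
    assert_frame_equal(expected, different)
    raised = False
except AssertionError:
    raised = True

print(raised)
```

In a test suite, the call is normally left unguarded so that the test runner reports the AssertionError as a failure.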

Unit tests are a very important part of software development and ensure that the code is running correctly. pandas contains many thousands of unit tests that help ensure that it is running properly. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).
