Comparing missing values
pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object with interesting mathematical properties. For instance, it is not equal to itself. Even Python's None object evaluates as True when compared to itself:
>>> np.nan == np.nan
False
>>> None == None
True
All other comparisons against np.nan also return False, except not equal to (!=):
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
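Because np.nan is never equal to itself, an equality test cannot detect it. The following minimal sketch (the variable names are illustrative and not part of the recipe) shows three ways to test a single scalar for NaN:

```python
import math

import numpy as np

value = np.nan

# NaN is the only float value that is not equal to itself,
# so a self-comparison with != can serve as a NaN test.
is_nan_by_self_compare = value != value

# The standard library and NumPy both provide explicit checks.
is_nan_math = math.isnan(value)
is_nan_numpy = bool(np.isnan(value))

print(is_nan_by_self_compare, is_nan_math, is_nan_numpy)
```

The explicit functions are usually easier to read than the self-comparison trick, which relies on the reader knowing this property of NaN.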
Getting ready
Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.
As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
How to do it...
- To get an idea of how the equals operator works, let's compare each element to a scalar value:
>>> college_ugds == 0.0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
Amridge U... False False ... False False
Universit... False False ... False False
Alabama S... False False ... False False
... ... ... ... ... ...
SAE Insti... False False ... False False
Rasmussen... False False ... False False
National ... False False ... False False
Bay Area ... False False ... False False
Excel Lea... False False ... False False
- This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance, college_ugds compared against itself, as follows:
>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... True True ... True True
Universit... True True ... True True
Amridge U... True True ... True True
Universit... True True ... True True
Alabama S... True True ... True True
- At first glance, all the values appear to be equal, as you would expect. However, using the .all method to determine if each column contains only True values yields an unexpected result:
>>> college_self_compare.all()
UGDS_WHITE False
UGDS_BLACK False
UGDS_HISP False
UGDS_ASIAN False
UGDS_AIAN False
UGDS_NHPI False
UGDS_2MOR False
UGDS_NRA False
UGDS_UNKN False
dtype: bool
- This happens because missing values do not compare as equal to one another. If you tried to count missing values by using the equals operator and summing the resulting Boolean columns, you would get zero for each one:
>>> (college_ugds == np.nan).sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
- Instead of using == to find missing numbers, use the .isna method:
>>> college_ugds.isna().sum()
UGDS_WHITE 661
UGDS_BLACK 661
UGDS_HISP 661
UGDS_ASIAN 661
UGDS_AIAN 661
UGDS_NHPI 661
UGDS_2MOR 661
UGDS_NRA 661
UGDS_UNKN 661
dtype: int64
- The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the equivalent of ==):
>>> college_ugds.equals(college_ugds)
True
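The steps above can be reproduced without the college dataset. The following sketch uses a small, made-up DataFrame (the column names, index labels, and values are hypothetical) to contrast ==, .isna, and .equals on data that contains missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, 0.5, 0.25]},
    index=["x", "y", "z"],
)

# Element-by-element comparison: NaN positions come back as False.
elementwise = df == df
all_true_per_column = elementwise.all()

# Counting NaNs with == always gives zero; .isna gives the real count.
nan_count_by_eq = (df == np.nan).sum()
nan_count_by_isna = df.isna().sum()

# .equals treats NaNs in the same location as equal.
frames_equal = df.equals(df.copy())

print(all_true_per_column)
print(nan_count_by_eq)
print(nan_count_by_isna)
print(frames_equal)
```

With this data, neither column is entirely True under ==, the == count of NaNs is zero, .isna finds one missing value per column, and .equals returns True.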
How it works...
Step 1 compares a DataFrame to a scalar value while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation first checks whether the DataFrames have identically labeled indexes and thus the same number of elements. If they do not, the comparison raises a ValueError instead of returning a result.
Step 3 shows that no column consists entirely of True values, even though the DataFrame was compared against itself. Step 4 further shows the non-equivalence of np.nan and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a Boolean scalar value.
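To illustrate the label check described for step 2, here is a small sketch with two hypothetical DataFrames whose index labels differ. It assumes the behavior of recent pandas versions, where == on differently labeled DataFrames raises a ValueError, whereas .equals simply returns False:

```python
import pandas as pd

# Two made-up frames with the same values but different index labels.
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"a": [1, 2]}, index=["x", "z"])

# The == operator refuses to compare differently labeled DataFrames.
try:
    _ = left == right
    raised_value_error = False
except ValueError:
    raised_value_error = True

# .equals does not raise; it reports that the frames are not equal.
frames_equal = left.equals(right)

print(raised_value_error, frames_equal)
```

This difference matters when writing comparison logic: code built on == needs the labels aligned first, while .equals can be called on any pair of DataFrames.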
There's more...
All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equals (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:
>>> college_ugds.eq(0.0019) # same as college_ugds == .0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
Amridge U... False False ... False False
Universit... False False ... False False
Alabama S... False False ... False False
... ... ... ... ... ...
SAE Insti... False False ... False False
Rasmussen... False False ... False False
National ... False False ... False False
Bay Area ... False False ... False False
Excel Lea... False False ... False False
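One example of the extra functionality that the method form offers is the axis parameter of .eq. The sketch below uses made-up data (the names df and targets are hypothetical) to compare every column of a DataFrame against a Series that is aligned on the row labels rather than on the column labels:

```python
import pandas as pd

# Hypothetical frame and a Series of expected values, one per row.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1, 5, 3]},
    index=["x", "y", "z"],
)
targets = pd.Series([1, 2, 3], index=["x", "y", "z"])

# axis="index" matches the Series labels against the row labels,
# so each column is compared element-by-element with targets.
matches_by_row = df.eq(targets, axis="index")

print(matches_by_row)
```

Here column a matches targets in every row, while column b differs in row y. The operator form has no way to pass axis, so this kind of row-wise alignment needs the method.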
Inside the pandas.testing sub-package, a function exists that developers should use when creating unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal. It returns None if the two DataFrames are equal:
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
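The failing case can be sketched in the same way. The example below uses small, made-up DataFrames (the names expected, same, and different are hypothetical) to show that assert_frame_equal passes when NaNs sit in the same locations and raises an AssertionError when the values differ:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames: two identical (including the NaN) and one different.
expected = pd.DataFrame({"a": [1.0, np.nan]})
same = pd.DataFrame({"a": [1.0, np.nan]})
different = pd.DataFrame({"a": [1.0, 2.0]})

# Passes silently: NaNs in the same position are treated as equal.
passed_result = assert_frame_equal(expected, same)

# Raises AssertionError because the second row differs.
try:
    assert_frame_equal(expected, different)
    assertion_raised = False
except AssertionError:
    assertion_raised = True

print(passed_result, assertion_raised)
```

In a test suite, the call is normally left bare so that the AssertionError fails the test; the try/except here only makes the outcome visible in a script.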
Unit tests are an important part of software development and help ensure that code runs correctly. pandas itself contains many thousands of unit tests for this purpose. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).