Chaining DataFrame methods
The Chaining Series methods recipe in Chapter 1, Pandas Foundations, showcased several examples of chaining Series methods together. All the method chains in this chapter will begin from a DataFrame. One of the keys to method chaining is to know the exact object being returned during each step of the chain. In pandas, this will nearly always be a DataFrame, Series, or scalar value.
In this recipe, we count all the missing values in each column of the movie dataset.
How to do it...
- We will use the
.isnull
method to get a count of the missing values. This method will change every value to a Boolean, indicating whether it is missing:>>> movies = pd.read_csv("data/movie.csv") >>> def shorten(col): ... return col.replace("facebook_likes", "fb").replace( ... "_for_reviews", "" ... ) >>> movies = movies.rename(columns=shorten) >>> movies.isnull().head() color director_name ... aspect_ratio movie_fb 0 False False ... False False 1 False False ... False False 2 False False ... False False 3 False False ... False False 4 True False ... True False
- We will chain the
.sum
method that interpretsTrue
andFalse
as1
and0
, respectively. Because this is a reduction method, it aggregates the results into a Series:>>> (movies.isnull().sum().head()) color 19 director_name 102 num_critic 49 duration 15 director_fb 102 dtype: int64
- We can go one step further and take the sum of this Series and return the count of the total number of missing values in the entire DataFrame as a scalar value:
>>> movies.isnull().sum().sum() 2654
- A way to determine whether there are any missing values in the DataFrame is to use the
.any
method twice in succession:>>> movies.isnull().any().any() True
How it works...
The .isnull
method returns a DataFrame the same size as the calling DataFrame but with all values transformed to Booleans. See the counts of the following data types to verify this:
>>> movies.isnull().dtypes.value_counts()
bool 28
dtype: int64
In Python, Booleans evaluate to 0 and 1, and this makes it possible to sum them by column, as done in step 2. The resulting Series itself also has a .sum
method, which gets us the grand total of missing values in the DataFrame.
In step 4, the .any
method on a DataFrame returns a Series of Booleans indicating if there exists at least one True
for each column. The .any
method is chained again on this resulting Series of Booleans to determine if any of the columns have missing values. If step 4 evaluates as True
, then there is at least one missing value in the entire DataFrame.
There's more...
Most of the columns in the movie dataset with the object
data type contain missing values. By default, aggregation methods (.min
, .max
, and .sum
), do not return anything for object
columns. as seen in the following code snippet, which selects three object
columns and attempts to find the maximum value of each one:
>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)
To force pandas to return something for each column, we must fill in the missing values. Here, we choose an empty string:
>>> movies.select_dtypes(["object"]).fillna("").max()
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
For purposes of readability, method chains are often written as one method call per line surrounded by parentheses. This makes it easier to read and insert comments on what is returned at each step of the chain, or comment out lines to debug what is happening:
>>> (movies.select_dtypes(["object"]).fillna("").max())
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object