You're reading from Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Product type Paperback

Published in Feb 2020

Publisher Packt

ISBN-13 9781839213106

Length 626 pages

Edition 2nd Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Theodore Petrou

Matthew Harrison

View More author details

Table of Contents (17) Chapters

Preface

1. Pandas Foundations

2. Essential DataFrame Operations FREE CHAPTER

3. Creating and Persisting DataFrames

4. Beginning Data Analysis

5. Exploratory Data Analysis

6. Selecting Subsets of Data

7. Filtering Rows

8. Index Alignment

9. Grouping for Aggregation, Filtration, and Transformation

10. Restructuring Data into a Tidy Form

11. Combining Pandas Objects

12. Time Series Analysis

13. Visualization with Matplotlib, Pandas, and Seaborn

14. Debugging and Testing Pandas

15. Other Books You May Enjoy

16. Index

Summarizing a DataFrame

In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.

In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.

How to do it...

Read in the movie dataset, and examine the basic descriptive properties, .shape, .size, and .ndim, along with running the len function:

>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
>>> len(movies)
4916

The .count method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:

>>> movies.count()
color                      4897
director_name              4814
num_critic_for_reviews     4867
duration                   4901
director_facebook_likes    4814
                           ... 
title_year                 4810
actor_2_facebook_likes     4903
imdb_score                 4916
aspect_ratio               4590
movie_facebook_likes       4916
Length: 28, dtype: int64

The other methods that compute summary statistics, .min, .max, .mean, .median, and .std, return Series that have the column names of the numeric columns in the index and their aggregations as the values:

>>> movies.min()
num_critic_for_reviews        1.00
duration                      7.00
director_facebook_likes       0.00
actor_3_facebook_likes        0.00
actor_1_facebook_likes        0.00
                            ...   
title_year                 1916.00
actor_2_facebook_likes        0.00
imdb_score                    1.60
aspect_ratio                  1.18
movie_facebook_likes          0.00
Length: 16, dtype: float64

The .describe method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using .T as I can usually fit more information on the screen that way:

>>> movies.describe().T
               count         mean  ...       75%       max
num_criti...  4867.0   137.988905  ...    191.00     813.0
duration      4901.0   107.090798  ...    118.00     511.0
director_...  4814.0   691.014541  ...    189.75   23000.0
actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2011.00    2016.0
actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
imdb_score    4916.0     6.437429  ...      7.20       9.5
aspect_ratio  4590.0     2.222349  ...      2.35      16.0
movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0

It is possible to specify exact quantiles in the .describe method using the percentiles parameter:

>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
               count         mean  ...       99%       max
num_criti...  4867.0   137.988905  ...    546.68     813.0
duration      4901.0   107.090798  ...    189.00     511.0
director_...  4814.0   691.014541  ...  16000.00   23000.0
actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2016.00    2016.0
actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
imdb_score    4916.0     6.437429  ...      8.50       9.5
aspect_ratio  4590.0     2.222349  ...      4.00      16.0
movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0

How it works...

Step 1 gives basic information on the size of the dataset. The .shape attribute returns a tuple with the number of rows and columns. The .size attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len function, it returns the number of rows.

The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.

If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.

Note that numeric columns have missing values but have a result returned by .describe. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna parameter to False. This will cause pandas to return NaN for all these aggregation methods if there exists at least a single missing value.

The .describe method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles parameter. See the Developing a data analysis routine recipe for more on the .describe method.

There's more...

To see how the .skipna parameter affects the outcome, we can set its value to False and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:

>>> movies.min(skipna=False)
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
actor_1_facebook_likes     NaN
                          ... 
title_year                 NaN
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64