Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Pandas 1.x Cookbook
Pandas 1.x Cookbook

Pandas 1.x Cookbook: Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python , Second Edition

Arrow left icon
Profile Icon Matthew Harrison Profile Icon Theodore Petrou
Arrow right icon
€59.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5 (28 Ratings)
Paperback Feb 2020 626 pages 2nd Edition
eBook
€8.99 €47.99
Paperback
€59.99
Subscription
Free Trial
Renews at €18.99p/m
Arrow left icon
Profile Icon Matthew Harrison Profile Icon Theodore Petrou
Arrow right icon
€59.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5 (28 Ratings)
Paperback Feb 2020 626 pages 2nd Edition
eBook
€8.99 €47.99
Paperback
€59.99
Subscription
Free Trial
Renews at €18.99p/m
eBook
€8.99 €47.99
Paperback
€59.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Pandas 1.x Cookbook

Introduction

This chapter covers many fundamental operations of the DataFrame. Many of the recipes will be similar to those in Chapter 1, Pandas Foundations, which primarily covered operations on a Series.

Selecting multiple DataFrame columns

We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.

In this recipe, all the actor and director columns will be selected from the movie dataset.

How to do it...

  1. Read in the movie dataset, and pass in a list of the desired columns to the indexing operator:
    >>> import pandas as pd
    >>> import numpy as np
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movie_actor_director = movies[
    ...     [
    ...         "actor_1_name",
    ...         "actor_2_name",
    ...         "actor_3_name",
    ...         "director_name",
    ...     ]
    ... ]
    >>> movie_actor_director.head()
      actor_1_name actor_2_name actor_3_name director_name
    0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
    1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
    2  Christop...  Rory Kin...  Stephani...   Sam Mendes
    3    Tom Hardy  Christia...  Joseph G...  Christop...
    4  Doug Walker   Rob Walker          NaN  Doug Walker
    
  2. There are instances when one column of a DataFrame needs to be selected. Using the index operation can return either a Series or a DataFrame. If we pass in a list with a single item, we will get back a DataFrame. If we pass in just a string with the column name, we will get a Series back:
    >>> type(movies[["director_name"]])
    <class 'pandas.core.frame.DataFrame'>
    >>> type(movies["director_name"])
    <class 'pandas.core.series.Series'>
    
  3. We can also use .loc to pull out a column by name. Because this index operation requires that we pass in a row selector first, we will use a colon (:) to indicate a slice that selects all of the rows. This can also return either a DataFrame or a Series:
    >>> type(movies.loc[:, ["director_name"]])
    <class 'pandas.core.frame.DataFrame'>
    >>> type(movies.loc[:, "director_name"])
    <class 'pandas.core.series.Series'>
    

How it works...

The DataFrame index operator is very flexible and capable of accepting a number of different objects. If a string is passed, it will return a single-dimensional Series. If a list is passed to the indexing operator, it returns a DataFrame of all the columns in the list in the specified order.

Step 2 shows how to select a single column as a DataFrame and as a Series. Usually, a single column is selected with a string, resulting in a Series. When a DataFrame is desired, put the column name in a single-element list.

Step 3 shows how to use the loc attribute to pull out a Series or a DataFrame.

There's more...

Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names to a list variable first. The following code achieves the same result as step 1:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movie_actor_director = movies[cols]

One of the most common exceptions raised when working with pandas is KeyError. This error is mainly due to mistyping of a column or index name. This same error is raised whenever a multiple column selection is attempted without the use of a list:

>>> movies[
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
Traceback (most recent call last):
  ...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

Selecting columns with methods

Although column selection is usually done with the indexing operator, there are some DataFrame methods that facilitate their selection in an alternative manner. The .select_dtypes and .filter methods are two useful methods to do this.

If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.

How to do it...

  1. Read in the movie dataset. Shorten the column names for display. Use the .get_dtype_counts method to output the number of columns with each specific data type:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return (
    ...         str(col)
    ...         .replace("facebook_likes", "fb")
    ...         .replace("_for_reviews", "")
    ...     )
    >>> movies = movies.rename(columns=shorten)
    >>> movies.dtypes.value_counts()
    float64    13
    int64       3
    object     12
    dtype: int64
    
  2. Use the .select_dtypes method to select only the integer columns:
    >>> movies.select_dtypes(include="int").head()
       num_voted_users  cast_total_fb  movie_fb
    0           886204           4834     33000
    1           471220          48350         0
    2           275868          11700     85000
    3          1144337         106759    164000
    4                8            143         0
    
  3. If you would like to select all the numeric columns, you may pass the string number to the include parameter:
    >>> movies.select_dtypes(include="number").head()
       num_critics  duration  ...  aspect_ratio  movie_fb
    0        723.0     178.0  ...         1.78      33000
    1        302.0     169.0  ...         2.35          0
    2        602.0     148.0  ...         2.35      85000
    3        813.0     164.0  ...         2.35     164000
    4          NaN       NaN  ...          NaN          0
    
  4. If we wanted integer and string columns we could do the following:
    >>> movies.select_dtypes(include=["int", "object"]).head()
       color        direc/_name  ... conte/ating movie_fb
    0  Color      James Cameron  ...       PG-13    33000
    1  Color     Gore Verbinski  ...       PG-13        0
    2  Color         Sam Mendes  ...       PG-13    85000
    3  Color  Christopher Nolan  ...       PG-13   164000
    4    NaN        Doug Walker  ...         NaN        0
    
  5. To exclude only floating-point columns, do the following:
    >>> movies.select_dtypes(exclude="float").head()
       color director_name  ... content_rating movie_fb
    0  Color  James Ca...   ...        PG-13      33000
    1  Color  Gore Ver...   ...        PG-13          0
    2  Color   Sam Mendes   ...        PG-13      85000
    3  Color  Christop...   ...        PG-13     164000
    4    NaN  Doug Walker   ...          NaN          0
    
  6. An alternative method to select columns is with the .filter method. This method is flexible and searches column names (or index labels) based on which parameter is used. Here, we use the like parameter to search for all the Facebook columns or the names that contain the exact string, fb. The like parameter is checking for substrings in column names:
    >>> movies.filter(like="fb").head()
       director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
    0          0.0       855.0  ...       936.0     33000
    1        563.0      1000.0  ...      5000.0         0
    2          0.0       161.0  ...       393.0     85000
    3      22000.0     23000.0  ...     23000.0    164000
    4        131.0         NaN  ...        12.0         0
    
  7. The .filter method has more tricks (or parameters) up its sleeve. If you use the items parameters, you can pass in a list of column names:
    >>> cols = [
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ...     "director_name",
    ... ]
    >>> movies.filter(items=cols).head()
          actor_1_name  ...      director_name
    0      CCH Pounder  ...      James Cameron
    1      Johnny Depp  ...     Gore Verbinski
    2  Christoph Waltz  ...         Sam Mendes
    3        Tom Hardy  ...  Christopher Nolan
    4      Doug Walker  ...        Doug Walker
    
  8. The .filter method allows columns to be searched with regular expressions using the regex parameter. Here, we search for all columns that have a digit somewhere in their name:
    >>> movies.filter(regex=r"\d").head()
       actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
    0       855.0  Joel Dav...  ...    Wes Studi       936.0
    1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
    2       161.0  Rory Kin...  ...  Stephani...       393.0
    3     23000.0  Christia...  ...  Joseph G...     23000.0
    4         NaN   Rob Walker  ...          NaN        12.0
    

How it works...

Step 1 lists the frequencies of all the different data types. Alternatively, you may use the .dtypes attribute to get the exact data type for each column. The .select_dtypes method accepts either a list or single data type in its include or exclude parameters and returns a DataFrame with columns of just those given data types (or not those types if excluding columns). The list values may be either the string name of the data type or the actual Python object.

The .filter method selects columns by only inspecting the column names and not the actual data values. It has three mutually exclusive parameters: items, like, and regex, only one of which can be used at a time.

The like parameter takes a string and attempts to find all the column names that contain that exact string somewhere in the name. To gain more flexibility, you may use the regex parameter instead to select column names through a regular expression. This particular regular expression, r'\d', represents all digits from zero to nine and matches any string with at least a single digit in it.

The filter method comes with another parameter, items, which takes a list of exact column names. This is nearly an exact duplication of the index operation, except that a KeyError will not be raised if one of the strings does not match a column name. For instance, movies.filter(items=['actor_1_name', 'asdf']) runs without error and returns a single column DataFrame.

There's more...

One confusing aspect of .select_dtypes is its flexibility to take both strings and Python objects. The following list should clarify all the possible ways to select the many different column data types. There is no standard or preferred method of referring to data types in pandas, so it's good to be aware of both ways:

  • np.number, 'number' – Selects both integers and floats regardless of size
  • np.float64, np.float_, float, 'float64', 'float_', 'float' – Selects only 64-bit floats
  • np.float16, np.float32, np.float128, 'float16', 'float32', 'float128' – Respectively selects exactly 16, 32, and 128-bit floats
  • np.floating, 'floating' – Selects all floats regardless of size
  • np.int0, np.int64, np.int_, int, 'int0', 'int64', 'int_', 'int' – Selects only 64-bit integers
  • np.int8, np.int16, np.int32, 'int8', 'int16', 'int32' – Respectively selects exactly 8, 16, and 32-bit integers
  • np.integer, 'integer' – Selects all integers regardless of size
  • 'Int64' – Selects nullable integer; no NumPy equivalent
  • np.object, 'object', 'O' – Select all object data types
  • np.datetime64, 'datetime64', 'datetime' – All datetimes are 64 bits
  • np.timedelta64, 'timedelta64', 'timedelta' – All timedeltas are 64 bits
  • pd.Categorical, 'category' – Unique to pandas; no NumPy equivalent

Because all integers and floats default to 64 bits, you may select them by using the string 'int' or 'float' as you can see from the preceding bullet list. If you want to select all integers and floats regardless of their specific size, use the string 'number'.

Ordering column names

One of the first tasks to consider after initially importing a dataset as a DataFrame is to analyze the order of the columns. As humans we are used to reading languages from left to right, which impacts our interpretations of the data. It's far easier to find and interpret information when column order is given consideration.

There are no standardized set of rules that dictate how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines that you consistently follow. This is especially true if you work with a group of analysts who share lots of datasets.

The following is a guideline to order columns:

  • Classify each column as either categorical or continuous
  • Group common columns within the categorical and continuous columns
  • Place the most important groups of columns first with categorical columns before continuous ones

This recipe shows you how to order the columns with this guideline. There are many possible orderings that are sensible.

How to do it...

  1. Read in the movie dataset, and scan the data:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return col.replace("facebook_likes", "fb").replace(
    ...         "_for_reviews", ""
    ...     )
    >>> movies = movies.rename(columns=shorten)
    
  2. Output all the column names and scan for similar categorical and continuous columns:
    >>> movies.columns
    Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
           'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
           'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
           'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
           'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
           'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
           'movie_fb'],
          dtype='object')
    
  3. The columns don't appear to have any logical ordering to them. Organize the names sensibly into lists so that the guideline from the previous section is followed:
    >>> cat_core = [
    ...     "movie_title",
    ...     "title_year",
    ...     "content_rating",
    ...     "genres",
    ... ]
    >>> cat_people = [
    ...     "director_name",
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ... ]
    >>> cat_other = [
    ...     "color",
    ...     "country",
    ...     "language",
    ...     "plot_keywords",
    ...     "movie_imdb_link",
    ... ]
    >>> cont_fb = [
    ...     "director_fb",
    ...     "actor_1_fb",
    ...     "actor_2_fb",
    ...     "actor_3_fb",
    ...     "cast_total_fb",
    ...     "movie_fb",
    ... ]
    >>> cont_finance = ["budget", "gross"]
    >>> cont_num_reviews = [
    ...     "num_voted_users",
    ...     "num_user",
    ...     "num_critic",
    ... ]
    >>> cont_other = [
    ...     "imdb_score",
    ...     "duration",
    ...     "aspect_ratio",
    ...     "facenumber_in_poster",
    ... ]
    
  4. Concatenate all the lists together to get the final column order. Also, ensure that this list contains all the columns from the original:
    >>> new_col_order = (
    ...     cat_core
    ...     + cat_people
    ...     + cat_other
    ...     + cont_fb
    ...     + cont_finance
    ...     + cont_num_reviews
    ...     + cont_other
    ... )
    >>> set(movies.columns) == set(new_col_order)
    True
    
  5. Pass the list with the new column order to the indexing operator of the DataFrame to reorder the columns:
    >>> movies[new_col_order].head()
       movie_title  title_year  ... aspect_ratio facenumber_in_poster
    0       Avatar      2009.0  ...         1.78          0.0
    1  Pirates ...      2007.0  ...         2.35          0.0
    2      Spectre      2015.0  ...         2.35          1.0
    3  The Dark...      2012.0  ...         2.35          0.0
    4  Star War...         NaN  ...          NaN          0.0
    

How it works...

You can select a subset of columns from a DataFrame, with a list of specific column names. For instance, movies[['movie_title', 'director_name']] creates a new DataFrame with only the movie_title and director_name columns. Selecting columns by name is the default behavior of the index operator for a pandas DataFrame.

Step 3 neatly organizes all of the column names into separate lists based on their type (categorical or continuous) and by how similar their data is. The most important columns, such as the title of the movie, are placed first.

Step 4 concatenates all of the lists of column names and validates that this new list contains the same exact values as the original column names. Python sets are unordered and the equality statement checks whether each member of one set is a member of the other. Manually ordering columns in this recipe is susceptible to human error as it's easy to mistakenly forget a column in the new column list.

Step 5 completes the reordering by passing the new column order as a list to the indexing operator. This new order is now much more sensible than the original.

There's more...

There are alternative guidelines for ordering columns besides the suggestion mentioned earlier. Hadley Wickham's seminal paper on Tidy Data suggests placing the fixed variables first, followed by measured variables. As this data does not come from a controlled experiment, there is some flexibility in determining which variables are fixed and which ones are measured. Good candidates for measured variables are those that we would like to predict, such as gross, the budget, or the imdb_score. For instance, in this ordering, we can mix categorical and continuous variables. It might make more sense to place the column for the number of Facebook likes directly after the name of that actor. You can, of course, come up with your own guidelines for column order as the computational parts are unaffected by it.

Summarizing a DataFrame

In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.

In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.

How to do it...

  1. Read in the movie dataset, and examine the basic descriptive properties, .shape, .size, and .ndim, along with running the len function:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies.shape
    (4916, 28)
    >>> movies.size
    137648
    >>> movies.ndim
    2
    >>> len(movies)
    4916
    
  2. The .count method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:
    >>> movies.count()
    color                      4897
    director_name              4814
    num_critic_for_reviews     4867
    duration                   4901
    director_facebook_likes    4814
                               ... 
    title_year                 4810
    actor_2_facebook_likes     4903
    imdb_score                 4916
    aspect_ratio               4590
    movie_facebook_likes       4916
    Length: 28, dtype: int64
    
  3. The other methods that compute summary statistics, .min, .max, .mean, .median, and .std, return Series that have the column names of the numeric columns in the index and their aggregations as the values:
    >>> movies.min()
    num_critic_for_reviews        1.00
    duration                      7.00
    director_facebook_likes       0.00
    actor_3_facebook_likes        0.00
    actor_1_facebook_likes        0.00
                                ...   
    title_year                 1916.00
    actor_2_facebook_likes        0.00
    imdb_score                    1.60
    aspect_ratio                  1.18
    movie_facebook_likes          0.00
    Length: 16, dtype: float64
    
  4. The .describe method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using .T as I can usually fit more information on the screen that way:
    >>> movies.describe().T
                   count         mean  ...       75%       max
    num_criti...  4867.0   137.988905  ...    191.00     813.0
    duration      4901.0   107.090798  ...    118.00     511.0
    director_...  4814.0   691.014541  ...    189.75   23000.0
    actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2011.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
    imdb_score    4916.0     6.437429  ...      7.20       9.5
    aspect_ratio  4590.0     2.222349  ...      2.35      16.0
    movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0
    
  5. It is possible to specify exact quantiles in the .describe method using the percentiles parameter:
    >>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
                   count         mean  ...       99%       max
    num_criti...  4867.0   137.988905  ...    546.68     813.0
    duration      4901.0   107.090798  ...    189.00     511.0
    director_...  4814.0   691.014541  ...  16000.00   23000.0
    actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2016.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
    imdb_score    4916.0     6.437429  ...      8.50       9.5
    aspect_ratio  4590.0     2.222349  ...      4.00      16.0
    movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0
    

How it works...

Step 1 gives basic information on the size of the dataset. The .shape attribute returns a tuple with the number of rows and columns. The .size attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len function, it returns the number of rows.

The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.

If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.

Note that numeric columns have missing values but have a result returned by .describe. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna parameter to False. This will cause pandas to return NaN for all these aggregation methods if there exists at least a single missing value.

The .describe method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles parameter. See the Developing a data analysis routine recipe for more on the .describe method.

There's more...

To see how the .skipna parameter affects the outcome, we can set its value to False and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:

>>> movies.min(skipna=False)
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
actor_1_facebook_likes     NaN
                          ... 
title_year                 NaN
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64

Chaining DataFrame methods

The Chaining Series methods recipe in Chapter 1, Pandas Foundations, showcased several examples of chaining Series methods together. All the method chains in this chapter will begin from a DataFrame. One of the keys to method chaining is to know the exact object being returned during each step of the chain. In pandas, this will nearly always be a DataFrame, Series, or scalar value.

In this recipe, we count all the missing values in each column of the movie dataset.

How to do it...

  1. We will use the .isnull method to get a count of the missing values. This method will change every value to a Boolean, indicating whether it is missing:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return col.replace("facebook_likes", "fb").replace(
    ...         "_for_reviews", ""
    ...     )
    >>> movies = movies.rename(columns=shorten)
    >>> movies.isnull().head()
       color  director_name  ...  aspect_ratio  movie_fb
    0  False        False    ...        False      False
    1  False        False    ...        False      False
    2  False        False    ...        False      False
    3  False        False    ...        False      False
    4   True        False    ...         True      False
    
  2. We will chain the .sum method that interprets True and False as 1 and 0, respectively. Because this is a reduction method, it aggregates the results into a Series:
    >>> (movies.isnull().sum().head())
    color             19
    director_name    102
    num_critic        49
    duration          15
    director_fb      102
    dtype: int64
    
  3. We can go one step further and take the sum of this Series and return the count of the total number of missing values in the entire DataFrame as a scalar value:
    >>> movies.isnull().sum().sum()
    2654
    
  4. A way to determine whether there are any missing values in the DataFrame is to use the .any method twice in succession:
    >>> movies.isnull().any().any()
    True
    

How it works...

The .isnull method returns a DataFrame the same size as the calling DataFrame but with all values transformed to Booleans. See the counts of the following data types to verify this:

>>> movies.isnull().dtypes.value_counts()
bool    28
dtype: int64

In Python, Booleans evaluate to 0 and 1, and this makes it possible to sum them by column, as done in step 2. The resulting Series itself also has a .sum method, which gets us the grand total of missing values in the DataFrame.

In step 4, the .any method on a DataFrame returns a Series of Booleans indicating if there exists at least one True for each column. The .any method is chained again on this resulting Series of Booleans to determine if any of the columns have missing values. If step 4 evaluates as True, then there is at least one missing value in the entire DataFrame.

There's more...

Most of the columns in the movie dataset with the object data type contain missing values. By default, aggregation methods (.min, .max, and .sum), do not return anything for object columns. as seen in the following code snippet, which selects three object columns and attempts to find the maximum value of each one:

>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)

To force pandas to return something for each column, we must fill in the missing values. Here, we choose an empty string:

>>> movies.select_dtypes(["object"]).fillna("").max()
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

For purposes of readability, method chains are often written as one method call per line surrounded by parentheses. This makes it easier to read and insert comments on what is returned at each step of the chain, or comment out lines to debug what is happening:

>>> (movies.select_dtypes(["object"]).fillna("").max())
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

DataFrame operations

A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.

When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5 to each value of the DataFrame raises a TypeError as integers cannot be added to strings:

>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "int") to str

To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter method:

>>> colleges = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round method.

How to do it...

  1. pandas does bankers rounding, numbers that are exactly halfway between either side to the even side. Look at what happens to the UGDS_BLACK row of this series when we round it to two decimal places:
    >>> name = "Northwest-Shoals Community College"
    >>> college_ugds.loc[name]
    UGDS_WHITE    0.7912
    UGDS_BLACK    0.1250
    UGDS_HISP     0.0339
    UGDS_ASIAN    0.0036
    UGDS_AIAN     0.0088
    UGDS_NHPI     0.0006
    UGDS_2MOR     0.0012
    UGDS_NRA      0.0033
    UGDS_UNKN     0.0324
    Name: Northwest-Shoals Community College, dtype: float64
    >>> college_ugds.loc[name].round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.12
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
    

    If we add .0001 before rounding, it changes to rounding up:

    >>> (college_ugds.loc[name] + 0.0001).round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.13
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
    
  2. Let's do this to the DataFrame. To begin our rounding adventure with operators, we will first add .00501 to each value of college_ugds:
    >>> college_ugds + 0.00501
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
    Universit...     0.59721     0.26501  ...   0.02291    0.01501
    Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
    Universit...     0.70381     0.13051  ...   0.03821    0.04001
    Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  3. Use the floor division operator, //, to round down to the nearest whole number percentage:
    >>> (college_ugds + 0.00501) // 0.01
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...         3.0        94.0  ...       1.0        1.0
    Universit...        59.0        26.0  ...       2.0        1.0
    Amridge U...        30.0        42.0  ...       0.0       27.0
    Universit...        70.0        13.0  ...       3.0        4.0
    Alabama S...         2.0        92.0  ...       2.0        1.0
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  4. To complete the rounding exercise, divide by 100:
    >>> college_ugds_op_round =(
    ...     (college_ugds + 0.00501) // 0.01 / 100
    ... )
    >>> college_ugds_op_round.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...        0.03        0.94  ...      0.01       0.01
    Universit...        0.59        0.26  ...      0.02       0.01
    Amridge U...        0.30        0.42  ...      0.00       0.27
    Universit...        0.70        0.13  ...      0.03       0.04
    Alabama S...        0.02        0.92  ...      0.02       0.01
    
  5. Now use the round DataFrame method to do the rounding automatically for us. Due to bankers rounding, we add a small fraction before rounding:
    >>> college_ugds_round = (college_ugds + 0.00001).round(2)
    >>> college_ugds_round
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...        0.03        0.94  ...      0.01       0.01
    Universit...        0.59        0.26  ...      0.02       0.01
    Amridge U...        0.30        0.42  ...      0.00       0.27
    Universit...        0.70        0.13  ...      0.03       0.04
    Alabama S...        0.02        0.92  ...      0.02       0.01
    ...                  ...         ...  ...       ...        ....
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  6. Use the equals DataFrame method to test the equality of two DataFrames:
    >>> college_ugds_op_round.equals(college_ugds_round)
    True
    

How it works...

Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.

Mathematically, adding .005 should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:

>>> 0.045 + 0.005
0.049999999999999996

There is an extra .00001 added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.

Step 3 applies the floor division operator, //, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100 and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.

In step 5, we reproduce the previous steps with the round method. Before we can do this, we must again add an extra .00001 to each DataFrame value for a different reason from step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. The bankers rounding (or ties to even http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).

It is necessary here to round up so that both DataFrame values are equal. The .equals method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.

There's more...

Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:

>>> college2 = (
...     college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True

Comparing missing values

pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object and has interesting mathematical properties. For instance, it is not equal to itself. Even Python's None object evaluates as True when compared to itself:

>>> np.nan == np.nan
False
>>> None == None
True

All other comparisons against np.nan also return False, except not equal to (!=):

>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True

Getting ready

Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.

As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

How to do it...

  1. To get an idea of how the equals operator works, let's compare each element to a scalar value:
    >>> college_ugds == 0.0019
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False       False  ...     False      False
    Universit...       False       False  ...     False      False
    Amridge U...       False       False  ...     False      False
    Universit...       False       False  ...     False      False
    Alabama S...       False       False  ...     False      False
    ...                  ...         ...  ...       ...        ...
    SAE Insti...       False       False  ...     False      False
    Rasmussen...       False       False  ...     False      False
    National ...       False       False  ...     False      False
    Bay Area ...       False       False  ...     False      False
    Excel Lea...       False       False  ...     False      False
    
  2. This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance, college_ugds compared against itself, as follows:
    >>> college_self_compare = college_ugds == college_ugds
    >>> college_self_compare.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Alabama A...        True        True  ...      True       True
    Universit...        True        True  ...      True       True
    Amridge U...        True        True  ...      True       True
    Universit...        True        True  ...      True       True
    Alabama S...        True        True  ...      True       True
    
  3. At first glance, all the values appear to be equal, as you would expect. However, using the .all method to determine if each column contains only True values yields an unexpected result:
    >>> college_self_compare.all()
    UGDS_WHITE    False
    UGDS_BLACK    False
    UGDS_HISP     False
    UGDS_ASIAN    False
    UGDS_AIAN     False
    UGDS_NHPI     False
    UGDS_2MOR     False
    UGDS_NRA      False
    UGDS_UNKN     False
    dtype: bool
    
  4. This happens because missing values do not compare equally with one another. If you tried to count missing values using the equal operator and summing up the Boolean columns, you would get zero for each one:
    >>> (college_ugds == np.nan).sum()
    UGDS_WHITE    0
    UGDS_BLACK    0
    UGDS_HISP     0
    UGDS_ASIAN    0
    UGDS_AIAN     0
    UGDS_NHPI     0
    UGDS_2MOR     0
    UGDS_NRA      0
    UGDS_UNKN     0
    dtype: int64
    
  5. Instead of using == to find missing numbers, use the .isna method:
    >>> college_ugds.isna().sum()
    UGDS_WHITE    661
    UGDS_BLACK    661
    UGDS_HISP     661
    UGDS_ASIAN    661
    UGDS_AIAN     661
    UGDS_NHPI     661
    UGDS_2MOR     661
    UGDS_NRA      661
    UGDS_UNKN     661
    dtype: int64
    
  6. The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the equivalent of ==):
    >>> college_ugds.equals(college_ugds)
    True
    

How it works...

Step 1 compares a DataFrame to a scalar value while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation is checking whether the DataFrames have identically labeled indexes and thus the same number of elements. The operation will fail if this isn't the case.

Step 3 verifies that none of the columns in the DataFrames are equivalent to each other. Step 4 further shows the non-equivalence of np.nan and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a Boolean scalar value.

There's more...

All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equals (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:

>>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False

Inside the pandas.testing sub-package, a function exists that developers should use when creating unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal. It returns None if the two DataFrames are equal:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True

Unit tests are a very important part of software development and ensure that the code is running correctly. pandas contains many thousands of unit tests that help ensure that it is running properly. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).

Transposing the direction of a DataFrame operation

Many DataFrame methods have an axis parameter. This parameter controls the direction in which the operation takes place. Axis parameters can be 'index' (or 0) or 'columns' (or 1). I prefer the string versions are they are more explicit and tend to make the code easier to read.

Nearly all DataFrame methods default the axis parameter to 0, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.

How to do it...

  1. Read in the college dataset; the columns that begin with UGDS represent the percentage of the undergraduate students of a particular race. Use the filter method to select these columns:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    >>> college_ugds.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
    Universit...      0.5922      0.2600  ...    0.0179     0.0100
    Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
    Universit...      0.6988      0.1255  ...    0.0332     0.0350
    Alabama S...      0.0158      0.9208  ...    0.0243     0.0137
    
  2. Now that the DataFrame contains homogenous column data, operations can be sensibly done both vertically and horizontally. The .count method returns the number of non-missing values. By default, its axis parameter is set to 0:
    >>> college_ugds.count()
    UGDS_WHITE    6874
    UGDS_BLACK    6874
    UGDS_HISP     6874
    UGDS_ASIAN    6874
    UGDS_AIAN     6874
    UGDS_NHPI     6874
    UGDS_2MOR     6874
    UGDS_NRA      6874
    UGDS_UNKN     6874
    dtype: int64
    

    The axis parameter is almost always set to 0. So, step 2 is equivalent to both college_ugds.count(axis=0) and college_ugds.count(axis='index').

  3. Changing the axis parameter to 'columns' changes the direction of the operation so that we get back a count of non-missing items in each row:
    >>> college_ugds.count(axis="columns").head()
    INSTNM
    Alabama A & M University               9
    University of Alabama at Birmingham    9
    Amridge University                     9
    University of Alabama in Huntsville    9
    Alabama State University               9
    dtype: int64
    
  4. Instead of counting non-missing values, we can sum all the values in each row. Each row of percentages should add up to 1. The .sum method may be used to verify this:
    >>> college_ugds.sum(axis="columns").head()
    INSTNM
    Alabama A & M University               1.0000
    University of Alabama at Birmingham    0.9999
    Amridge University                     1.0000
    University of Alabama in Huntsville    1.0000
    Alabama State University               1.0000
    dtype: float64
    
  5. To get an idea of the distribution of each column, the .median method can be used:
    >>> college_ugds.median(axis="index")
    UGDS_WHITE    0.55570
    UGDS_BLACK    0.10005
    UGDS_HISP     0.07140
    UGDS_ASIAN    0.01290
    UGDS_AIAN     0.00260
    UGDS_NHPI     0.00000
    UGDS_2MOR     0.01750
    UGDS_NRA      0.00000
    UGDS_UNKN     0.01430
    dtype: float64
    

How it works...

The direction of operation on the axis is one of the more confusing aspects of pandas. Many pandas users have difficulty remembering the meaning of the axis parameter. I remember them by reminding myself that a Series only has one axis, the index (or 0). A DataFrame also has an index (axis 0) and columns (axis 1).

There's more...

The .cumsum method with axis=1 accumulates the race percentages across each row. It gives a slightly different view of the data. For example, it is very easy to see the exact percentage of white and black students for each school:

>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9686  ...    0.9862     1.0000
Universit...      0.5922      0.8522  ...    0.9899     0.9999
Amridge U...      0.2990      0.7182  ...    0.7285     1.0000
Universit...      0.6988      0.8243  ...    0.9650     1.0000
Alabama S...      0.0158      0.9366  ...    0.9863     1.0000

Determining college campus diversity

Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:

>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                   Diversity Index
School
Rutgers University--Newark  Newark, NJ                        0.76
Andrews University  Berrien Springs, MI                       0.74
Stanford University  Stanford, CA                             0.74
University of Houston  Houston, TX                            0.74
University of Nevada--Las Vegas  Las Vegas, NV                0.74
University of San Francisco  San Francisco, CA                0.74
San Francisco State University  San Francisco, CA             0.73
University of Illinois--Chicago  Chicago, IL                  0.73
New Jersey Institute of Technology  Newark, NJ                0.72
Texas Woman's University  Denton, TX                          0.72

Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the count of the number of races having greater than 15% of the student population.

How to do it...

  1. Read in the college dataset, and filter for just the undergraduate race columns:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    
  2. Many of these colleges have missing values for all their race columns. We can count all the missing values for each row and sort the resulting Series from the highest to lowest. This will reveal the colleges that have missing values:
    >>> (
    ...     college_ugds.isnull()
    ...     .sum(axis="columns")
    ...     .sort_values(ascending=False)
    ...     .head()
    ... )
    INSTNM
    Excel Learning Center-San Antonio South         9
    Philadelphia College of Osteopathic Medicine    9
    Assemblies of God Theological Seminary          9
    Episcopal Divinity School                       9
    Phillips Graduate Institute                     9
    dtype: int64
    
  3. Now that we have seen the colleges that are missing all their race columns, we can use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:
    >>> college_ugds = college_ugds.dropna(how="all")
    >>>; college_ugds.isnull().sum()
    UGDS_WHITE    0
    UGDS_BLACK    0
    UGDS_HISP     0
    UGDS_ASIAN    0
    UGDS_AIAN     0
    UGDS_NHPI     0
    UGDS_2MOR     0
    UGDS_NRA      0
    UGDS_UNKN     0
    dtype: int64
    
  4. There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method, .ge, to return a DataFrame with a Boolean value for each cell:
    >>> college_ugds.ge(0.15)
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False        True  ...     False      False
    Universit...        True        True  ...     False      False
    Amridge U...        True        True  ...     False       True
    Universit...        True       False  ...     False      False
    Alabama S...       False        True  ...     False      False
    ...                  ...         ...  ...       ...        ...
    Hollywood...        True        True  ...     False      False
    Hollywood...       False        True  ...     False      False
    Coachella...        True       False  ...     False      False
    Dewey Uni...       False       False  ...     False      False
    Coastal P...        True        True  ...     False      False
    
  5. From here, we can use the .sum method to count the True values for each college. Notice that a Series is returned:
    >>> diversity_metric = college_ugds.ge(0.15).sum(
    ...     axis="columns"
    ... )
    >>> diversity_metric.head()
    INSTNM
    Alabama A & M University               1
    University of Alabama at Birmingham    2
    Amridge University                     3
    University of Alabama in Huntsville    1
    Alabama State University               1
    dtype: int64
    
  6. To get an idea of the distribution, we will use the .value_counts method on this Series:
    >>> diversity_metric.value_counts()
    1    3042
    2    2884
    3     876
    4      63
    0       7
    5       2
    dtype: int64
    
  7. Amazingly, two schools have more than 15% in five different race categories. Let's sort the diversity_metric Series to find out which ones they are:
    >>> diversity_metric.sort_values(ascending=False).head()
    INSTNM
    Regency Beauty Institute-Austin          5
    Central Texas Beauty College-Temple      5
    Sullivan and Cogliano Training Center    4
    Ambria College of Nursing                4
    Berkeley College-New York                4
    dtype: int64
    
  8. It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools. We will use .loc to select rows based on the index label:
    >>> college_ugds.loc[
    ...     [
    ...         "Regency Beauty Institute-Austin",
    ...         "Central Texas Beauty College-Temple",
    ...     ]
    ... ]
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Regency B...      0.1867      0.2133  ...       0.0     0.2667
    Central T...      0.1616      0.2323  ...       0.0     0.1515
    
  9. It appears that several categories were aggregated into the unknown and two or more races column. Regardless of this, they both appear to be quite diverse. We can see how the top five US News schools fared with this basic diversity metric:
    >>> us_news_top = [
    ...     "Rutgers University-Newark",
    ...     "Andrews University",
    ...     "Stanford University",
    ...     "University of Houston",
    ...     "University of Nevada-Las Vegas",
    ... ]
    >>> diversity_metric.loc[us_news_top]
    INSTNM
    Rutgers University-Newark         4
    Andrews University                3
    Stanford University               3
    University of Houston             3
    University of Nevada-Las Vegas    3
    dtype: int64
    

How it works...

Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.

In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values after the dropna method was performed. If there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0.

Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'.

The .value_counts method is used in step 6 to produce a distribution of our diversity metric. It is quite rare for schools to have three races with 15% or more of the undergraduate student population. Step 7 and step 8 find two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.

Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.

There's more...

Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:

>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

We can also determine if any school has all nine race categories exceeding 1%:

>>> (college_ugds > 0.01).all(axis=1).any()
True
Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • This is the first book on pandas 1.x
  • Practical, easy to implement recipes for quick solutions to common problems in data using pandas
  • Master the fundamentals of pandas to quickly begin exploring any dataset

Description

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

Who is this book for?

This book is for Python developers, data scientists, engineers, and analysts. Pandas is the ideal tool for manipulating structured data with Python and this book provides ample instruction and examples. Not only does it cover the basics required to be proficient, but it goes into the details of idiomatic pandas.

What you will learn

  • Master data exploration in pandas through dozens of practice problems
  • Group, aggregate, transform, reshape, and filter data
  • Merge data from different sources through pandas SQL-like operations
  • Create visualizations via pandas hooks to matplotlib and seaborn
  • Use pandas, time series functionality to perform powerful analyses
  • Import, clean, and prepare real-world datasets for machine learning
  • Create workflows for processing big data that doesn't fit in memory
Estimated delivery fee Deliver to Ireland

Premium delivery 7 - 10 business days

€23.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Feb 27, 2020
Length: 626 pages
Edition : 2nd
Language : English
ISBN-13 : 9781839213106
Category :
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Ireland

Premium delivery 7 - 10 business days

€23.95
(Includes tracking information)

Product Details

Publication date : Feb 27, 2020
Length: 626 pages
Edition : 2nd
Language : English
ISBN-13 : 9781839213106
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 145.97
Hands-On Data Analysis with Pandas
€48.99
Pandas 1.x Cookbook
€59.99
Python Data Cleaning Cookbook
€36.99
Total 145.97 Stars icon
Banner background image

Table of Contents

16 Chapters
Pandas Foundations Chevron down icon Chevron up icon
Essential DataFrame Operations Chevron down icon Chevron up icon
Creating and Persisting DataFrames Chevron down icon Chevron up icon
Beginning Data Analysis Chevron down icon Chevron up icon
Exploratory Data Analysis Chevron down icon Chevron up icon
Selecting Subsets of Data Chevron down icon Chevron up icon
Filtering Rows Chevron down icon Chevron up icon
Index Alignment Chevron down icon Chevron up icon
Grouping for Aggregation, Filtration, and Transformation Chevron down icon Chevron up icon
Restructuring Data into a Tidy Form Chevron down icon Chevron up icon
Combining Pandas Objects Chevron down icon Chevron up icon
Time Series Analysis Chevron down icon Chevron up icon
Visualization with Matplotlib, Pandas, and Seaborn Chevron down icon Chevron up icon
Debugging and Testing Pandas Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5
(28 Ratings)
5 star 75%
4 star 7.1%
3 star 7.1%
2 star 10.7%
1 star 0%
Filter icon Filter
Top Reviews

Filter reviews by




Ivan and Zorro Mar 28, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is great for both beginners and someone already familiar with pandas. And if you are that someone then you know this library is so big that you never quite feel like you know everything. For instance, how many people use the pandas testing module? What’s that you say , you didn’t know pandas had a testing module? Another highlight is the EDA chapter which also includes some nice coverage of Seaborn, in addition to the chapter devoted entirely to visualization libraries. This will quickly become your goto reference for pandas and working with data.If you are looking for new and interesting ways to leverage pandas and make your work flow more efficient this book will help. It does start at the beginning, but I would be surprised if you didn’t pick up a nugget right in the first chapter. It says pandas but actually the authors spend a lot of time with NumPy although they don’t always distinguish between the two – which does make sense since pandas is for the most part a wrapper around many libraries, NumPy being chief among them.Personally I would have used fewer pages by showing less output, but that is one of the few complaints I have. I just don’t think we really need to see 10 lines of output for every step. Also I nice clear link to github for cloning the data would have been nice.
Amazon Verified review Amazon
Nathan Weston Apr 01, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This is a huge book that's packed with details on both basic and advanced usage of Pandas. Each recipe is mostly self-contained, so you can jump straight to any topic you need to know more about without hunting around the book for all the pieces. This does mean there's some amount of repetition if you're reading start to finish, but I think it's a worthwhile trade-off. The accompanying Jupyter Notebook examples are also a nice addition.
Amazon Verified review Amazon
Prayson Daniel Mar 16, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I thought I knew Pandas until I read the first edition three years ago. There are so many new techniques I learnt and relearned by going through this revised version. Upgrading recipes to Python 3.6+ syntax has refined the simplicity and readability of the codes.Two things that stands out and makes me love this book are: (1) the accompanying codes in form of Jupyter Notebooks: Programming is learnt by doing and not just by reading,—having codes for each recipe in a clear and interactive is priceless, (2) the scope: this book goes beyond Pandas syntax. The application and visualisation are thoroughly covered.The only challenge, for beginners, I believe, will be a smooth setting up correct environment to experiment with these recipes. Perhaps creating a Dockerfile/docker-compose file which will build the entire step-up, Python version, libraries, spinning up Jupyter, and attach volumes with a single command, as part of accompanying codes, would make the whole experience divine. In this way, the only requirement needed to enjoy this book is the installation of Docker.
Amazon Verified review Amazon
Tripp Parker Apr 10, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The book is well thought out, provides an excellent framework for Pandas. If you've read other books that are by Matt Harrison, then you'll appreciate the quality and context of this book.
Amazon Verified review Amazon
Bookworm_Live Sep 17, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
A really wonderful no-nonsense book. The author gives insight into how Pandas can be used in real world with all practical scenarios and usages. Highly Recommend.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela