Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Pandas 1.x Cookbook

You're reading from   Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Arrow left icon
Product type Paperback
Published in Feb 2020
Publisher Packt
ISBN-13 9781839213106
Length 626 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Theodore Petrou Theodore Petrou
Author Profile Icon Theodore Petrou
Theodore Petrou
Matthew Harrison Matthew Harrison
Author Profile Icon Matthew Harrison
Matthew Harrison
Arrow right icon
View More author details
Toc

Table of Contents (17) Chapters Close

Preface 1. Pandas Foundations 2. Essential DataFrame Operations FREE CHAPTER 3. Creating and Persisting DataFrames 4. Beginning Data Analysis 5. Exploratory Data Analysis 6. Selecting Subsets of Data 7. Filtering Rows 8. Index Alignment 9. Grouping for Aggregation, Filtration, and Transformation 10. Restructuring Data into a Tidy Form 11. Combining Pandas Objects 12. Time Series Analysis 13. Visualization with Matplotlib, Pandas, and Seaborn 14. Debugging and Testing Pandas 15. Other Books You May Enjoy
16. Index

Determining college campus diversity

Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:

>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                   Diversity Index
School
Rutgers University--Newark  Newark, NJ                        0.76
Andrews University  Berrien Springs, MI                       0.74
Stanford University  Stanford, CA                             0.74
University of Houston  Houston, TX                            0.74
University of Nevada--Las Vegas  Las Vegas, NV                0.74
University of San Francisco  San Francisco, CA                0.74
San Francisco State University  San Francisco, CA             0.73
University of Illinois--Chicago  Chicago, IL                  0.73
New Jersey Institute of Technology  Newark, NJ                0.72
Texas Woman's University  Denton, TX                          0.72

Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the count of the number of races having greater than 15% of the student population.

How to do it...

  1. Read in the college dataset, and filter for just the undergraduate race columns:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    
  2. Many of these colleges have missing values for all their race columns. We can count all the missing values for each row and sort the resulting Series from the highest to lowest. This will reveal the colleges that have missing values:
    >>> (
    ...     college_ugds.isnull()
    ...     .sum(axis="columns")
    ...     .sort_values(ascending=False)
    ...     .head()
    ... )
    INSTNM
    Excel Learning Center-San Antonio South         9
    Philadelphia College of Osteopathic Medicine    9
    Assemblies of God Theological Seminary          9
    Episcopal Divinity School                       9
    Phillips Graduate Institute                     9
    dtype: int64
    
  3. Now that we have seen the colleges that are missing all their race columns, we can use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:
    >>> college_ugds = college_ugds.dropna(how="all")
    >>>; college_ugds.isnull().sum()
    UGDS_WHITE    0
    UGDS_BLACK    0
    UGDS_HISP     0
    UGDS_ASIAN    0
    UGDS_AIAN     0
    UGDS_NHPI     0
    UGDS_2MOR     0
    UGDS_NRA      0
    UGDS_UNKN     0
    dtype: int64
    
  4. There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method, .ge, to return a DataFrame with a Boolean value for each cell:
    >>> college_ugds.ge(0.15)
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False        True  ...     False      False
    Universit...        True        True  ...     False      False
    Amridge U...        True        True  ...     False       True
    Universit...        True       False  ...     False      False
    Alabama S...       False        True  ...     False      False
    ...                  ...         ...  ...       ...        ...
    Hollywood...        True        True  ...     False      False
    Hollywood...       False        True  ...     False      False
    Coachella...        True       False  ...     False      False
    Dewey Uni...       False       False  ...     False      False
    Coastal P...        True        True  ...     False      False
    
  5. From here, we can use the .sum method to count the True values for each college. Notice that a Series is returned:
    >>> diversity_metric = college_ugds.ge(0.15).sum(
    ...     axis="columns"
    ... )
    >>> diversity_metric.head()
    INSTNM
    Alabama A & M University               1
    University of Alabama at Birmingham    2
    Amridge University                     3
    University of Alabama in Huntsville    1
    Alabama State University               1
    dtype: int64
    
  6. To get an idea of the distribution, we will use the .value_counts method on this Series:
    >>> diversity_metric.value_counts()
    1    3042
    2    2884
    3     876
    4      63
    0       7
    5       2
    dtype: int64
    
  7. Amazingly, two schools have more than 15% in five different race categories. Let's sort the diversity_metric Series to find out which ones they are:
    >>> diversity_metric.sort_values(ascending=False).head()
    INSTNM
    Regency Beauty Institute-Austin          5
    Central Texas Beauty College-Temple      5
    Sullivan and Cogliano Training Center    4
    Ambria College of Nursing                4
    Berkeley College-New York                4
    dtype: int64
    
  8. It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools. We will use .loc to select rows based on the index label:
    >>> college_ugds.loc[
    ...     [
    ...         "Regency Beauty Institute-Austin",
    ...         "Central Texas Beauty College-Temple",
    ...     ]
    ... ]
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Regency B...      0.1867      0.2133  ...       0.0     0.2667
    Central T...      0.1616      0.2323  ...       0.0     0.1515
    
  9. It appears that several categories were aggregated into the unknown and two or more races column. Regardless of this, they both appear to be quite diverse. We can see how the top five US News schools fared with this basic diversity metric:
    >>> us_news_top = [
    ...     "Rutgers University-Newark",
    ...     "Andrews University",
    ...     "Stanford University",
    ...     "University of Houston",
    ...     "University of Nevada-Las Vegas",
    ... ]
    >>> diversity_metric.loc[us_news_top]
    INSTNM
    Rutgers University-Newark         4
    Andrews University                3
    Stanford University               3
    University of Houston             3
    University of Nevada-Las Vegas    3
    dtype: int64
    

How it works...

Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.

In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values after the dropna method was performed. If there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0.

Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'.

The .value_counts method is used in step 6 to produce a distribution of our diversity metric. It is quite rare for schools to have three races with 15% or more of the undergraduate student population. Step 7 and step 8 find two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.

Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.

There's more...

Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:

>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

We can also determine if any school has all nine race categories exceeding 1%:

>>> (college_ugds > 0.01).all(axis=1).any()
True
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020
Publisher: Packt
ISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image