You're reading from Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Product type Paperback

Published in Feb 2020

Publisher Packt

ISBN-13 9781839213106

Length 626 pages

Edition 2nd Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Theodore Petrou

Matthew Harrison

View More author details

Table of Contents (17) Chapters

Preface

1. Pandas Foundations

2. Essential DataFrame Operations FREE CHAPTER

3. Creating and Persisting DataFrames

4. Beginning Data Analysis

5. Exploratory Data Analysis

6. Selecting Subsets of Data

7. Filtering Rows

8. Index Alignment

9. Grouping for Aggregation, Filtration, and Transformation

10. Restructuring Data into a Tidy Form

11. Combining Pandas Objects

12. Time Series Analysis

13. Visualization with Matplotlib, Pandas, and Seaborn

14. Debugging and Testing Pandas

15. Other Books You May Enjoy

16. Index

Transposing the direction of a DataFrame operation

Many DataFrame methods have an axis parameter. This parameter controls the direction in which the operation takes place. Axis parameters can be 'index' (or 0) or 'columns' (or 1). I prefer the string versions are they are more explicit and tend to make the code easier to read.

Nearly all DataFrame methods default the axis parameter to 0, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.

How to do it...

Read in the college dataset; the columns that begin with UGDS represent the percentage of the undergraduate students of a particular race. Use the filter method to select these columns:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

Now that the DataFrame contains homogenous column data, operations can be sensibly done both vertically and horizontally. The .count method returns the number of non-missing values. By default, its axis parameter is set to 0:
```
>>> college_ugds.count()
UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64
```
The axis parameter is almost always set to 0. So, step 2 is equivalent to both college_ugds.count(axis=0) and college_ugds.count(axis='index').

Changing the axis parameter to 'columns' changes the direction of the operation so that we get back a count of non-missing items in each row:

>>> college_ugds.count(axis="columns").head()
INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

Instead of counting non-missing values, we can sum all the values in each row. Each row of percentages should add up to 1. The .sum method may be used to verify this:

>>> college_ugds.sum(axis="columns").head()
INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

To get an idea of the distribution of each column, the .median method can be used:

>>> college_ugds.median(axis="index")
UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
UGDS_NHPI     0.00000
UGDS_2MOR     0.01750
UGDS_NRA      0.00000
UGDS_UNKN     0.01430
dtype: float64

How it works...

The direction of operation on the axis is one of the more confusing aspects of pandas. Many pandas users have difficulty remembering the meaning of the axis parameter. I remember them by reminding myself that a Series only has one axis, the index (or 0). A DataFrame also has an index (axis 0) and columns (axis 1).

There's more...

The .cumsum method with axis=1 accumulates the race percentages across each row. It gives a slightly different view of the data. For example, it is very easy to see the exact percentage of white and black students for each school:

>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9686  ...    0.9862     1.0000
Universit...      0.5922      0.8522  ...    0.9899     0.9999
Amridge U...      0.2990      0.7182  ...    0.7285     1.0000
Universit...      0.6988      0.8243  ...    0.9650     1.0000
Alabama S...      0.0158      0.9366  ...    0.9863     1.0000

You're reading from Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Table of Contents (17) Chapters

Transposing the direction of a DataFrame operation

How to do it...

How it works...

There's more...

Authors (2)

Other recommended products

Personalised recommendations for you