Big Data Analysis with Python

You're reading from Big Data Analysis with Python: Combine Spark and Python to unlock the powers of parallel computing and machine learning

Product type Paperback
Published in Apr 2019
Publisher Packt
ISBN-13 9781789955286
Length 276 pages
Edition 1st Edition
Authors (3): Ivan Marin, Sarang VK, Ankit Shukla

Aggregation and Grouping


After getting the dataset, our analyst may have to answer a few questions. For example, we know the value of the radionuclide concentration per city, but an analyst may be asked to answer: which state, on average, has the highest radionuclide concentration?

To answer questions like this, we need to group the data and calculate an aggregation over each group. Before grouping, however, we have to prepare the dataset so that we can manipulate it efficiently. Getting the right types in a pandas DataFrame can boost performance considerably and also helps enforce data consistency: it ensures that numeric data really is numeric and lets us run the operations we need to get the answers.
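
As a minimal sketch of this preparation step, the snippet below converts a text column to numeric values and stores a repeated label column as a categorical. The toy DataFrame and the "Non-detect" marker are made up for illustration; the real dataset may encode missing measurements differently.

```python
import pandas as pd

# Hypothetical toy data: measurements arrive as strings, one of them non-numeric.
df = pd.DataFrame({
    "State": ["CA", "CA", "NY", "NY"],
    "Cs-134": ["0.5", "Non-detect", "1.5", "2.5"],
})

# Coerce the measurement column to floats; entries that cannot be parsed become NaN.
df["Cs-134"] = pd.to_numeric(df["Cs-134"], errors="coerce")

# Store the repeated state labels as a categorical column.
df["State"] = df["State"].astype("category")

print(df.dtypes)
```

With errors="coerce", values that cannot be parsed are turned into NaN instead of raising an error, so it is worth checking how many values were lost before aggregating.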

GroupBy allows us to get a more general view of a feature by arranging the data according to a grouping key and applying an aggregation operation. In pandas, this is done with the groupby method over a selected column, such as State, followed by the aggregation operation. Some examples of the operations that can be applied are as follows:

  • mean

  • median

  • std (standard deviation)

  • mad (mean absolute deviation; deprecated in pandas 1.5 and removed in pandas 2.0)

  • sum

  • count

  • abs (absolute value; an element-wise operation on the values rather than a per-group summary)

Note

Several statistics, such as mean and standard deviation, only make sense with numeric data.

After applying groupby, we can select a specific column and apply the aggregation operation to it, or aggregate all the remaining columns with the same function. As in SQL, groupby can be applied to more than one column at a time, and different aggregation operations can be applied to different selected columns.
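
The options described above might look like the following sketch. The small DataFrame is invented for illustration; only the column names State, Location, Cs-134, and Te-129 come from the exercise in this section.

```python
import pandas as pd

# Invented sample data using the column names from this chapter.
df = pd.DataFrame({
    "State": ["CA", "CA", "NY", "NY"],
    "Location": ["LA", "SF", "NYC", "NYC"],
    "Cs-134": [1.0, 3.0, 2.0, 4.0],
    "Te-129": [10.0, 20.0, 30.0, 50.0],
})

# Select one column after grouping and aggregate only that column.
state_mean = df.groupby("State")["Cs-134"].mean()

# Aggregate every remaining numeric column with the same function.
all_means = df.groupby("State").mean(numeric_only=True)

# Group by more than one column by passing a list of keys.
by_state_location = df.groupby(["State", "Location"])["Cs-134"].mean()

print(state_mean)
print(all_means)
print(by_state_location)
```

Grouping by a list of keys produces a result with a MultiIndex, one level per grouping column.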

The groupby method in pandas has some options, such as as_index. By default, the grouping key columns become the index of the result; passing as_index=False overrides this and leaves them as normal columns. This is helpful when, for example, a new index will be created after the groupby operation.
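
A short sketch of the difference, again on invented data: with the default setting the grouping key ends up in the index, while as_index=False keeps it as a regular column.

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "State": ["CA", "CA", "NY", "NY"],
    "Cs-134": [1.0, 3.0, 2.0, 4.0],
})

# Default: State becomes the index of the resulting Series.
indexed = df.groupby("State")["Cs-134"].mean()

# as_index=False: State stays a normal column and the rows get a default integer index.
flat = df.groupby("State", as_index=False)["Cs-134"].mean()

print(indexed)
print(flat)
```

Calling reset_index() on the default result gives a similar flat layout.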

Aggregations over several columns, using different statistical methods for each, can be done in a single call with the agg method by passing a dictionary with the column names as keys and lists of statistical operations as values.
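
The sketch below shows the shape of the result when agg receives such a dictionary. The data is invented; the point is that the output columns form a two-level index of (column, operation) pairs.

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "State": ["CA", "CA", "NY", "NY"],
    "Cs-134": [1.0, 3.0, 2.0, 4.0],
    "Te-129": [10.0, 20.0, 30.0, 50.0],
})

# One list of operations per column; each (column, operation) pair becomes an output column.
summary = df.groupby("State").agg({
    "Cs-134": ["mean", "std"],
    "Te-129": ["min", "max"],
})

print(summary)

# Individual results are addressed with a (column, operation) tuple.
ca_mean = summary.loc["CA", ("Cs-134", "mean")]
ny_max = summary.loc["NY", ("Te-129", "max")]
```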

Exercise 6: Aggregation and Grouping Data

Remember that we have to answer the question of which state has, on average, the highest radionuclide concentration. As there are several cities per state, we have to combine the values of all cities in one state and calculate the average. This is one of the applications of GroupBy: calculating the average values of one variable as per a grouping. We can answer the question using GroupBy:

  1. Import the required libraries:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
  2. Load the dataset (obtained from https://opendata.socrata.com/):

    df = pd.read_csv('RadNet_Laboratory_Analysis.csv')
  3. Group the DataFrame using the State column. This returns a GroupBy object; nothing is computed until an aggregation is applied:

    df.groupby('State')
  4. Select the radionuclide Cs-134 and calculate the average value per group:

    df.groupby('State')['Cs-134'].mean().head()
  5. Do the same for all columns, grouping per state and applying directly the mean function:

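    # numeric_only=True skips text columns, which raise a TypeError in pandas 2.0 and later
    df.groupby('State').mean(numeric_only=True).head()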
  6. Now, group by more than one column, using a list of grouping columns.

  7. Aggregate using several aggregation operations per column with the agg method. Use the State and Location columns:

    df.groupby(['State', 'Location']).agg({'Cs-134':['mean', 'std'], 'Te-129':['min', 'max']})
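
To tie the steps back to the original question, here is a sketch that groups by a list of columns (step 6) and then ranks the states by their average concentration. The DataFrame is a stand-in with invented values, since the CSV file is not reproduced here; with the real data, df would come from pd.read_csv as in step 2.

```python
import pandas as pd

# Stand-in for RadNet_Laboratory_Analysis.csv; all values are invented.
df = pd.DataFrame({
    "State": ["AK", "AK", "TX", "TX", "TX"],
    "Location": ["Anchorage", "Fairbanks", "Austin", "Austin", "Dallas"],
    "Cs-134": [0.25, 0.75, 1.0, 2.0, 3.0],
})

# Step 6: group by more than one column using a list of grouping columns.
per_location = df.groupby(["State", "Location"])["Cs-134"].mean()

# The original question: which state has the highest average concentration?
per_state = df.groupby("State")["Cs-134"].mean().sort_values(ascending=False)
top_state = per_state.idxmax()

print(per_location)
print(per_state)
print(top_state)
```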

NumPy on Pandas

NumPy functions can be applied to DataFrames directly or through the apply and applymap methods (applymap is deprecated since pandas 2.1 in favor of DataFrame.map). Other NumPy functions, such as np.where, also work with DataFrames.
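
A brief sketch of these three usages on invented data follows. It avoids applymap, given its deprecation in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "Cs-134": [1.0, 4.0, 9.0],
    "Te-129": [16.0, 25.0, 36.0],
})

# Apply a NumPy function directly; the result is still a DataFrame.
roots = np.sqrt(df)

# Apply a NumPy function column by column through apply; the result is a Series.
col_max = df.apply(np.max)

# np.where picks a value per element based on a condition and returns a NumPy array.
flag = np.where(df["Cs-134"] > 3.0, "high", "low")

print(roots)
print(col_max)
print(flag)
```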
