Aggregation and Grouping
After getting the dataset, our analyst may have to answer a few questions. For example, we know the value of the radionuclide concentration per city, but an analyst may be asked to answer: which state, on average, has the highest radionuclide concentration?
To answer the questions posed, we need to group the data somehow and calculate an aggregation on it. But before we go into grouping data, we have to prepare the dataset so that we can manipulate it in an efficient manner. Getting the right types in a pandas DataFrame can be a huge boost for performance and can be leveraged to enforce data consistency— it makes sure that numeric data really is numeric and allows us to execute operations that we want to use to get the answers.
GroupBy allows us to get a more general view of a feature, arranging data given a GroupBy key and an aggregation operation. In pandas, this operation is done with the GroupBy method, over a selected column, such as State. Note the aggregation operation after the GroupBy method. Some examples of the operations that can be applied are as follows:
mean
median
std (standard deviation)
mad (mean absolute deviation)
sum
count
abs
Note
Several statistics, such as mean and standard deviation, only make sense with numeric data.
After applying GroupBy, a specific column can be selected and the aggregation operation can be applied to it, or all the remaining columns can be aggregated by the same function. Like SQL, GroupBy can be applied to more than one column at a time, and more than one aggregation operation can be applied to selected columns, one operation per column.
The GroupBy command in Pandas has some options, such as as_index, which can override the standard of transforming grouping key's columns to indexes and leaving them as normal columns. This is helpful when a new index will be created after the GroupBy operation, for example.
Aggregation operations can be done over several columns and different statistical methods at the same time with the agg method, passing a dictionary with the name of the column as the key and a list of statistical operations as values.
Exercise 6: Aggregation and Grouping Data
Remember that we have to answer the question of which state has, on average, the highest radionuclide concentration. As there are several cities per state, we have to combine the values of all cities in one state and calculate the average. This is one of the applications of GroupBy: calculating the average values of one variable as per a grouping. We can answer the question using GroupBy:
Import the required libraries:
import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns
Load the datasets from the https://opendata.socrata.com/:
df = pd.read_csv('RadNet_Laboratory_Analysis.csv')
Group the DataFrame using the State column.
df.groupby('State')
Select the radionuclide Cs-134 and calculate the average value per group:
df.groupby('State')['Cs-134'].head()
Do the same for all columns, grouping per state and applying directly the mean function:
df.groupby('State').mean().head()
Now, group by more than one column, using a list of grouping columns.
Aggregate using several aggregation operations per column with the agg method. Use the State and Location columns:
df.groupby(['State', 'Location']).agg({'Cs-134':['mean', 'std'], 'Te-129':['min', 'max']})
NumPy on Pandas
NumPy functions can be applied to DataFrames directly or through the apply and applymap methods. Other NumPy functions, such as np.where, also work with DataFrames.