You're reading from Big Data Analysis with Python: Combine Spark and Python to unlock the powers of parallel computing and machine learning

Product type Paperback

Published in Apr 2019

Publisher Packt

ISBN-13 9781789955286

Length 276 pages

Edition 1st Edition

Languages

Python

Tools

Combine

Concepts

Big Data

Authors (3):

Ivan Marin

Ankit Shukla

Sarang VK

Table of Contents (11) Chapters

Big Data Analysis with Python

Preface

1. The Python Data Science Stack

2. Statistical Visualizations

3. Working with Big Data Frameworks

4. Diving Deeper with Spark

5. Handling Missing Values and Correlation Analysis

6. Exploratory Data Analysis

7. Reproducibility in Big Data Analysis

8. Creating a Full Analysis Report

Appendix

Aggregation and Grouping

After getting the dataset, our analyst may have to answer a few questions. For example, we know the value of the radionuclide concentration per city, but an analyst may be asked to answer: which state, on average, has the highest radionuclide concentration?

To answer questions like this, we need to group the data and calculate an aggregation on each group. But before we go into grouping data, we have to prepare the dataset so that we can manipulate it efficiently. Getting the right types in a pandas DataFrame can boost performance considerably and helps enforce data consistency: it ensures that numeric data really is numeric and lets us run the operations we need to get the answers.
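As a minimal sketch of this preparation step, assuming a toy DataFrame whose values are invented for illustration (they are not taken from the RadNet data), `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN, and `astype('category')` stores repeated labels compactly:

```python
import pandas as pd

# Hypothetical toy data; the real dataset's columns and values will differ.
raw = pd.DataFrame({
    'State': ['CA', 'CA', 'NY', 'NY'],
    'Cs-134': ['0.5', 'Non-detect', '1.5', '2.5'],
})

typed = raw.copy()
# Entries that cannot be parsed as numbers become NaN instead of raising.
typed['Cs-134'] = pd.to_numeric(typed['Cs-134'], errors='coerce')
# Repeated labels are stored once and referenced by integer codes.
typed['State'] = typed['State'].astype('category')

print(typed.dtypes)
```

Whether to coerce bad entries to NaN or to fail loudly is a design choice; coercion is used here only because it keeps the sketch short.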

GroupBy gives a more general view of a feature by arranging the data according to a grouping key and then applying an aggregation operation to each group. In pandas, this is done with the groupby method on a selected column, such as State, followed by the aggregation operation. Some examples of the operations that can be applied are as follows:

mean
median
std (standard deviation)
mad (mean absolute deviation; deprecated in pandas 1.5 and removed in pandas 2.0)
sum
count
abs (element-wise absolute value; a transformation rather than a reduction)

Note

Several statistics, such as mean and standard deviation, only make sense with numeric data.
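The group-then-aggregate pattern described above can be sketched on a small, made-up DataFrame. The names `df_toy`, `state_mean`, and `state_count` are hypothetical and used only for this illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for per-city measurements.
df_toy = pd.DataFrame({
    'State': ['CA', 'CA', 'NY', 'NY'],
    'Cs-134': [1.0, 3.0, 2.0, 6.0],
})

# groupby only defines the groups; nothing is computed yet.
grouped = df_toy.groupby('State')

# The aggregation that follows reduces each group to a single value.
state_mean = grouped['Cs-134'].mean()
state_count = grouped['Cs-134'].count()

print(state_mean)
```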

After applying groupby, a specific column can be selected and aggregated, or all the remaining columns can be aggregated by the same function. As in SQL, groupby can take more than one grouping column at a time, and different aggregation operations can be applied to different selected columns.
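A brief sketch of both options, again on invented data (the city names and values are assumptions for illustration): grouping by two keys, then aggregating either one selected column or all remaining columns.

```python
import pandas as pd

# Hypothetical toy data with two grouping keys and two numeric columns.
df_toy = pd.DataFrame({
    'State': ['CA', 'CA', 'CA', 'NY'],
    'Location': ['Fresno', 'Fresno', 'Eureka', 'Albany'],
    'Cs-134': [1.0, 3.0, 5.0, 7.0],
    'Te-129': [10.0, 20.0, 30.0, 40.0],
})

# Two grouping keys produce one group per (State, Location) pair.
by_city = df_toy.groupby(['State', 'Location'])

# Aggregate a single selected column...
cs_mean = by_city['Cs-134'].mean()
# ...or every remaining column with the same function.
all_mean = by_city.mean()

print(all_mean)
```

The result is indexed by a MultiIndex built from the two grouping keys.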

The groupby method in pandas has some options, such as as_index. By default, the grouping key columns become the index of the result; passing as_index=False overrides this and leaves them as normal columns. This is helpful when, for example, a new index will be created after the groupby operation.
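The effect of `as_index` can be seen by comparing the two results side by side. This is a sketch on made-up data; `indexed` and `flat` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy data.
df_toy = pd.DataFrame({
    'State': ['CA', 'CA', 'NY'],
    'Cs-134': [1.0, 3.0, 5.0],
})

# Default behaviour: the grouping key becomes the index of the result.
indexed = df_toy.groupby('State').mean()

# as_index=False: the grouping key stays a regular column
# and the result gets a fresh integer index.
flat = df_toy.groupby('State', as_index=False).mean()

print(indexed)
print(flat)
```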

Aggregation operations can be applied to several columns, with different statistical methods, at the same time using the agg method, which accepts a dictionary with column names as keys and lists of statistical operations as values.

Exercise 6: Aggregation and Grouping Data

Remember that we have to answer the question of which state has, on average, the highest radionuclide concentration. As there are several cities per state, we have to combine the values of all cities in one state and calculate the average. This is one of the applications of GroupBy: calculating the average values of one variable as per a grouping. We can answer the question using GroupBy:

Import the required libraries:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Load the dataset, downloaded from https://opendata.socrata.com/:
```
df = pd.read_csv('RadNet_Laboratory_Analysis.csv')
```
Group the DataFrame using the State column.
```
df.groupby('State')
```
Select the radionuclide Cs-134 and calculate the average value per group:
```
df.groupby('State')['Cs-134'].mean().head()
```
Do the same for all numeric columns, grouping per state and applying the mean function directly:
```
# numeric_only=True skips text columns, which raise a TypeError in pandas 2.0+
df.groupby('State').mean(numeric_only=True).head()
```
Now, group by more than one column, using a list of grouping columns (State and Location), and apply several aggregation operations per column with the agg method:
```
df.groupby(['State', 'Location']).agg({'Cs-134':['mean', 'std'], 'Te-129':['min', 'max']})
```
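The steps above compute per-group statistics but stop short of naming the state with the highest average. One way to finish the analysis is sketched below on invented values (not RadNet results); `state_means`, `ranked`, and `top_state` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df_toy = pd.DataFrame({
    'State': ['CA', 'CA', 'NY', 'NY', 'TX'],
    'Cs-134': [1.0, 3.0, 2.0, 6.0, 0.5],
})

# Average concentration per state.
state_means = df_toy.groupby('State')['Cs-134'].mean()

# Full ranking, highest average first.
ranked = state_means.sort_values(ascending=False)

# Label of the group with the largest mean.
top_state = state_means.idxmax()

print(top_state, state_means.max())
```

With the real dataset, the same pattern would apply to `df` once the Cs-134 column has a numeric type.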

NumPy on Pandas

NumPy functions can be applied to DataFrames directly or through the apply and applymap methods (applymap was deprecated in pandas 2.1 in favor of DataFrame.map). Other NumPy functions, such as np.where, also work with DataFrames.
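A short sketch of the three styles mentioned here, using made-up values and hypothetical variable names: a NumPy function applied directly, a NumPy-based function passed to apply, and np.where used for a conditional label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df_toy = pd.DataFrame({
    'Cs-134': [1.0, 4.0, 9.0],
    'Te-129': [16.0, 25.0, 36.0],
})

# A NumPy ufunc applied directly works element-wise and returns a DataFrame.
roots = np.sqrt(df_toy)

# apply passes each column to the function; here, the range of each column.
col_range = df_toy.apply(lambda col: np.max(col) - np.min(col))

# np.where builds an array from a condition on a column.
flag = np.where(df_toy['Cs-134'] > 2.0, 'high', 'low')

print(roots)
print(col_range)
print(flag)
```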

Authors (3)

Ivan Marin

Ivan Marin is a systems architect and data scientist working at Daitan Group, a Campinas-based software company. He designs big data systems for large volumes of data and implements machine learning pipelines end to end using Python and Spark. He is also an active organizer of data science, machine learning, and Python in São Paulo, and has given Python for data science courses at university level.

Ankit Shukla

Ankit Shukla is a data scientist working with World Wide Technology, a leading US-based technology solution provider, where he develops and deploys machine learning and artificial intelligence solutions to solve business problems and create actual dollar value for clients. He is also part of the company's R&D initiative, which is responsible for producing intellectual property, building capabilities in new areas, and publishing cutting-edge research in corporate white papers. Besides tinkering with AI/ML models, he likes to read and is a big-time foodie.

Sarang VK

Sarang VK is a lead data scientist at StraitsBridge Advisors, where his responsibilities include requirement gathering, solutioning, development, and productization of scalable machine learning, artificial intelligence, and analytical solutions using open source technologies. Alongside this, he supports pre-sales and competency.
