Grouping data in Spark and different Spark joins
We will start with one of the most important data manipulation techniques: grouping and joining data. During data exploration, grouping data based on different criteria is essential to analysis. We will first look at how we can group data using groupBy.
Using groupBy in a DataFrame
We can group data in a DataFrame based on different criteria – for example, by one or more of its columns. We can also apply different aggregations, such as sum or average, to the grouped data to get a holistic view of each data slice.
For this purpose, Spark provides the groupBy operation. It is similar to GROUP BY in SQL in that we can perform group-wise operations on the grouped datasets. Moreover, we can specify multiple grouping criteria in a single groupBy statement. The following example shows how to use groupBy in PySpark. We will use the DataFrame salary data we created...
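Since the salary DataFrame created earlier is not reproduced in this excerpt, the following is a minimal sketch with a small made-up salary dataset; the column names department, role, and salary, as well as the SparkSession setup, are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("groupby-example").getOrCreate()

# Hypothetical salary data standing in for the DataFrame created earlier;
# the column names are assumptions made for this sketch.
salary_df = spark.createDataFrame(
    [
        ("Engineering", "Developer", 95000),
        ("Engineering", "Manager", 120000),
        ("Sales", "Executive", 70000),
        ("Sales", "Manager", 90000),
        ("Engineering", "Developer", 88000),
    ],
    ["department", "role", "salary"],
)

# Group by a single column and apply aggregations to each group
salary_df.groupBy("department").agg(
    F.sum("salary").alias("total_salary"),
    F.avg("salary").alias("average_salary"),
).show()

# Multiple grouping criteria can be passed in a single groupBy statement
salary_df.groupBy("department", "role").agg(
    F.avg("salary").alias("average_salary")
).show()
```

The agg call lets us compute several aggregations over the same grouped data in one pass, and aliasing the resulting columns keeps the output readable.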