Grouping data in Spark and different Spark joins
We will start with one of the most important data manipulation techniques: grouping and joining data. During data exploration, grouping data based on different criteria is essential to analysis. We will first look at how we can group data using groupBy.
Using groupBy in a DataFrame
We can group data in a DataFrame based on different criteria – for example, by one or more of its columns. We can also apply different aggregations, such as sum or average, to the grouped data to get a holistic view of each data slice.
For this purpose, Spark provides the groupBy operation. It is similar to GROUP BY in SQL in that we can perform group-wise operations on the grouped datasets. Moreover, we can specify multiple grouping criteria in a single groupBy statement. The following example shows how to use groupBy in PySpark. We will use the DataFrame salary data we created...
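Since the salary DataFrame created earlier is not reproduced in this excerpt, the following is a minimal sketch with a small made-up salary dataset; the column names department, role, and salary, as well as the SparkSession setup, are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("groupby-example").getOrCreate()

# Hypothetical salary data standing in for the DataFrame created earlier;
# the column names are assumptions made for this sketch.
salary_df = spark.createDataFrame(
    [
        ("Engineering", "Developer", 95000),
        ("Engineering", "Manager", 120000),
        ("Sales", "Executive", 70000),
        ("Sales", "Manager", 90000),
        ("Engineering", "Developer", 88000),
    ],
    ["department", "role", "salary"],
)

# Group by a single column and apply aggregations to each group
salary_df.groupBy("department").agg(
    F.sum("salary").alias("total_salary"),
    F.avg("salary").alias("average_salary"),
).show()

# Multiple grouping criteria can be passed in a single groupBy statement
salary_df.groupBy("department", "role").agg(
    F.avg("salary").alias("average_salary")
).show()
```

The agg call lets us compute several aggregations over the same grouped data in one pass, and aliasing the resulting columns keeps the output readable.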