Aggregation operations
We have seen how to apply an operation to every row in a DataFrame to create a new column, and how to use filters to build new DataFrames containing a subset of the rows of the original. The last set of operations on DataFrames is grouping, equivalent to the GROUP BY statement in SQL. Let's calculate the average BMI for smokers and non-smokers. We must first tell Spark to group the DataFrame by a column (the isSmoker column, in this case), and then apply an aggregation operation (averaging, in this case) to reduce each group:
scala> val smokingDF = readingsWithBmiDF.groupBy("isSmoker").agg(avg("BMI"))
smokingDF: org.apache.spark.sql.DataFrame = [isSmoker: boolean, AVG(BMI): double]
This has created a new DataFrame with two columns: the grouping column and the column over which we aggregated. Let's show this DataFrame:
scala> smokingDF.show
+--------+------------------+
|isSmoker|          AVG(BMI)|
+...
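The avg function used here is one of the aggregation functions defined in org.apache.spark.sql.functions; in a standalone program (outside the Spark shell) it must be imported explicitly. We are also not limited to a single aggregation: agg accepts several column expressions at once, and alias lets us give the resulting columns more readable names than AVG(BMI). The following is a minimal sketch reusing readingsWithBmiDF from above; the output column names avgBMI, maxBMI, and nReadings are our own choices, not part of the original example:

import org.apache.spark.sql.functions.{avg, count, max}

// Group by smoking status and compute several aggregations in one pass;
// alias gives each output column a more readable name.
val smokingStatsDF = readingsWithBmiDF
  .groupBy("isSmoker")
  .agg(
    avg("BMI").alias("avgBMI"),
    max("BMI").alias("maxBMI"),
    count("BMI").alias("nReadings")
  )

Calling smokingStatsDF.show would then print one row per smoking status, with the three aggregated columns alongside the grouping column.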