Understanding shuffle partitions
Every time Spark performs a wide transformation or aggregation, data is shuffled across the nodes of the cluster. During these shuffle operations, Spark, by default, changes the number of partitions of the resulting DataFrame. For example, a DataFrame may have 10 partitions when it is created, but as soon as a shuffle occurs, Spark may repartition it into 200 partitions. These are what we call shuffle partitions.
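To make this concrete, here is a minimal PySpark sketch (not from the original text; the app name, DataFrame, and column names are illustrative) showing how the result of a wide transformation picks up the shuffle-partition setting. Note that on recent Spark versions with Adaptive Query Execution enabled, the observed count may be coalesced to a smaller number:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

# A small DataFrame built from a range; its initial partition count depends
# on the cluster's default parallelism, not on the shuffle setting
df = spark.range(0, 1000)
print(df.rdd.getNumPartitions())        # e.g. 8, depending on the cluster

# A wide transformation (groupBy + count) triggers a shuffle, so the result
# uses spark.sql.shuffle.partitions partitions
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
print(grouped.rdd.getNumPartitions())   # 200 by default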
This is the default behavior in Spark, but it can be altered to improve the performance of Spark jobs. We can confirm the default value by running the following line of code:
spark.conf.get('spark.sql.shuffle.partitions')
This returns an output of 200, which means that Spark will, by default, change the number of shuffle partitions to 200. To alter this configuration, we can run the following code, which sets the shuffle partitions to 8:
spark.conf.set('spark.sql.shuffle.partitions', 8)
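As a quick check, and assuming the SparkSession and df DataFrame from the earlier sketch, rerunning an aggregation after this change should now report 8 shuffle partitions (again, AQE may coalesce this further):

# After the conf.set call above, a fresh shuffle now produces 8 partitions
regrouped = df.groupBy((df.id % 10).alias("bucket")).count()
print(regrouped.rdd.getNumPartitions())  # 8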
You may be wondering why we...