Learning about shuffle partitions
In this recipe, you will learn how to set the spark.sql.shuffle.partitions parameter and see the impact on performance when it is tuned to use fewer partitions.
Most of the time, wide transformations, which require data from other partitions, cause Spark to perform a data shuffle. You can't avoid such transformations, but you can configure parameters to reduce their impact on performance.
Wide transformations use shuffle partitions to shuffle data. However, irrespective of the data's size or the number of executors, the number of shuffle partitions defaults to 200.
The data shuffle procedure is triggered by transformations such as join(), union(), groupByKey(), reduceByKey(), and so on. The spark.sql.shuffle.partitions configuration sets the number of partitions to use during data shuffling; unless changed, it remains at the default of 200.
Getting ready
You can...