Understanding bucketing in Spark
Bucketing is an optimization technique that helps avoid shuffling and sorting data during compute-heavy operations such as joins. Based on the bucketing columns we specify, data is divided into a fixed number of buckets. Bucketing is similar to partitioning, but whereas partitioning creates a directory for each partition, bucketing creates a fixed number of equal-sized buckets and distributes rows across them by hashing the values of the bucketing column. Partitioning is most helpful when filtering data, whereas bucketing pays off during joins.
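The hash-based distribution described above can be illustrated with a minimal pure-Python sketch. Note that this is only an analogy: Spark actually uses a Murmur3 hash of the bucketing column, while this sketch substitutes Python's built-in `hash`, and the key values are made up for illustration:

```python
# Sketch: each row is assigned to bucket = hash(key) % num_buckets.
# Python's hash() stands in for Spark's Murmur3 hash here.
num_buckets = 4
customer_ids = [101, 102, 103, 104, 105]  # illustrative key values

bucket_of = {cid: hash(cid) % num_buckets for cid in customer_ids}
for cid, b in bucket_of.items():
    print(f"customer_id={cid} -> bucket {b}")
```

Two tables bucketed the same way on the same key place matching rows in matching buckets, which is what lets Spark join them without a shuffle.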
Bucketing is often worthwhile on dimension tables that contain the primary keys used for joining, and also when joining a small table with a large one. In this section, we will go through a quick example to understand how to implement bucketing on a Hive table:
- We will begin by creating a Spark DataFrame in a new cell. Run the following code block:
from...