Dealing with join performance issues between big fact tables and small dimension tables in ETL workloads
In a scenario where you are joining a big fact table with a small dimension table, Spark can apply the join operation using two different join techniques: a Sort Merge or Shuffle Hash join if both tables are large, or a Broadcast join if one of the underlying tables is small enough to fit in the memory of every executor.
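Spark makes this choice automatically based on the spark.sql.autoBroadcastJoinThreshold setting, which defaults to 10 MB. The following is a minimal PySpark sketch of tuning that threshold; the 50 MB value is only an illustrative assumption, not a recommendation for any particular workload:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-threshold-sketch").getOrCreate()

# Tables whose estimated size is below this threshold (in bytes) are broadcast
# automatically; the Spark default is 10 MB (10485760 bytes).
# The 50 MB value below is only an example.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Setting the threshold to -1 disables automatic broadcast joins altogether.
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```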
A broadcast join can significantly improve performance and helps with optimizing join operations. A regular join can result in a large data shuffle across the network between the different executors running on multiple workers, which can lead to out-of-memory (OOM) errors or to data spilling to the physical disks of those workers. When using a broadcast join, you must ensure the smaller table is broadcast to the executors running on the worker nodes. By doing so, each executor holds a full copy of the dimension table and can join its local partitions of the fact table without shuffling the large table across the network.
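The following is a minimal PySpark sketch of forcing a broadcast join with the broadcast() hint. The sales_fact and product_dim DataFrames and the product_id join key are illustrative assumptions standing in for a real fact and dimension table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical data standing in for a large fact table and a small dimension table.
sales_fact = spark.createDataFrame(
    [(1, 101, 2), (2, 102, 5), (3, 101, 1)],
    ["order_id", "product_id", "quantity"],
)
product_dim = spark.createDataFrame(
    [(101, "Laptop"), (102, "Monitor")],
    ["product_id", "product_name"],
)

# broadcast() hints Spark to ship the small dimension table to every executor,
# so each executor joins its local fact partitions without shuffling the fact table.
joined = sales_fact.join(broadcast(product_dim), on="product_id", how="inner")
joined.show()
```

Even without the explicit hint, Spark will choose a broadcast join on its own when the dimension table's estimated size falls below the configured threshold; the hint is useful when the size statistics are missing or inaccurate.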