Addressing data skew
In Spark, data resides in "partitions," which determine how the data is divided among worker nodes so that work can proceed in parallel. Ideally, the data is spread evenly across partitions, so the load on the workers is uniform and cluster resources are used efficiently. Data skew is a condition in which a table's data is unevenly distributed among the partitions in the cluster.

Skew has several negative consequences, chief among them degraded query performance, especially for queries that involve joins. Joins typically trigger a shuffle, and when the join keys are skewed, the work lands unevenly on the workers: a few of them do most of the heavy lifting, prolonging the query response time and wasting compute on the idle ones. Let's look at the four main types of joins:
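To make the problem concrete, here is a minimal plain-Python sketch (not Spark code) of how hash partitioning behaves when one "hot" key dominates: the hot key's records all hash to the same partition, so one partition balloons while the others stay small. The key names and partition count are illustrative assumptions.

```python
from collections import Counter

def partition_counts(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Evenly distributed keys: each partition receives a similar share.
uniform = [f"user_{i}" for i in range(10_000)]

# Skewed keys: one hot key accounts for 90% of the records, so the
# partition it hashes to is overloaded while the rest stay light.
skewed = ["hot_key"] * 9_000 + [f"user_{i}" for i in range(1_000)]

for name, keys in [("uniform", uniform), ("skewed", skewed)]:
    sizes = partition_counts(keys, num_partitions=8)
    print(f"{name}: max partition = {max(sizes)}, min partition = {min(sizes)}")
```

In a real Spark job the same effect appears after a shuffle on a skewed join key: the task processing the hot partition runs far longer than its peers, and the stage cannot finish until it does.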
- Broadcast Hash Join:
- Requires one side of the join to be small.
- No shuffle or sort is involved...