Chapter 7: Spark Core
Performance tuning in Apache Spark plays an instrumental role in running efficient big data workloads. More often than not, the techniques that yield the largest gains are those that reduce data shuffling across the cluster and mitigate data skew. In this chapter, we will learn about the optimization techniques in Spark Core that address these two problems.
We will begin by learning about broadcast joins and how they differ from traditional joins in Spark. Next, we will learn about Apache Arrow, its integration with the pandas library, and how it improves the performance of pandas code in Azure Databricks. We will also learn about shuffle partitions, Spark caching, and adaptive query execution (AQE). Shuffle partitions can often become performance bottlenecks, so it is important to learn how to tune them. Spark caching is another popular optimization technique that helps to speed up queries on the same data...