Chapter 3: Understanding Spark Query Execution
To write efficient Spark applications, we need some understanding of how Spark executes queries. This knowledge helps big data developers and engineers work effectively with large volumes of data.
Query execution is a broad subject, so in this chapter we will start by understanding jobs, stages, and tasks. Then, we will learn how Spark's lazy evaluation works. Following this, we will learn how to check and understand the execution plan when working with DataFrames or SparkSQL. Later, we will learn how joins work in Spark and the different join algorithms Spark uses when joining two tables. Finally, we will learn about input, output, and shuffle partitions, and the storage benefits of using different file formats.
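As a quick preview of the kind of inspection we will use throughout this chapter, the following minimal sketch shows how an execution plan can be printed for a DataFrame query. PySpark is assumed here purely for illustration, and the application name and sample data are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes a local Spark installation; the app name is illustrative only.
spark = SparkSession.builder.appName("explain-preview").getOrCreate()

# A tiny in-memory DataFrame, purely for demonstration.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# explain() prints the physical plan; explain(True) also shows the
# parsed, analyzed, and optimized logical plans.
df.filter(df.id > 1).explain(True)

spark.stop()
```

We will return to reading the output of explain() in detail when we cover execution plans later in this chapter.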
Knowing about these internals will help you troubleshoot and debug your Spark applications more efficiently. By the end of this chapter, you will know how Spark breaks a query into jobs, stages, and tasks, how to read and interpret execution plans, how Spark joins two tables, and how partitioning and file formats affect performance and storage.