The Catalyst Optimizer is one of the most exciting developments in Apache Spark. It largely frees you from having to hand-tune data processing pipelines for efficiency, because the optimizer does that work for you.
In this chapter, we introduce the Catalyst Optimizer of Apache Spark SQL, which optimizes queries expressed through SQL, DataFrames, and Datasets.
This chapter will cover the following topics:
- The catalog
- Abstract syntax trees
- The optimization process on logical and physical execution plans
- Code generation
- A practical code walk-through