In Spark, the optimizer's goal is to minimize end-to-end query response time. It is based on two key ideas:
Pruning unnecessary data as early as possible, for example, through filter pushdown and column pruning (a short sketch follows this list).
Minimizing per-operator cost, for example, choosing between broadcast and shuffle joins and picking an optimal join order.
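To make the first idea concrete, here is a minimal sketch in the Scala DataFrame API. It assumes a Spark 2.x session and an illustrative Parquet dataset at /tmp/events containing user_id, country, and other columns; with explain(true), the optimized logical plan and the physical plan show the filter and the narrowed read schema pushed into the Parquet scan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalyst-pruning-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative wide Parquet table of which only two columns are needed.
val events = spark.read.parquet("/tmp/events")

val result = events
  .filter($"country" === "US")   // candidate for PushDownPredicate
  .select("user_id", "country")  // candidate for ColumnPruning

// Prints the parsed, analyzed, and optimized logical plans plus the physical
// plan, where the pushed filter and pruned read schema are visible.
result.explain(true)
```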
Until Spark 2.1, Catalyst was essentially a rule-based optimizer. Most Spark SQL optimizer rules are heuristic rules: PushDownPredicate, ColumnPruning, ConstantFolding, and so on. They consider neither the cost of each operator nor the selectivity of predicates when estimating the sizes of JOIN relations. Therefore, the join order is mostly determined by the order in which the relations appear in the SQL query, and the physical join implementation is chosen based on heuristics. This can lead to suboptimal plans being generated. However, if the cardinalities are known in advance, more efficient plans can be chosen.
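As a sketch of how such cardinalities can be made available, the snippet below assumes Spark 2.2 or later, where cost-based optimization can be enabled, and an already-registered table named sales with columns customer_id and amount (all names are illustrative). It collects table- and column-level statistics with ANALYZE TABLE and opts in to the cost-based optimizer and join reordering; the broadcast threshold illustrates the size-based heuristic used to pick the physical join.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-demo")
  .master("local[*]")
  .getOrCreate()

// Size-based heuristic for the physical join: relations estimated to be
// smaller than this threshold (in bytes) are broadcast instead of shuffled.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// Opt in to cost-based optimization and cost-based join reordering.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect the table- and column-level statistics (cardinalities) the CBO uses.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```

With statistics in place, the optimizer can estimate the size of intermediate join results instead of relying purely on the textual order of the query and fixed heuristics.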