Cost-based optimizer in Apache Spark 2.2
In Spark, the optimizer's goal is to minimize end-to-end query response time. It is based on two key ideas:
Pruning unnecessary data as early as possible, for example, filter pushdown and column pruning.
Minimizing per-operator cost, for example, by choosing between broadcast and shuffle joins and by picking an optimal join order.
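Both ideas can be observed directly in a query plan. The following sketch (the Parquet path and column names are hypothetical) reads a table and prints its plans; in the output, column pruning shows up as a narrowed read schema, and filter pushdown appears as a PushedFilters entry in the Parquet scan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PruningDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical Parquet dataset with more columns than the query needs
val sales = spark.read.parquet("/tmp/sales")

// Only `amount` survives the projection and the filter targets `country`,
// so Catalyst prunes the remaining columns and pushes the predicate down
// to the Parquet reader (visible as PushedFilters in the physical plan).
sales.filter($"country" === "US")
  .select($"amount")
  .explain(true)
```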
Until Spark 2.1, Catalyst was essentially a rule-based optimizer. Most Spark SQL optimizer rules are heuristic rules: PushDownPredicate, ColumnPruning, ConstantFolding, and so on. They do not consider the cost of each operator or selectivity when estimating JOIN relation sizes. Therefore, the JOIN order is mostly decided by its position in the SQL query, and the physical join implementation is chosen based on heuristics. This can lead to suboptimal plans. However, if the cardinalities are known in advance, more efficient plans can be chosen. The goal of the cost-based optimizer (CBO) is to do exactly that, automatically.
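In Spark 2.2, the CBO is disabled by default and relies on table-level and column-level statistics collected with ANALYZE TABLE. A minimal sketch of how one might enable it (the table and column names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CboDemo")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Enable the cost-based optimizer and cost-based join reordering
// (both are off by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect table-level statistics (row count, size in bytes) and
// column-level statistics (distinct count, min/max, null count)
// that the CBO uses for cardinality estimation.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS country, amount")
```

With these statistics available, Catalyst can estimate the cardinality of each JOIN input instead of relying on a query's textual join order.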
Huawei implemented the CBO in Spark SQL initially...