Optimizing Spark SQL performance
In the previous section, you learned how the Catalyst optimizer runs user code through a series of optimization steps until an optimal execution plan is derived. To take advantage of the Catalyst optimizer, prefer code that goes through the Spark SQL engine, that is, the Spark SQL and DataFrame APIs, and avoid RDD-based code as much as possible. Catalyst also has no visibility into user-defined functions (UDFs), so UDF-heavy queries can end up with sub-optimal plans and degraded performance. For this reason, prefer built-in functions over UDFs; when a UDF is unavoidable, define it in Scala or Java and invoke it from the SQL and Python APIs.
Though Spark SQL supports file-based formats such as CSV and JSON, it is recommended to use binary, serialized formats such as Parquet, Avro, and ORC. Text-based formats such as CSV and JSON incur performance costs, firstly during the schema inference phase, as Spark must make an initial pass over the data to infer column names and types, and again at read time, because every value has to be parsed from text rather than read in its native binary representation.