Understanding built-in Spark functions
Spark provides a rich set of built-in functions for working with DataFrames. Because these functions are part of Spark itself, the Catalyst optimizer understands them and can optimize any query plan that uses them. Catalyst is the query optimizer at the core of Spark SQL: it analyzes our DataFrame code and rewrites it into an efficient execution plan. It works very well with Spark DataFrames and built-in functions, including higher-order functions. A UDF, however, is treated by Catalyst as a black box; the optimizer cannot look inside it, and this is where we see performance bottlenecks.
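To make the distinction concrete, here is a minimal sketch (not the book's example; the tiny DataFrame and the column name are made up for illustration) that performs the same transformation once with a built-in function and once with a UDF:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, upper
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alpha",), ("bravo",)], ["code"])

# Built-in function: Catalyst sees the full expression and can optimize it
df.select(upper(col("code")).alias("code_upper")).explain()

# Equivalent UDF: Catalyst only sees an opaque call into Python
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df.select(upper_udf(col("code")).alias("code_upper")).explain()

Comparing the two physical plans printed by explain() typically shows the UDF version inserting a BatchEvalPython step, which is exactly the part of the plan Catalyst cannot reason about.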
To learn about all the built-in functions available in PySpark, check out the official documentation at https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html.
In the following example, we will see the performance difference between Spark's built-in (higher-order) functions and UDFs:
- Let's begin by creating a Spark DataFrame in a new cell:
from pyspark.sql.types import *

manual_schema = StructType([
    StructField('Year', IntegerType(), True),
    StructField(...
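The schema definition is truncated above, so as a stand-in for the rest of the walkthrough, the following sketch shows one way the built-in-versus-UDF comparison can be timed once a DataFrame is in place. Everything here is an assumption for illustration rather than the book's exact steps: the synthetic DataFrame generated with spark.range(), the Year arithmetic, and the simple wall-clock timing.

import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.functions import max as spark_max
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Synthetic stand-in for the real dataset: a few million rows with a Year column
df = spark.range(0, 5_000_000).withColumn("Year", (col("id") % 50 + 1980).cast("int"))

# Built-in column expression: stays inside the JVM and is optimized by Catalyst
start = time.time()
df.select((col("Year") + 1).alias("NextYear")).agg(spark_max("NextYear")).collect()
print(f"Built-in expression took {time.time() - start:.2f}s")

# Equivalent Python UDF: every row is shipped to a Python worker, so it is
# usually noticeably slower and remains a black box to Catalyst
add_one = udf(lambda y: y + 1, IntegerType())
start = time.time()
df.select(add_one(col("Year")).alias("NextYear")).agg(spark_max("NextYear")).collect()
print(f"Python UDF took {time.time() - start:.2f}s")

Running both branches end to end (the agg plus collect forces full evaluation despite Spark's lazy execution) makes the cost of crossing the JVM-to-Python boundary for each row easy to see.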