Understanding the use of inferSchema
The inferSchema option is often used to have Spark infer the column data types automatically. While this approach works well for smaller datasets, it can become a performance bottleneck as the amount of data being scanned grows. To better understand the challenges of using this option with big data, we will run a couple of experiments.
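For comparison, when the column types are already known, the schema can be declared up front so that Spark does not have to scan the files to infer it. The following is a minimal sketch of that approach; the handful of columns shown is an illustrative subset rather than the dataset's full schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Illustrative subset of columns; the real airlines dataset has many more.
airlines_schema = StructType([
    StructField("Year", IntegerType(), True),
    StructField("Month", IntegerType(), True),
    StructField("DayofMonth", IntegerType(), True),
    StructField("UniqueCarrier", StringType(), True),
])

airlines_with_schema = (
    spark
    .read
    .option("header", True)
    .option("delimiter", ",")
    .schema(airlines_schema)
    .csv("dbfs:/databricks-datasets/asa/airlines/*")
)

Because the schema is supplied explicitly, Spark can start reading the data immediately instead of first sampling the files to determine the types.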
Experiment 1
In this experiment, we will re-run the code block that we ran in the previous section:
airlines_1987_to_2008 = (
    spark
    .read
    .option("header", True)
    .option("delimiter", ",")
    .option("inferSchema", True)
    .csv("dbfs:/databricks-datasets/asa/airlines/*")
)

display(airlines_1987_to_2008)
This code block reads the CSV files and creates a Spark DataFrame by automatically inferring the schema. Note how long the job takes to run. For us, it took...