Understanding the Spark DataFrame API
DataFrames are the most commonly used Spark API. A DataFrame is simply a Dataset whose elements are of type Row (that is, Dataset[Row]). The major difference between DataFrames and Datasets is that DataFrames are not strongly typed, so data types are not checked at compile time. Because of this, they are arguably easier to work with, as they do not require you to provide any structure when defining them.
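To make the difference concrete, here is a minimal sketch (the Person case class, the sample data, and the application name are all hypothetical) showing where each API catches a typo in a field name:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case class used only for this illustration
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("typed-vs-untyped").getOrCreate()
import spark.implicits._

// Dataset[Person]: field access is checked by the compiler
val typedDs: Dataset[Person] = Seq(Person("Ada", 36)).toDS()
// typedDs.map(_.agee)   // does not compile: value agee is not a member of Person

// DataFrame (Dataset[Row]): column names are only checked at runtime
val untypedDf: DataFrame = typedDs.toDF()
untypedDf.select("agee") // compiles, but throws an AnalysisException when run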
We create a DataFrame in much the same way that we created a Dataset:
val personDf: DataFrame = spark
  .read
  .format("parquet")
  .load(personDataLocation)
This is the output in the Spark console:
Figure 3.8 – DataFrame with our person data in the Spark console
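You can reproduce this kind of output yourself by printing the schema and a sample of rows; a minimal sketch, assuming the personDf defined above:

// Print the inferred schema and the first five rows
personDf.printSchema()
personDf.show(5, truncate = false)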
The main difference is that we are not required to specify a type when instantiating the DataFrame object or when calling spark.read.
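For comparison, the Dataset version would end the same read chain with an explicit type, presumably along these lines (a sketch assuming the Person case class from earlier; spark.implicits._ must be in scope to supply the encoder):

import spark.implicits._

// Dataset equivalent: the same read, plus an explicit element type
val personDs: Dataset[Person] = spark
  .read
  .format("parquet")
  .load(personDataLocation)
  .as[Person]

Now, let’s take a look at the Spark SQL module.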
Spark SQL
Spark SQL is another way to interact with...