The DataFrame API and the Spark SQL API
Spark provides several higher-level APIs built on top of the core RDD API (Spark's native, low-level programming interface) to make it easier to develop distributed data processing applications. The two most popular of these are the DataFrame API and the Spark SQL API.
The DataFrame API provides a domain-specific language for manipulating distributed datasets organized into named columns. Conceptually, a Spark DataFrame is equivalent to a table in a relational database or a DataFrame in Python's pandas library, but with richer optimizations under the hood. The DataFrame API lets users express data processing in domain-specific terms such as grouping and joining instead of thinking in map and reduce operations.
The Spark SQL API builds on the DataFrame API by exposing Spark SQL, a Spark module for structured data processing. Spark SQL allows users to run SQL queries against DataFrames to filter or aggregate data. The SQL queries get...