Interoperating with RDDs
There are two different methods for converting existing RDDs to DataFrames (or Datasets[T]): inferring the schema using reflection, or programmatically specifying the schema. The former allows you to write more concise code (when your Spark application already knows the schema), while the latter allows you to construct DataFrames when the columns and their data types are only revealed at run time. Note, reflection is in reference to schema reflection as opposed to Python reflection
.
Inferring the schema using reflection
In the process of building the DataFrame and running the queries, we skipped over the fact that the schema for this DataFrame was automatically defined. Initially, row objects are constructed by passing a list of key/value pairs as **kwargs
to the row class. Then, Spark SQL converts this RDD of row objects into a DataFrame, where the keys are the columns and the data types are inferred by sampling the data.
Tip
The **kwargs
construct allows you to pass...