Schema – structure of data
A schema is a description of the structure of your data and can be either implicit or explicit. Because DataFrames are internally based on RDDs, there are two main ways to convert an existing RDD into a dataset:
- Using reflection to infer the schema of the RDD
- Using a programmatic interface to build a schema and apply it to an existing RDD, converting the RDD into a dataset with that schema
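The two approaches can be sketched side by side. The following is a minimal illustration rather than a definitive implementation: the `StateRecord` case class, its field names, the object and method names, and the sample values are all hypothetical and made up for this example. It assumes Spark SQL is on the classpath and that the session runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type, used only for this illustration.
case class StateRecord(name: String, population: Int)

object RddToDatasetSketch {

  // Approach 1: reflection. The schema (column names and types) is
  // derived from the fields of the case class.
  def viaReflection(spark: SparkSession): Dataset[StateRecord] = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(
      Seq(StateRecord("StateA", 100), StateRecord("StateB", 200)))
    rdd.toDS()
  }

  // Approach 2: programmatic. A StructType is built by hand and applied
  // to an RDD of generic Row objects.
  def viaExplicitSchema(spark: SparkSession): DataFrame = {
    val rowRdd = spark.sparkContext.parallelize(
      Seq(Row("StateA", 100), Row("StateB", 200)))
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("population", IntegerType, nullable = false)))
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-dataset-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    viaReflection(spark).printSchema()
    viaExplicitSchema(spark).printSchema()

    spark.stop()
  }
}
```

Reflection is the shorter route when the record type is known at compile time. The programmatic route is useful when the columns are only known at runtime, for example when they are read from a configuration file.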
Implicit schema
Let's look at an example of loading a comma-separated values (CSV) file into a DataFrame. When a text file contains a header, the read API can take the column names from the header line, and it can optionally infer the column types from the data. We can also specify the separator used to split the lines of the text file.
We read the csv inferring the schema from the header line and use the comma (,) as the separator. We also show the use of the schema command and the printSchema command to verify the schema of the input file:
scala> val statesDF = spark.read.option...
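As a self-contained sketch of the same idea, the following writes a tiny made-up CSV file to a temporary location and reads it back. The file contents, the column names `Name` and `Value`, and the object and method names are assumptions for illustration and are not the input file used in the text. The `header`, `inferSchema`, and `sep` options are the standard CSV options of Spark's `DataFrameReader`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvSchemaSketch {

  // Reads a CSV file, taking column names from the header line,
  // inferring column types from the data, and splitting on commas.
  def readWithHeader(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // first line supplies the column names
      .option("inferSchema", "true") // scan the data to guess column types
      .option("sep", ",")            // field separator
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-schema-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    // Made-up sample data, written to a temporary file.
    val tmp = Files.createTempFile("sample", ".csv")
    Files.write(tmp, "Name,Value\nA,1\nB,2\n".getBytes(StandardCharsets.UTF_8))

    val df = readWithHeader(spark, tmp.toString)

    // schema returns the StructType object; printSchema prints it as a tree.
    println(df.schema)
    df.printSchema()

    spark.stop()
  }
}
```

Without the `inferSchema` option, every column is read as a string; the header only provides the names.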