Data modalities and Datasets/DataFrames/RDDs
Now let's tie the modalities together with the Spark abstractions and see how we can read and write data. Before 2.0.0, things were conceptually simpler: we only needed to read data into RDDs and use map() to transform the data as required. However, data wrangling was harder. With Datasets/DataFrames, we can read data directly into a table with headings, associate data types with domain semantics, and start working with the data more effectively.
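To make the contrast concrete, here is a minimal sketch of both styles. The file name people.csv and its columns are assumptions for illustration, not from the original text:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ReadStyles")
  .master("local[*]")
  .getOrCreate()

// Pre-2.0.0 style: read lines into an RDD and parse every record by
// hand with map(): no headings, no types, just strings.
val rdd = spark.sparkContext.textFile("people.csv")
val header = rdd.first()
val parsed = rdd.filter(_ != header)
  .map(_.split(","))
  .map(fields => (fields(0), fields(1)))

// 2.0.0+ style: read directly into a DataFrame with headings and
// inferred data types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")
df.printSchema()
```

The RDD path leaves the header row and the field types to us; the DataFrame path gets both from the reader options.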
As a general rule of thumb, perform the following steps:

- Use SparkContext and RDDs to handle unstructured data.
- Use SparkSession and Datasets/DataFrames for semi-structured and structured data.

As you will see in the later chapters, SparkSession unifies reading from various formats, such as .csv, .json, .parquet, .jdbc, .orc, and .text files; a sketch follows below. Moreover, there is a pluggable architecture, the DataSource API, for accessing any type of structured data.
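As an illustration of that unified entry point, the following sketch reads each of the formats listed above through spark.read. The file paths, database URL, table name, and user are hypothetical, and ORC may additionally require Hive support on older Spark 2.x versions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("UnifiedRead")
  .master("local[*]")
  .getOrCreate()

// One entry point, spark.read, covers every built-in format.
val csvDF     = spark.read.option("header", "true").csv("data/cars.csv")
val jsonDF    = spark.read.json("data/cars.json")
val parquetDF = spark.read.parquet("data/cars.parquet")
val orcDF     = spark.read.orc("data/cars.orc")
val textDF    = spark.read.text("data/notes.txt")

// JDBC sources go through the same reader via format()/load().
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "cars")
  .option("user", "spark")
  .load()
```

Each call returns a DataFrame, so downstream transformations look the same regardless of the source format.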