What is the Dataset API?
Spark introduced the Dataset API in Spark 1.6 as an extension of the DataFrame API, representing a strongly-typed, immutable collection of objects mapped to a relational schema. The Dataset API was designed to take advantage of the Catalyst optimiser by exposing expressions and data fields to the query planner. Datasets also bring compile-time type safety, which means you can check your production applications for errors before they run, something the DataFrame API cannot guarantee.
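The snippet below is a minimal sketch of a typed Dataset in Scala; the Person case class, the sample records, and the local SparkSession are illustrative assumptions rather than part of any particular application.

```scala
import org.apache.spark.sql.SparkSession

// Case classes are defined at the top level so Spark can derive encoders for them.
case class Person(name: String, age: Long)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A strongly-typed Dataset: each element is a Person backed by a relational schema.
    val people = Seq(Person("Ann", 34), Person("Bob", 29)).toDS()

    // Typed transformations: a misspelled field or a wrong type is rejected by the
    // compiler, whereas an equivalent untyped DataFrame expression fails only at runtime.
    val adults = people.filter(p => p.age >= 30).map(_.name)

    adults.show()
    spark.stop()
  }
}
```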
One of the major benefits of the Dataset API is reduced memory usage: because the Spark framework understands the structure of the data in a Dataset, it can lay the data out optimally in memory when caching Datasets. Tests have shown that the Dataset API can use about 4.5x less memory than the same data represented as an RDD.
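As a rough illustration, the following sketch caches the same records once as an RDD and once as a Dataset; it assumes the Person case class and SparkSession from the previous example, and the actual savings depend on the data.

```scala
import org.apache.spark.storage.StorageLevel
import spark.implicits._   // assumes `spark` from the previous example is in scope

val records = (1L to 100000L).map(i => Person(s"user-$i", i % 90))

// RDD cache: elements are stored as deserialized JVM objects.
val asRdd = spark.sparkContext.parallelize(records)
asRdd.persist(StorageLevel.MEMORY_ONLY)
asRdd.count()   // materialise the cache

// Dataset cache: encoders keep rows in a compact columnar layout,
// which is where the reported memory reduction comes from.
val asDs = records.toDS()
asDs.persist(StorageLevel.MEMORY_ONLY)
asDs.count()    // materialise the cache

// The cached size of each can be compared on the Storage tab of the Spark UI.
```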
Figure 4.1 shows the analysis errors reported by Spark across its APIs for a distributed job, with SQL at one end of the spectrum and Datasets at the other.