Spark 2.x – advent of data frames and datasets
With Spark 2.x we have two new spark computational abstractions:
- Data frames: These are distributed, resilient, fault tolerant in-memory data structures that are capable of handling only structured data, which means they are designed to manage data that can be segregated in fixed typed columns. Though it may sound like a limitation with respect to RDD, which can handle any type of unstructured data, in practical terms this structured abstraction over the data makes it very easy to manipulate and work over a large volume of structured data, the way we used to with RDBMS.
- Datasets: It's an extension of the Spark data frame. It's a type safe object-oriented interface. For the sake of simplicity, one could say that data frames are actually an un-typed dataset. This newest API in spark pragmatic abstraction actually leverages features of tungsten in-memory encoding and catalysts optimizer.