Data abstractions in Apache Spark
The MapReduce framework and its popular open source implementation, Hadoop, enjoyed widespread adoption over the past decade. However, they do not support iterative algorithms and interactive ad-hoc querying well: any data sharing between jobs or stages within an algorithm happens through disk writes and reads rather than in-memory sharing. The logical next step, then, is a mechanism that facilitates the reuse of intermediate results across multiple jobs. The RDD is a general-purpose data abstraction that was developed to address this requirement.
The RDD (Resilient Distributed Dataset) is the core abstraction in Apache Spark. It is an immutable, fault-tolerant, distributed collection of statically typed objects that are usually stored in memory. The RDD API offers simple operations such as map, reduce, and filter that can be composed in arbitrary ways.
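As a minimal sketch of how these operations compose, the following Scala snippet (assuming a SparkContext named sc is already available, as it is in spark-shell) filters, maps, and reduces a small RDD of integers:

```scala
// Assumes `sc` is an existing SparkContext (e.g., provided by spark-shell).
val numbers = sc.parallelize(1 to 10)      // distribute a local collection as an RDD[Int]

val evens   = numbers.filter(_ % 2 == 0)   // keep only even values
val squares = evens.map(n => n * n)        // transform each element
val total   = squares.reduce(_ + _)        // aggregate to a single value on the driver

println(total)                             // 220 = 4 + 16 + 36 + 64 + 100
```

Each step returns a new immutable RDD, so the transformations chain naturally until an action such as reduce triggers the actual computation.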
The DataFrame abstraction is built on top of RDD, and it adds "named" columns. So, a Spark DataFrame has rows of named columns, similar to relational database tables and DataFrames in R and Python (pandas).
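The sketch below illustrates the idea of named columns, assuming an existing SparkSession named spark (as in spark-shell); the column names and data are purely illustrative:

```scala
// Assumes `spark` is an existing SparkSession (e.g., provided by spark-shell).
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

people.printSchema()               // root
                                   //  |-- name: string (nullable = true)
                                   //  |-- age: integer (nullable = false)
people.filter($"age" > 40).show()  // rows are selected by the named column "age"
```

Because columns carry names (and types), queries can be expressed declaratively against them rather than by positional access into tuples, which is what the underlying RDD would require.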