Spark abstractions
The goal of this book is to give you a solid understanding of Spark through hands-on programming, and the best way to build that understanding is to work through operations iteratively. Since we are still in the early chapters, some details may not yet be fully clear, but they should be clear enough for the current context; as you write code and read further chapters, you will gather more information and insight. With this in mind, let's move to a quick discussion of Spark's abstractions, which we will revisit in more detail in the following chapters.
The main features of Apache Spark are distributed data representation and distributed computation, which together allow data operations to scale massively. Spark's primary abstraction for representing data is the Resilient Distributed Dataset (RDD), which makes it easy to run operations on the data in parallel. Until Spark 2.0.0, RDDs were the primary programming interface. However, they are low-level, raw structures: Spark cannot see inside the functions applied to them, so it has little opportunity to optimize them automatically for performance and scalability.
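To make this concrete, here is a minimal sketch of RDD-based parallel computation in Scala. The application name, the local[*] master setting, and the sample data are illustrative assumptions for a local run, not values taken from this book.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration for running locally on all cores.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize() distributes a local collection across the cluster
    // as an RDD, whose partitions can be processed in parallel.
    val numbers = sc.parallelize(1 to 100)

    // map() and reduce() execute on each partition in parallel;
    // Spark sees these functions only as opaque closures.
    val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}
```

Note that because the lambdas passed to map() and reduce() are arbitrary code, Spark must run them as-is; this opacity is exactly what the higher-level abstractions discussed next are designed to overcome.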
This is where Datasets and DataFrames come into the picture.
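As a preview of that higher-level API, here is a minimal DataFrame sketch. Again, the application name and the sample data are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for the DataFrame/Dataset API.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // toDF() attaches column names, giving Spark a schema it can
    // reason about (unlike the opaque closures of the RDD API).
    val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")

    // Declarative operations like filter() on columns are rewritten
    // and optimized by Spark before execution.
    people.filter($"age" > 30).show()

    spark.stop()
  }
}
```

Because the filter is expressed over named columns rather than arbitrary code, Spark can inspect and optimize the whole query plan, which is the key advantage of DataFrames and Datasets over raw RDDs.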