Resilient Distributed Datasets (RDDs) are collections of immutable JVM objects that are distributed across an Apache Spark cluster. Please note that if you are new to Apache Spark, you may want to initially skip this chapter as Spark DataFrames/Datasets are both significantly easier to develop and typically have faster performance. More information on Spark DataFrames can be found in the next chapter.
An RDD is the most fundamental dataset type of Apache Spark; any action on a Spark DataFrame eventually gets translated into a highly optimized execution of transformations and actions on RDDs (see the paragraph on catalyst optimizer in Chapter 3, Abstracting Data with DataFrames, in the Introduction section).Â
Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, there are able...