A Resilient Distributed Dataset (RDD) is an immutable, distributed collection of objects. Spark RDDs are resilient or fault tolerant, which enables Spark to recover the RDD in the face of failures. Immutability makes the RDDs read-only once created. Transformations allow operations on the RDD to create a new RDD but the original RDD is never modified once created. This makes RDDs immune to race conditions and other synchronization problems.
The distributed nature of the RDDs works because an RDD only contains a reference to the data, whereas the actual data is contained within partitions across the nodes in the cluster.
Conceptually, a RDD is a distributed collection of elements spread out across multiple nodes in the cluster. We can simplify a RDD to better understand by thinking of a RDD as a large array of integers distributed across machines.
A RDD is...