RDDs
Apache Spark’s RDD (Resilient Distributed Dataset) is the foundational abstraction underpinning distributed computing in the Spark framework. RDDs are Spark’s core data structure, enabling fault-tolerant, parallel operations on large-scale distributed datasets. RDDs are immutable: once created, they cannot be changed, so any operation produces a new RDD from an existing one. Each new RDD keeps a pointer to the RDD it was derived from, which is how Spark records the lineage of every transformation applied to an RDD. This lineage is what enables lazy evaluation in Spark: transformations are only recorded as a DAG of operations, and nothing executes until an action is called.
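A minimal sketch in Scala illustrates this, assuming a local Spark installation (the application name and the `local[*]` master are illustrative choices, not requirements). Each transformation returns a new RDD that points back to its parent, and `toDebugString` prints the lineage Spark has recorded:

```scala
import org.apache.spark.sql.SparkSession

object RddLineageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-demo")   // illustrative name
      .master("local[*]")        // assumes a local run
      .getOrCreate()
    val sc = spark.sparkContext

    // Each transformation yields a *new* immutable RDD with a pointer
    // to its parent; nothing is computed yet — Spark only records lineage.
    val numbers = sc.parallelize(1 to 10)     // base RDD
    val doubled = numbers.map(_ * 2)          // new RDD, parent: numbers
    val evens   = doubled.filter(_ % 4 == 0)  // new RDD, parent: doubled

    // Print the recorded lineage (the DAG of parent RDDs).
    println(evens.toDebugString)

    // Only an action such as collect() triggers actual execution.
    println(evens.collect().mkString(", "))

    spark.stop()
  }
}
```

The printed lineage shows each RDD chained back through its parents to the original parallelized collection, which is exactly the record Spark replays when a partition is lost.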
This immutability and lineage give Spark the ability to reproduce any DataFrame in case of failure, making Spark fault-tolerant by design. Since RDDs are the lowest level of abstraction in Spark, all other datasets built on top of them share these properties. The high...