Resilient distributed datasets (RDD)
In this section, we will discuss the architecture, motivation, features, and other important concepts related to RDDs. We will also briefly cover the implementation methodology adopted by Spark and the various APIs/functions exposed by RDDs.
Frameworks such as Hadoop and MapReduce are widely adopted for parallel and distributed data processing. There is no doubt that these frameworks introduced a new paradigm for distributed data processing, and in a fault-tolerant manner (without losing a single byte). However, these frameworks do have some limitations; for example, Hadoop is not well suited to problem statements that require iterative data processing, such as recursive functions or machine learning algorithms, because in these kinds of use cases the data needs to stay in memory between computations.
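To make the in-memory, iterative point concrete, the following is a minimal Scala sketch (the object name and the toy computation are illustrative, not from the source): it caches an RDD once and then runs several passes over it, with each pass reading the cached partitions from memory rather than re-reading the data from disk, as a MapReduce-style job would have to.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of an iterative computation over a cached RDD.
object IterativeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IterativeExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // cache() keeps the dataset in memory so every iteration reuses it,
    // instead of reloading it from stable storage on each pass.
    val points = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // Each pass scans the in-memory partitions of the cached RDD.
      estimate = points.map(x => x / 2).reduce(_ + _) / points.count()
    }

    println(s"Result after 10 iterations: $estimate")
    sc.stop()
  }
}
```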
For such scenarios, a new paradigm, the RDD, was introduced, which retains all the features of Hadoop-like systems, such as distributed processing, fault tolerance...