RDD is at the heart of every Spark application. Let's understand the meaning of each word in more detail:
- Resilient: The dictionary defines resilient as being able to recover quickly from difficult conditions.
A Spark RDD can recreate itself if something goes wrong. You might wonder why it needs to. Recall how HDFS and other data stores achieve fault tolerance: they maintain replicas of the data on multiple machines so that they can recover from a failure. But, as discussed in Chapter 1, Introduction to Apache Spark, Spark is not a data store; it is an execution engine. It reads data from source systems, transforms it, and loads it into the target system. If something goes wrong during any of these steps, the data being processed at that point is lost. To provide...