RDD persistence and cache
Spark jobs usually contain multiple intermediate RDDs on which more than one action can be called to compute different results. However, each time an action is called, the complete DAG for that action is executed again; this not only increases the computing time but is also wasteful as far as CPU and other resources are concerned. To overcome this limitation of re-computing the entire job, Spark provides two options for persisting an intermediate RDD: cache() and persist().
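To make this concrete, the following is a minimal Scala sketch (spark-shell style, with a hypothetical input path and variable names) of caching an intermediate RDD so that a second action reuses the cached partitions instead of re-running the full DAG:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup; in spark-shell, sc is already available.
val conf = new SparkConf().setAppName("CacheExample").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.textFile("input.txt")   // hypothetical input file
val words = lines.flatMap(_.split(" "))

// Persist the intermediate RDD unserialized in memory.
words.cache()

println(words.count())             // first action: executes the DAG and fills the cache
println(words.distinct().count())  // second action: reuses the cached partitions
```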
The cache() method persists the data unserialized in memory by default. This is possibly the fastest way to retrieve the persisted data; however, using cache() comes with some trade-offs. Each node computing a partition of the RDD persists the result on that node itself, and hence, in case of a node failure, the data of that RDD partition is lost. It is then recomputed, but some computation time is lost in the process. Similarly, the persisted data is also unserialized...
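As a sketch of the alternative, persist() accepts an explicit StorageLevel, which allows the same data to be stored serialized (trading CPU time for memory) or replicated across nodes (surviving a single node failure). The storage levels below are standard Spark API; the words RDD is the hypothetical one from the previous sketch:

```scala
import org.apache.spark.storage.StorageLevel

// An RDD's storage level can only be assigned once, so drop the
// earlier cache() level before choosing a different one.
words.unpersist()

// MEMORY_ONLY_SER keeps partitions as serialized bytes: more compact,
// but every access pays a deserialization cost.
words.persist(StorageLevel.MEMORY_ONLY_SER)

// Other common levels (pick one per RDD):
//   StorageLevel.MEMORY_AND_DISK  - spill to disk when memory is short
//   StorageLevel.MEMORY_ONLY_2    - replicate on two nodes to survive one failure

println(words.count())   // recomputes once, then serves from the serialized cache
```

Note that cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY).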