Persisting RDDs
This recipe shows how to persist an RDD. RDDs are lazily evaluated, and it is sometimes necessary to reuse the same RDD multiple times. By default, Spark recomputes the RDD and all of its dependencies each time we call an action on it. This is expensive for iterative algorithms, which need the computed dataset many times. To avoid computing an RDD more than once, Spark provides a mechanism for persisting the data in an RDD.
After an action computes the RDD's contents for the first time, they can be stored in memory or on disk across the cluster. The next time an action depends on the RDD, it need not be recomputed from its dependencies.
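The effect of recomputation can be observed with a small sketch like the following. It is illustrative only: the object name PersistCounter and its evaluations method are hypothetical, and it assumes Spark 2.0 or later (for longAccumulator). It counts how often the map function runs across two count actions, with and without cache().

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, not part of Spark: counts map-function invocations.
object PersistCounter {
  // Runs two count() actions on a mapped RDD and returns how many times
  // the map function was invoked in total.
  def evaluations(sc: SparkContext, cacheIt: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val mapped = sc.parallelize(1 to 5).map { x => calls.add(1); x * 2 }
    if (cacheIt) mapped.cache()
    mapped.count() // first action: always computes the RDD
    mapped.count() // second action: recomputes unless the RDD is cached
    mapped.unpersist()
    // Note: accumulator updates made inside transformations can be repeated
    // if a task is retried, so treat this count as illustrative.
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is used here purely for demonstration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-demo")
    val sc = SparkContext.getOrCreate(conf)
    println(s"without cache: ${evaluations(sc, cacheIt = false)}")
    println(s"with cache:    ${evaluations(sc, cacheIt = true)}")
    sc.stop()
  }
}
```

In the Spark shell, PersistCounter.evaluations(sc, cacheIt = true) can be called directly with the existing sc. Without caching, the map function should run once per element for each action; with caching, the second action should read the stored partitions instead.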
Getting ready
To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos.
How to do it…
Let's see how to persist RDDs using the following code:
import org.apache.spark.storage.StorageLevel

// Build (word, count) and (word, length) pair RDDs from comma-separated records
val inputRdd = sc.parallelize(Array("this,is,a,ball", "it,is,a,cat", "julie,is,in,the,church"))
val wordsRdd = inputRdd.flatMap(record => record.split(","))
val wordLengthPairs = wordsRdd.map(word => (word, word.length))
val wordPairs = wordsRdd.map(word => (word, 1))
val reducedWordCountRdd = wordPairs.reduceByKey((x, y) => x + y)
val filteredWordLengthPairs = wordLengthPairs.filter { case (word, length) => length >= 3 }

// Mark the RDDs for persistence; nothing is computed until an action runs
reducedWordCountRdd.cache()
val joinedRdd = reducedWordCountRdd.join(filteredWordLengthPairs)
joinedRdd.persist(StorageLevel.MEMORY_AND_DISK)

// The first action on each RDD computes and stores it; later actions reuse it
val wordPairsCount = reducedWordCountRdd.count
val wordPairsCollection = reducedWordCountRdd.take(10)
val joinedRddCount = joinedRdd.count
val joinedPairs = joinedRdd.collect()

// Release the cached data once it is no longer needed
reducedWordCountRdd.unpersist()
joinedRdd.unpersist()
How it works…
The call to cache() on reducedWordCountRdd indicates that the RDD should be kept in memory once it has been computed. The count action computes it for the first time. When the take action is invoked, it reads the cached elements of the RDD instead of recomputing them from the dependencies.
Spark defines levels of persistence, or StorageLevel values, for persisting RDDs. rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). In the preceding example, joinedRdd is persisted with the MEMORY_AND_DISK storage level, which keeps partitions in memory and spills those that do not fit to disk. It is good practice to unpersist the RDD once it is no longer needed, which lets us manually remove it from the cache.
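The relationship between cache(), persist(), and unpersist() can be checked by inspecting an RDD's storage level. The following is a minimal sketch; the object name StorageLevelCheck and its levels method are hypothetical, while getStorageLevel, cache(), and unpersist() are standard RDD methods.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark: reports an RDD's storage level
// before caching, after cache(), and after unpersist().
object StorageLevelCheck {
  def levels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    val before = rdd.getStorageLevel // no level assigned yet
    rdd.cache()                      // same as persist(StorageLevel.MEMORY_ONLY)
    val cached = rdd.getStorageLevel
    rdd.unpersist()                  // drops cached blocks and resets the level
    val after = rdd.getStorageLevel
    (before, cached, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is used here purely for demonstration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("storage-level-check")
    val sc = SparkContext.getOrCreate(conf)
    val (before, cached, after) = levels(sc)
    println(s"before: $before, cached: $cached, after: $after")
    sc.stop()
  }
}
```

Note that cache() and persist() only assign the storage level; the data is stored when the first action runs. Spark also does not allow changing the level of an RDD that already has one, so call unpersist() before persisting it at a different level.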
There's more…
Spark defines various levels of persistence, such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_AND_DISK_2, and so on. Deciding when to cache or persist data can be an art, and the decision typically involves trade-offs between space and speed. If you try to cache more data than fits in memory, Spark evicts old partitions using a least recently used (LRU) policy. With the memory-only levels, evicted partitions are recomputed the next time they are needed; with the memory-and-disk levels, they are written to disk instead. In general, RDDs should be persisted when they are likely to be referenced by multiple actions and are expensive to regenerate.
See also
Please refer to http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence to gain a detailed understanding of persistence in Spark.