Caching enables Spark to persist data across computations and operations. It is one of the most important techniques in Spark for speeding up computations, particularly iterative ones.
Caching works by keeping as much of the RDD in memory as possible. When memory runs low, previously cached data is evicted according to a least-recently-used (LRU) policy. If the data being cached is larger than the available memory, performance suffers, because the overflow is either spilled to disk or recomputed on demand, depending on the storage level chosen.
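To make this concrete, here is a minimal sketch of caching in an iterative workload. It assumes a spark-shell session where `sc` (a SparkContext) is already in scope, and `data.txt` is a placeholder for any input file:

```scala
// "data.txt" is a placeholder for any input file.
val words = sc.textFile("data.txt").flatMap(_.split("\\s+"))

// Mark the RDD as cached. Nothing is materialized yet: the cache
// is populated lazily, by the first action below.
words.cache()

// Each subsequent pass reuses the cached partitions instead of
// re-reading and re-splitting the file from scratch.
for (_ <- 1 to 5) {
  println(words.count())
}
```

Without the cache() call, every count() would recompute the RDD's full lineage, which is exactly the cost caching is meant to avoid in iterative jobs.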
You can mark an RDD for caching with either persist() or cache().
cache() is simply a synonym for persist(MEMORY_ONLY), as the short sketch below illustrates.
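A quick sketch of the equivalence, again assuming `sc` is in scope as in spark-shell:

```scala
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)

// The two calls below are interchangeable: cache() is shorthand
// for persist(StorageLevel.MEMORY_ONLY).
nums.cache()
nums.persist(StorageLevel.MEMORY_ONLY) // same level already set, so effectively a no-op
```

Note that attempting to persist() the same RDD with a *different* storage level after one has been assigned raises an error, so choose the level before the first action runs.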
persist() can use memory, disk, or both:

```scala
persist(newLevel: StorageLevel)
```
The following are the possible StorageLevel values:
| Storage Level | Meaning |
|---|---|
| MEMORY_ONLY | Stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed on the fly each time they are needed. This is the default level. |
| MEMORY_AND_DISK | Stores the RDD as deserialized Java objects in the JVM. Partitions that do not fit in memory are spilled to disk and read from there when needed. |
| MEMORY_ONLY_SER | Stores the RDD as serialized Java objects (one byte array per partition). Generally more space-efficient than deserialized objects, but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER | Like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed on the fly. |
| DISK_ONLY | Stores the RDD partitions only on disk. |
| MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the corresponding levels above, but each partition is replicated on two cluster nodes. |
| OFF_HEAP | Like MEMORY_ONLY_SER, but stores the data in off-heap memory; requires off-heap memory to be enabled. |
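As an illustration of choosing one of these levels, here is a hedged sketch using MEMORY_AND_DISK, again assuming a spark-shell-style `sc`; the file path and the filter condition are placeholders:

```scala
import org.apache.spark.storage.StorageLevel

// "app.log" is a placeholder path.
val logs   = sc.textFile("app.log")
val errors = logs.filter(_.contains("ERROR"))

// MEMORY_AND_DISK keeps partitions in memory when possible and
// spills the rest to local disk rather than recomputing them.
errors.persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count()) // first action materializes the cache
println(errors.count()) // served from memory/disk, no recomputation

// Release the storage once the RDD is no longer needed.
errors.unpersist()
```

Calling unpersist() when a cached RDD is no longer needed frees memory for other datasets and reduces LRU eviction pressure.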