Caching and checkpointing are two important features of Spark. Used appropriately, they can significantly improve the performance of your Spark jobs.
Caching and checkpointing
Caching
Caching data in memory is one of the main features of Spark. You can cache large datasets in memory or on disk, depending on your cluster hardware. There are two common scenarios in which you should consider caching your data (see the sketch after this list):
- Use the same RDD multiple times
- Avoid recomputation of an RDD that involves heavy computation, such as join() and groupByKey()
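As a brief illustration of the in-memory versus on-disk choice, persist() accepts an explicit storage level, while cache() is shorthand for the memory-only default. This is a minimal sketch; the application name and RDD contents are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("StorageLevels").setMaster("local[*]"))

val numbers = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
numbers.cache()

// For datasets that may not fit in memory, choose a level that can spill to disk,
// for example: someOtherRdd.persist(StorageLevel.MEMORY_AND_DISK)
// Note that an RDD's storage level cannot be changed once assigned.
```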
If you want to run multiple actions on an RDD, it is a good idea to cache it in memory so that recomputation of the RDD can be avoided. For example, the following code first takes out a few elements of an RDD and then counts all of them, with the second action served from the cache.
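A minimal, self-contained sketch of this pattern, assuming a word-count RDD built from a text file (the input path and all names here are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachingExample").setMaster("local[*]"))

    // A moderately expensive lineage: read, tokenize, and aggregate word counts.
    val wordCounts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()                                  // mark the RDD for in-memory caching

    // First action: materializes (and caches) the RDD, then returns a few elements.
    wordCounts.take(5).foreach(println)

    // Second action: reuses the cached partitions instead of re-running the lineage.
    println(s"Distinct words: ${wordCounts.count()}")

    sc.stop()
  }
}
```

Note that cache() is lazy: nothing is stored until the first action forces the RDD to be computed, after which subsequent actions read from the cached partitions.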