Since Spark excels at processing in-memory data, we will first remove our intermediary dataframes and then cache our out_sd dataframe so that subsequent queries run much faster. Caching works best when similar queries are run repeatedly against the same data: Spark can then keep the partitions you touch most often resident in memory.
However, this is not foolproof. Good Spark query and table design will do more for performance, but out-of-the-box caching usually provides some benefit. Keep in mind that caching in Spark is lazy: calling cache() only marks a dataframe for caching, and the data is actually materialized in memory the first time an action reads it. That is why the first query typically does not benefit, while subsequent queries run much faster.
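You can see this lazy behavior by timing the same query twice. The following is a minimal sketch, assuming the SparkR API (consistent with the cache() call used below); flights_sd is a hypothetical SparkDataFrame standing in for any dataframe you plan to reuse:

library(SparkR)

cache(flights_sd)               # only marks the dataframe for caching; nothing is read yet
system.time(count(flights_sd))  # first action scans the source data and fills the cache
system.time(count(flights_sd))  # repeated action reads from memory and is much faster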
Since we will no longer use the intermediary dataframes we created, we will remove them with the rm() function, and then call the cache() function on the full dataframe:
# Clean up the intermediary dataframes and cache the full one
rm(out_sd1)    # drop the reference to the first intermediary dataframe
rm(out_sd2)    # drop the reference to the second intermediary dataframe
cache(out_sd)  # mark out_sd for caching; materialized on the next action
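Note that rm() only removes the R-side reference; to free memory held by a cached dataframe, call unpersist(). The sketch below is illustrative, assuming the SparkR API: count() is an action that forces materialization, and persist() is an alternative to cache() that accepts an explicit storage level (the names follow Spark's StorageLevel constants):

count(out_sd)      # run an action so the cache marked above is materialized now
unpersist(out_sd)  # later, release the cached blocks once out_sd is no longer needed

# As an alternative to cache(), persist() takes an explicit storage level:
persist(out_sd, "MEMORY_AND_DISK")

Releasing caches you no longer need frees executor memory for other queries, which matters on a busy cluster.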