Understanding the difference between transformations and actions
When working with data in Spark with Scala, it’s helpful to understand how and when execution takes place on your Spark cluster. Spark is lazy by design, meaning that it doesn’t transform your data until it has to. This lets Spark group a batch of transformations together and apply optimizations that improve processing time.
Transformations are code statements that are lazily executed. Spark keeps a ledger of transformations until it encounters a code statement called an action, which tells Spark it’s time to execute all the recorded transformations. A transformation is code that returns an RDD (short for resilient distributed dataset), Dataset, or DataFrame. An action is code that returns some kind of value computed from the dataset you are processing, or that produces output such as printing rows or writing results to storage.
Examples of action functions are as follows:
count
show
write
head
take
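To make the distinction concrete, here is a minimal sketch that chains two transformations and then calls count, take, and show from the list above. The object name, the run helper, the sample data, the column names, and the local master setting are all assumptions made for illustration; they are not taken from the text.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical example object; the names here are illustrative only.
object LazyEvaluationSketch {

  // Builds a small DataFrame, applies transformations, then runs actions.
  // Returns the row count so the result can be checked.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    // Assumed sample data: six integers in a single column named "n".
    val numbers = Seq(1, 2, 3, 4, 5, 6).toDF("n")

    // Transformations: each call returns a new DataFrame and only
    // records a step in the plan. No job runs on the cluster yet.
    val evens = numbers.filter($"n" % 2 === 0)
    val doubled = evens.withColumn("doubled", $"n" * 2)

    // Actions: each call triggers execution of the recorded
    // transformations and returns a value or prints output.
    val rowCount: Long = doubled.count()
    val firstTwo: Array[Row] = doubled.take(2)
    doubled.show()

    println(s"Rows after filtering: $rowCount, rows taken: ${firstTwo.length}")
    rowCount
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a real cluster the master
    // is normally supplied by the submit command instead.
    val spark = SparkSession.builder()
      .appName("lazy-evaluation-sketch")
      .master("local[*]")
      .getOrCreate()
    try run(spark)
    finally spark.stop()
  }
}
```

Nothing is computed when filter and withColumn are called. Each of the three actions then triggers its own execution of the recorded plan, which is why results that several actions reuse are often cached.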
The following are...