Operations on RDD
Two major operation types can be performed on an RDD. They are called:
- Transformations
- Actions
Transformations
Transformations are operations that create a new dataset from an existing one; because RDDs are immutable, a transformation never modifies its input. They convert data from one form to another, which can amplify the data, reduce it, or give it a different shape altogether. Transformations do not return a value to the driver program. Instead, they are evaluated lazily, which is one of the main benefits of Spark.
An example of a transformation is the map function, which applies a given function to each element of the RDD and returns a new RDD containing the results.
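The idea can be illustrated without a Spark cluster. The following is only an analogy in plain Python, not Spark code: Python's built-in map() is also lazy, so it can stand in for a transformation. The names double, calls, and source are made up for this sketch, and a tuple is assumed as a stand-in for an immutable dataset.

```python
# Plain-Python analogy, not Spark's API: the built-in map() is lazy as well.
calls = []  # records each time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


source = (1, 2, 3)             # a tuple stands in for an immutable dataset
doubled = map(double, source)  # describes the work; nothing is computed yet

calls_before = list(calls)     # snapshot taken before any result is requested
result = list(doubled)         # asking for the results runs double() per element
```

Here calls_before is empty because defining the mapping computed nothing, result holds the doubled values, and source is left unchanged.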
Actions
Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy: Spark remembers the transformations carried out on an RDD and applies them in an optimized fashion...
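The split between recording transformations and running them on an action can be sketched with a toy class. This is a minimal illustration in plain Python under stated assumptions: ToyRDD and its methods are hypothetical, they only borrow the names map, filter, collect, and count, and the class makes no attempt to model Spark's distribution or optimization.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable data plus recorded steps."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)
        self._steps = tuple(steps)

    # Transformations: return a new ToyRDD with one more recorded step.
    def map(self, fn):
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the data as lazy iterators.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: evaluate the recorded steps and return a plain value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


seen = []  # records which elements square() has actually processed


def square(x):
    seen.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)

seen_before_action = list(seen)   # snapshot before any action is called
odd_squares = pipeline.collect()  # action: the recorded steps run now
how_many = pipeline.count()       # action: returns a single number
```

In this sketch seen_before_action is empty, since building the pipeline ran nothing; only the calls to collect and count produce values for the caller, which plays the role of the driver program.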