Understanding Spark transformations and actions
In this section, we will discuss the various transformation and action operations provided by the Spark RDD APIs. We will also discuss the different forms of the RDD APIs.
An RDD, or Resilient Distributed Dataset, is the core component of Spark. All operations for transforming the raw data are provided by the different RDD APIs. We discussed the RDD APIs and their features in the Resilient distributed datasets (RDD) section in Chapter 6, Getting Acquainted with Spark, but it is important to mention again that there is no API for accessing the raw dataset directly. The data in Spark can only be accessed through the operations exposed by the RDD APIs. RDDs are immutable datasets, so any transformation applied to the raw dataset generates a new RDD without modifying the dataset/RDD on which the transformation is invoked. Transformations on RDDs are lazy, which means that invoking a transformation does not apply it immediately...
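To make the ideas of immutability and laziness concrete, here is a minimal sketch in plain Python. It is a toy model only, not Spark's API or implementation: the class ToyRDD, its from_list helper, and the shared log list are hypothetical names introduced for illustration. The method names map, filter, and collect mirror real RDD operations, but everything here runs in a single process with no Spark dependency.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD (an assumption, NOT Spark code)."""

    def __init__(self, source: Callable[[], Iterable], log: List[str]):
        self._source = source  # zero-argument function that yields the elements
        self._log = log        # shared record of the work actually performed

    @staticmethod
    def from_list(items: list, log: List[str]) -> "ToyRDD":
        data = tuple(items)  # private snapshot; never exposed or mutated
        return ToyRDD(lambda: iter(data), log)

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: only describes the step and returns a NEW ToyRDD.
        def compute():
            for x in self._source():
                self._log.append(f"map({x})")
                yield fn(x)
        return ToyRDD(compute, self._log)

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy, and also leaves the parent untouched.
        def compute():
            for x in self._source():
                self._log.append(f"filter({x})")
                if pred(x):
                    yield x
        return ToyRDD(compute, self._log)

    def collect(self) -> list:
        # Action: the only point at which the chain is actually evaluated.
        return list(self._source())


log: List[str] = []
raw = ToyRDD.from_list([1, 2, 3, 4], log)
doubled = raw.map(lambda x: x * 2)        # nothing is computed yet
large = doubled.filter(lambda x: x > 4)   # still nothing is computed

work_before_action = len(log)  # no map/filter call has run so far
result = large.collect()       # the action triggers the whole chain
work_after_action = len(log)   # now every element has been processed
raw_unchanged = raw.collect()  # the original dataset is unaffected

print(work_before_action, result, work_after_action, raw_unchanged)
```

In this sketch, each transformation returns a new object and records no work until collect is called, which is the behavior the paragraph above attributes to RDD transformations. The parent dataset raw still yields its original elements afterwards, illustrating immutability.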