Advanced transformations
As stated earlier in this book, if an RDD operation returns an RDD, then it is called a transformation. In Chapter 4, Understanding the Spark Programming Model, we learnt about commonly used useful transformations. Now we are going to look at some advanced level transformations.
mapPartitions
The working of this transformation is similar to map
transformation. However, instead of acting upon each element of the RDD, it acts upon each partition of the RDD. So, the map
function is executed once per RDD partition. Therefore, there will one-to-one mapping between partitions of the source RDD and the target RDD.
As a partition of an RDD is stored as a whole on a node, this transformation does not require shuffling.
In the following example, we will create an RDD of integers and increment all elements of the RDD by 1 using mapPartitions
:
JavaRDD<Integer> intRDD = jsc.parallelize(Arrays.asList(1,2,3,4,5,6,7,8,9,10),2);
Java 7:
intRDD.mapPartitions(new FlatMapFunction<...