Manipulating your RDD in Scala and Java
The RDD is the primary low-level abstraction in Spark. As we discussed in the last chapter, the main programming APIs are Datasets/DataFrames; however, underneath it all, the data is still represented as RDDs, so understanding and working with RDDs remains important. From a structural point of view, an RDD is simply a collection of elements that can be operated on in parallel.
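To make this concrete, here is a minimal sketch of creating an RDD and operating on its elements. It assumes a local Spark installation; the application name and master URL are illustrative placeholders, and a real cluster deployment would configure these differently:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; cluster configurations differ.
    val conf = new SparkConf().setAppName("RDDBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize distributes this local sequence across partitions,
    // yielding an RDD whose elements can be processed in parallel.
    val numbers = sc.parallelize(1 to 10)

    // map applies the function to every element, partition by partition.
    val doubled = numbers.map(_ * 2)
    println(doubled.collect().mkString(", "))

    sc.stop()
  }
}
```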
RDD stands for Resilient Distributed Dataset: the data is distributed over a set of machines, and the transformations that produced it are captured so that an RDD can be recreated in case of a machine failure or memory corruption. One important aspect of this distributed parallel data representation is that RDDs are immutable, which means that every operation you perform generates a new RDD.

Manipulating your RDD in Scala is quite simple, especially if you are familiar with Scala's collection library. Many of the standard functions are available directly on Spark's RDDs, with the primary catch being that they are evaluated lazily and return new RDDs rather than modifying the data in place.
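The following sketch illustrates both points, assuming `sc` is an existing `SparkContext` as in the previous example. The collection-style methods look just like their Scala counterparts, but each returns a new RDD recorded in the lineage rather than mutating anything:

```scala
val original = sc.parallelize(Seq("apple", "banana", "cherry"))

// Familiar Scala collection methods; each returns a *new* RDD.
// These are lazy transformations: nothing runs yet, Spark only
// records them in the lineage graph.
val upper = original.map(_.toUpperCase)
val short = upper.filter(_.length <= 5)

// collect() is an action: it triggers the recorded transformations
// and brings the results back to the driver.
println(short.collect().mkString(", "))   // APPLE

// The original RDD is untouched; immutability means the inputs
// survive every transformation.
println(original.count())                 // 3
```

Because the lineage records how `short` was derived from `original`, Spark can recompute any lost partitions of `short` from scratch after a failure, which is exactly what makes the dataset "resilient".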