Manipulating your RDD in Python
Spark's Python API is more limited than its Java and Scala APIs, but it supports most of the core functionality.
The hallmark of a MapReduce system lies in two commands: map and reduce. You've seen the map function used in the earlier chapters. The map function takes a function that is applied to each individual element in the input RDD and produces a new output element. For example, to produce a new RDD where you have added one to every number, you would use rdd.map(lambda x: x+1). It's important to understand that the map function and the other Spark functions do not transform the existing elements; instead, they return a new RDD with new elements. The reduce function takes a function that operates on pairs of elements to combine all of the data into a single value, which is returned to the calling program. If you were to sum all the elements, you would use rdd.reduce(lambda x, y: x+y). The flatMap function is a useful utility function that allows you to write a function that returns an...
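The map and reduce calls described above can be tried out without a cluster. What follows is only a local sketch, assuming a plain Python list as a stand-in for the RDD and using the built-in map and functools.reduce rather than the Spark API; the variable names are made up for illustration. It mirrors the two lambdas from the text and shows that the input is left unchanged.

```python
from functools import reduce

# Assumption: a plain list stands in for the RDD; nothing here is distributed.
numbers = [1, 2, 3, 4]

# Local analogue of rdd.map(lambda x: x+1).
# A new collection is built; the original elements are not modified.
plus_one = list(map(lambda x: x + 1, numbers))

# Local analogue of rdd.reduce(lambda x, y: x+y).
# The two-argument function is applied pairwise until one value remains.
total = reduce(lambda x, y: x + y, numbers)

print(numbers)   # [1, 2, 3, 4]
print(plus_one)  # [2, 3, 4, 5]
print(total)     # 10
```

With a real RDD, the same lambdas would be passed to rdd.map and rdd.reduce as shown in the text. Because Spark combines elements across partitions in no fixed order, the function given to reduce should be commutative and associative, as addition is here.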