Another MapReduce example
We can use MapReduce in another example where we get the word counts from a file. This is a standard problem, but we use MapReduce to do most of the heavy lifting. The source code for this example is available as well. We can use a script similar to the following to count the word occurrences in a file:
import pyspark

# Create a SparkContext if one does not already exist.
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Load the text file (here, the notebook's own source) as an RDD of lines.
text_file = sc.textFile("Spark File Words.ipynb")

# Split each line into words, emit a (word, 1) record per occurrence,
# and sum the counts for each word.
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

for x in counts.collect():
    print(x)
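The collected pairs come back in no particular order. If we want to see the most frequent words first, a small follow-up step can sort by count; this is just a sketch using the standard RDD takeOrdered action, and the cutoff of ten is arbitrary:

# Sketch: print the ten most frequent words (assumes the counts RDD from above).
for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print(word, count)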
Note
We have the same preamble to the code as before: import pyspark and create a SparkContext if one does not already exist.
Then we load the text file (in this case, the notebook's own source file).
Note
text_file is a Spark RDD (Resilient Distributed Dataset), not a data frame. It is assumed to be massive, with its contents distributed over many handlers.
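As a quick sanity check, one could inspect the RDD directly. This is an illustrative sketch only; the partition count depends on the file size and the Spark configuration:

# Sketch: confirm the type and see how many partitions the contents are split into.
print(type(text_file))               # pyspark.rdd.RDD
print(text_file.getNumPartitions())  # number of partitions (varies with configuration)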
Once the file is loaded, we split each line into words and then use a lambda function to tick off each occurrence of a word. The code is truly creating a new record for each word occurrence, such as (word, 1); the reduceByKey step then adds up these records per word to produce the final counts.
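To make that record-by-record behavior concrete, here is a minimal sketch of the same pipeline run on a tiny in-memory sample (the input lines are made up for illustration):

# Illustrative sketch: the same flatMap/map/reduceByKey pipeline on a small sample.
sample = sc.parallelize(["to be or", "not to be"])
pairs = sample.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1))
# pairs now holds one record per occurrence: ('to', 1), ('be', 1), ('or', 1), ...
totals = pairs.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # for example: [('or', 1), ('not', 1), ('to', 2), ('be', 2)]

Because reduceByKey combines records within each partition before shuffling, the full list of (word, 1) pairs never has to be gathered in one place, which is what lets this approach scale to massive files.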