Spark word count
Now that we have seen some of the functionality, let's explore further. We can use a script similar to the following to count the occurrences of each word in a file:
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

# load in the file
text_file = sc.textFile("Spark File Words.ipynb")

# split file into distinct words and count each one
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# print out words found
for x in counts.collect():
    print(x)
We have the same preamble as before: we create a SparkContext if one does not already exist. Then we point Spark at the text file; textFile() is lazy, so nothing is actually read until an action runs.
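Before building the full pipeline, it can help to confirm that the file path resolved. The following is a minimal sketch, assuming the sc and text_file variables from the script above; count() and take() are actions, so they force Spark to actually read the file:

# number of lines in the file
print(text_file.count())

# first three lines, returned to the driver as a list of strings
print(text_file.take(3))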
Once the file is loaded, we split each line into words and use a lambda function to tally each occurrence of a word. The code really does create a new record for every single occurrence, such as an ("at", 1) record each time the word at appears. The idea is that this process could be split over multiple processors, where each processor generates these low-level pieces of information. We are not concerned with optimizing this process here.
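If you want to see those low-level records before they are combined, you can stop the pipeline after the map step and pull back a small sample. This is a sketch under the same assumptions as the script above (sc and text_file already defined); take() limits how many records come back to the driver:

# build just the per-occurrence records, without reducing them
pairs = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1))

# inspect a few raw (word, 1) records
for pair in pairs.take(5):
    print(pair)

Because the a + b merge in reduceByKey is associative, Spark can sum the pairs for each word independently on every partition and then combine the partial sums, which is exactly how the work spreads across multiple processors.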