Obtaining a word count from a big-text data source
Although a single text file is not a big data source, we will first show how to get a word count from a text file. Then we will move on to a larger data file.
How to do it...
We can use this script to see the word counts for a file:
import pyspark

# Create a Spark context if one does not already exist (a Jupyter kernel
# may keep a context alive between runs).
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the source file as an RDD of lines.
text_file = sc.textFile("B09656_09_word_count.ipynb")

# Split each line into words, pair each word with a count of 1,
# then sum the counts for each word.
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Pull the results back to the driver and print each (word, count) pair.
for x in counts.collect():
    print(x)
When we run this in Jupyter, the output is a series of (word, count) tuples, and the display continues for every individual word that was detected in the source file.
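Because the output of reduceByKey is unordered, it can be helpful to sort the pairs by count before printing. The following is a minimal sketch, not part of the original recipe, that uses the RDD's takeOrdered method to show the ten most frequent words (the variable counts is the RDD built above):

# A sketch: show the ten most frequent words. takeOrdered sorts by the
# supplied key; negating the count gives descending order.
top_ten = counts.takeOrdered(10, key=lambda pair: -pair[1])
for word, count in top_ten:
    print(word, count)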
How it works...
We have a standard preamble to the coding. All Spark programs need a context to work with; the context is used to define the number of threads and similar settings. Here we are only using the defaults. It's important to note that Spark will automatically utilize the underlying multicore processors where they are available.
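If you need more control than the defaults provide, the context can be configured explicitly. This is a minimal sketch, assuming local execution; the master string "local[4]" and the application name "word_count" are illustrative choices, not values from the recipe:

import pyspark

# A sketch of explicit configuration: 'local[4]' runs Spark locally with
# four worker threads; 'local[*]' would use one thread per available core.
conf = pyspark.SparkConf() \
    .setMaster("local[4]") \
    .setAppName("word_count")

sc = pyspark.SparkContext(conf=conf)

Creating the context this way replaces the one-line default used earlier; only one SparkContext can be active per process, which is why the script first checks globals() before creating one.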