Obtaining a sorted word count from a big-text source
Now that we have a word count, a more interesting use is to sort the words by occurrence to find the ones used most often.
How to do it...
We can slightly modify the previous script to produce a sorted list as follows:
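import pyspark

# Reuse the notebook's SparkContext if one already exists
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the notebook file as lines of text
text_file = sc.textFile("B09656_09_word_count.ipynb")

# Split lines into words, count each word, then sort the pairs by key
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey()

# Bring the results back to the driver and print each (word, count) pair
for x in sorted_counts.collect():
    print(x)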
Producing the output as follows:
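The list continues for every word found. Note that sortByKey() orders the pairs by their key, which in a (word, count) pair is the word, in ascending order by default; on its own it does not order the list by number of occurrences. The word breaks are also crude: they come from line.split(" ") in our script rather than from Spark, so punctuation and the JSON markup of the notebook file stay attached to the words.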
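To make the difference between the two orderings concrete, here is a minimal sketch in plain Python (no Spark required) that mimics them on a handful of made-up (word, count) pairs. The sample data and the helper names sort_by_key and sort_by_count_desc are illustrative assumptions, not part of the recipe. In PySpark, the first ordering corresponds to sortByKey(), and ordering by count would be along the lines of sortBy(lambda pair: pair[1], ascending=False).

```python
# Hypothetical sample data standing in for the (word, count) pairs
# that reduceByKey() would produce; not taken from the notebook file.
pairs = [("the", 5), ("spark", 2), ("count", 2), ("a", 7)]


def sort_by_key(pairs):
    """Order pairs by their first element (the word), ascending.

    This mirrors what sortByKey() does to (word, count) pairs.
    """
    return sorted(pairs, key=lambda pair: pair[0])


def sort_by_count_desc(pairs):
    """Order pairs by count, highest first, then alphabetically by word."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print("Sorted by key (word):")
    for pair in sort_by_key(pairs):
        print(pair)

    print("Sorted by count, descending:")
    for pair in sort_by_count_desc(pairs):
        print(pair)
```

The tie-break on the word is a design choice made here so that words with equal counts appear in a predictable order; it is not something the recipe's script specifies.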
How it works...
The code is exactly the same as in the previous example, except for the final step of the chain, .sortByKey(). Our key, by default, is the first element of each pair, which here is the word rather than the count (as that is what we...