Sorted word count
Using the same script with a slight modification, we can add one more call and have the results sorted. The script now looks like this:
import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()
text_file = sc.textFile("Spark File Words.ipynb")
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey()
for x in sorted_counts.collect():
    print(x)
Here, we have added another function call, sortByKey(), to the RDD chain. So, after the map/reduce steps have produced the list of words and their occurrence counts, we can easily sort the results by word.
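For readers who want to see the logic of this pipeline without a Spark environment, the same flatMap/map/reduceByKey/sortByKey sequence can be sketched in plain Python (the function name word_count_sorted and the sample lines are illustrative, not part of the Spark script):

```python
from collections import Counter

def word_count_sorted(lines):
    # flatMap: split each line into individual words
    words = [w for line in lines for w in line.split(" ") if w]
    # map + reduceByKey: tally how many times each word occurs
    counts = Counter(words)
    # sortByKey: order the (word, count) pairs alphabetically by word
    return sorted(counts.items())

pairs = word_count_sorted(["to be or", "not to be"])
print(pairs)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Note that sortByKey() orders by the key of each pair, which here is the word itself, not the count.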
The resultant output looks like this: