Implementing MapReduce to count word frequencies
MapReduce is a framework for efficient parallel algorithms built on divide and conquer. If a task can be split into smaller independent subtasks, and the results of those subtasks can be combined to form the final answer, then MapReduce is a natural fit for the job.
In the following figure, we can see that a large list is split up, and the mapper functions work in parallel on each split. Once all the mapping is complete, the second phase of the framework kicks in, reducing the intermediate results into one final answer.
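To make the shape of this pipeline concrete, here is a minimal Haskell sketch of the two phases, using parMap and rdeepseq from the parallel package's Control.Parallel.Strategies module. The helper names mapReduce and wordFreq are our own illustrative choices, not part of any library, and this is a sketch rather than the recipe's final implementation:

import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import qualified Data.Map as Map

-- Map phase: apply the mapper to every chunk in parallel.
-- Reduce phase: combine the per-chunk results with the reducer.
mapReduce :: NFData b => (a -> b) -> ([b] -> c) -> [a] -> c
mapReduce mapper reducer = reducer . parMap rdeepseq mapper

-- An example mapper: count word occurrences within a single chunk.
wordFreq :: String -> Map.Map String Int
wordFreq = Map.fromListWith (+) . map (\w -> (w, 1)) . words

main :: IO ()
main = print (mapReduce wordFreq (Map.unionsWith (+)) chunks)
  where chunks = ["hello world hello", "world of words"]

Compile with ghc -O2 -threaded and run with +RTS -N so the runtime can spread the map phase across all available cores.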
In this recipe, we will count word frequencies in a large corpus of text. Given many files of words, we will apply the MapReduce framework to compute these frequencies in parallel.
Getting ready
Install the parallel package using cabal as follows:
$ cabal install parallel
Create multiple files with words. In this recipe, we download a huge text file and split it up using the UNIX split command.
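The exact download URL and split parameters are placeholders here; one plausible pair of commands, assuming GNU split (the -d flag for numeric suffixes is GNU-specific), looks like this:

$ wget https://norvig.com/big.txt
$ split -d -l 1000 big.txt words_

This produces files named words_00, words_01, and so on, each holding 1,000 lines of the original corpus, ready to be fed to the mappers in parallel.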