Performing MapReduce with Jug
Jug is a distributed computing framework that uses tasks as its central units of parallelization. Jug can use either the filesystem or a Redis server as its backend. The Redis server was discussed in Chapter 8, Working with Databases. Install Jug with the following command:
$ pip3 install jug
MapReduce (see http://en.wikipedia.org/wiki/MapReduce) is a distributed algorithm used to process large datasets on a cluster of computers. The algorithm consists of a Map phase and a Reduce phase. During the Map phase, the data is split into parts, and each part is processed in parallel, for instance by filtering or by applying some other operation. In the Reduce phase, the results from the Map phase are aggregated, for instance, to produce a statistical report.
If we have a list of text files, we can compute word counts for each file. This can be done during the Map phase. At the end, we can combine individual word counts into a corpus word frequency dictionary. Jug has MapReduce functionality...
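The word-count idea described above can be sketched in plain Python, without Jug, to make the two phases concrete. This is only an illustration under stated assumptions: the documents are short in-memory strings standing in for text files, and the names docs, count_words, and merge_counts are hypothetical. It runs serially; a framework such as Jug would distribute the Map step across processes.

```python
from collections import Counter
from functools import reduce

# Hypothetical in-memory "files" (assumption: real code would read from disk)
docs = {
    'a.txt': 'the quick brown fox',
    'b.txt': 'the lazy dog',
    'c.txt': 'the fox and the dog',
}


def count_words(text):
    """Map step: compute the word counts of a single document."""
    return Counter(text.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


# Map phase: each document is processed independently of the others
per_file = [count_words(text) for text in docs.values()]

# Reduce phase: aggregate the partial results into a corpus-wide dictionary
corpus = reduce(merge_counts, per_file, Counter())

print(corpus.most_common(3))
```

Because count_words only looks at one document and merge_counts is associative, the Map calls can run in any order and the partial counts can be merged in any grouping, which is what makes the computation easy to parallelize.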