Plugging into a MapReduce framework
For background on the Apache Hadoop project, see https://hadoop.apache.org. Here's the summary:
The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
One part of Hadoop's distributed processing is the MapReduce module. This module lets us decompose the analysis of data into two complementary operations: map and reduce. These operations are distributed around the Hadoop cluster so that they can run concurrently. A map operation processes the rows of the datasets scattered around the cluster; the outputs from the map operations are then fed to reduce operations, which summarize them.
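As a small, single-process sketch of this decomposition (not the Hadoop implementation itself), the following counts words in a handful of text rows. The map_phase() and reduce_phase() names are illustrative only; in a real cluster the pairs emitted by the map step would be shuffled across machines before being reduced.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit a (key, value) pair for each interesting item in a row.
    Here each row is a line of text and we emit (word, 1)."""
    for row in rows:
        for word in row.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: summarize all of the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

if __name__ == "__main__":
    sample_rows = ["the quick brown fox", "the lazy dog", "the fox"]
    counts = reduce_phase(map_phase(sample_rows))
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```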
The Hadoop streaming interface can be used by Python programmers. This involves a Hadoop "wrapper" that will present the input data to a Python script one row at a time on standard input and gather the script's results from standard output, as tab-separated key/value lines.
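To show the shape of such a script, here is a sketch of a word-count mapper and reducer written for the streaming interface. The wordcount.py file name and the map/reduce command-line argument are illustrative choices, not anything Hadoop requires; what the sketch relies on is the streaming convention that input arrives on standard input, output is written as tab-separated key/value lines, and the reducer sees its input sorted by key.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop streaming mapper and reducer in one file.

Hadoop streaming runs the mapper and the reducer as separate
processes, feeding rows on stdin and reading tab-separated
key/value lines from stdout.
"""
import sys
from itertools import groupby

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Emit one "word<TAB>1" line per word in the input rows."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Sum the counts for each word. Hadoop delivers the mapper output
    to the reducer sorted by key, so groupby() sees each key exactly once."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")

if __name__ == "__main__":
    # Illustrative dispatch: "wordcount.py map" runs the mapper,
    # anything else runs the reducer.
    if sys.argv[1:] == ["map"]:
        mapper()
    else:
        reducer()
```

A job like this can be rehearsed locally with an ordinary shell pipeline, for example cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce, before submitting it to the cluster with the hadoop-streaming JAR's -mapper and -reducer options.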