MapReduce
Massively parallel processing of large datasets is a complex process. MapReduce simplifies this by providing a design pattern that requires algorithms to be expressed in map and reduce phases. Map can be used to perform simple transformations on data, and reduce is used to group data together and perform aggregations.
By chaining together a number of map and reduce phases, sophisticated algorithms can be achieved. The shared-nothing architecture of MapReduce prohibits communication between map tasks of the same phase or between reduce tasks of the same phase. Any communication that is required happens at the end of each phase.
The simplicity of this model allows Hadoop to translate each phase, depending on the amount of data that needs to be processed, into tens or even hundreds of tasks executed in parallel, thus achieving scalable performance.
Internally, the map and reduce tasks follow a simple data representation: everything is a key-value pair. A map task receives key-value pairs and applies basic transformations, emitting new key-value pairs. Data is then partitioned, and different partitions are transmitted to different reduce tasks. A reduce task also receives key-value pairs, groups them based on the key, and applies basic transformations to those groups.
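To make the partitioning step concrete, here is a toy sketch in plain Java (not Hadoop code; the class and key names are illustrative) of how a map task's output pairs could be routed to reduce tasks. The formula shown mirrors the one used by Hadoop's default HashPartitioner:

public class PartitionSketch {
    // Route a key to a reduce task: non-negative hash of the key,
    // modulo the number of reduce tasks (Hadoop's default scheme).
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every pair with the same key lands on the same reduce task,
        // which is what lets a reducer group and aggregate by key.
        for (String key : new String[] {"alpha", "beta", "gamma"}) {
            System.out.println(key + " -> reduce task " + partition(key, 2));
        }
    }
}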
A MapReduce example
To illustrate how MapReduce works, let's look at an example using a 1 GB log file with the following format:
INFO MyApp - Entering application.
WARNING com.foo.Bar - Timeout accessing DB - Retrying
ERROR com.foo.Bar - Did it again!
INFO MyApp - Exiting application
Once this file is stored in HDFS, it is split into eight 128 MB blocks and distributed across multiple Hadoop nodes. In order to build a MapReduce job that counts the number of INFO, WARNING, and ERROR log lines in the file, we need to think in terms of map and reduce phases.
In the map phase, we can read local blocks of the file and map each line to a key and a value. We can use the log level as the key and the number 1 as the value. After the map phase is completed, data is partitioned based on the key and transmitted to the reduce tasks.
MapReduce guarantees that the input to every reducer is sorted by key. The shuffle is the process of sorting the output of the map tasks and copying it to the reducers to be used as input. By setting the value to 1 in the map phase, we can easily calculate the total in the reduce phase. Reducers receive their input sorted by key, aggregate the counters, and store the results.
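As a quick illustration, the following plain-Java sketch (not Hadoop code; all names are illustrative) walks through the counting scheme on the four sample log lines. The map step emits a <level, 1> pair per line, the shuffle groups and sorts the pairs by key, and the reduce step sums the 1s:

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CountSketch {
    public static void main(String[] args) {
        // Map output: one <logLevel, 1> pair per log line.
        List<String> mapKeys =
                List.of("INFO", "WARNING", "ERROR", "INFO");

        // Shuffle: group by key; a TreeMap mimics the sorted-by-key
        // guarantee that reducers receive.
        Map<String, Integer> totals = new TreeMap<>();
        for (String level : mapKeys) {
            // Reduce: aggregate the 1s into a total per key.
            totals.merge(level, 1, Integer::sum);
        }
        System.out.println(totals); // {ERROR=1, INFO=2, WARNING=1}
    }
}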
In the following diagram, every green block represents an INFO message, every yellow block a WARNING message, and every red block an ERROR message:
Implementing the preceding MapReduce algorithm in Java requires the following three classes (a sketch of all three follows the list):
- A Map class to map lines into <key,value> pairs; for example, <"INFO",1>
- A Reduce class to aggregate counters
- A Job configuration class to define the input and output types for all <key,value> pairs and the input and output files
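The following is a minimal sketch of those three classes, assuming Hadoop's org.apache.hadoop.mapreduce API; the class names, the job name, and the first-token parsing of the log level are illustrative choices, not taken from the original text:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogLevelCount {

    // Map class: emits <logLevel, 1> for every line, e.g. <"INFO", 1>.
    public static class LogLevelMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text level = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Illustrative assumption: the log level is the first
            // token of the line, as in the sample format above.
            String[] tokens = value.toString().split(" ", 2);
            if (!tokens[0].isEmpty()) {
                level.set(tokens[0]);
                context.write(level, ONE);
            }
        }
    }

    // Reduce class: input arrives grouped and sorted by key, so the
    // total per log level is just the sum of the 1s.
    public static class LogLevelReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    // Job configuration class: defines the <key,value> types and the
    // input and output paths (taken from the command line here).
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log level count");
        job.setJarByClass(LogLevelCount.class);
        job.setMapperClass(LogLevelMapper.class);
        // Optional: pre-aggregate on the map side; valid because the
        // reduce function is an associative, commutative sum.
        job.setCombinerClass(LogLevelReducer.class);
        job.setReducerClass(LogLevelReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, a job like this would typically be submitted with the hadoop jar command, passing the input and output paths as the two command-line arguments.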