[box type="note" align="" class="" width=""]The article given below is a book extract from Java Data Analysis written by John R. Hubbard. The book will give you the most out of popular Java libraries and tools to perform efficient data analysis.[/box]
In this article, we will explore Google’s MapReduce framework to analyze big data.
How do you quickly sort a list of billion elements? Or multiply two matrices, each with a million rows and a million columns?
In implementing their PageRank algorithm, Google quickly discovered the need for a systematic framework for processing massive datasets. That could be done only by distributing the data and the processing over many storage units and processors. Implementing a single algorithm, such as PageRank in that environment is difficult, and maintaining the implementation as the dataset grows is even more challenging.
The answer lay in separating the software into two levels: a framework that manages the big data access and parallel processing at a lower level, and a couple of user-written methods at an upper-level. The independent user who writes the two methods need not be concerned with the details of the big data management at the lower level.
Specifically, the data flows through a sequence of stages:
The user's choice of map()
and reduce()
methods determines the outcome of the entire process; hence the name MapReduce.
This idea is a variation on the old algorithmic paradigm called divide and conquer. Think of the proto-typical mergesort, where an array is sorted by repeatedly dividing it into two halves until the pieces have only one element, and then they are systematically pairwise merged back together.
MapReduce is actually a meta-algorithm—a framework, within which specific algorithms can be implemented through its map()
and reduce()
methods. Extremely powerful, it has been used to sort a petabyte of data in only a few hours. Recall that a petabyte is 10005 = 1015 bytes, which is a thousand terabytes or a million gigabytes.
Here are a few examples of big data problems that can be solved with the MapReduce framework:
map()
function. In this section, we present the MapReduce solution to the WordCount problem, sometimes called the Hello World example for MapReduce.
The diagram in the figure below shows the data flow for the WordCount program. On the left are two of the 80 files that are read into the program:
During the mapping stage, each word, followed by the number 1, is copied into a temporary file, one pair per line. Notice that many words are duplicated many times. For example, image
appears five times among the 80 files (including both files shown), so the string image 1
will appear four times in the temporary file. Each of the input files has about 110 words, so over 8,000 word-number pairs will be written to the temporary file.
Note that this figure shows only a very small part of the data involved. The output from the mapping stage includes every word that is input, as many times that it appears. And the output from the grouping stage includes every one of those words, but without duplication.
The grouping process reads all the words from the temporary file into a key-value hash table, where the key is the word, and the value is a string of 1s, one for each occurrence of that word in the temporary file. Notice that these 1s written to the temporary file are not used. They are included simply because the MapReduce framework in general expects the map()
function to generate key-value pairs.The reducing stage transcribed the contents of the hash table to an output file, replacing each string of 1s with the number of them. For example, the key-value pair ("book", "1 1 1 1")
is written as book 4
in the output file.
Keep in mind that this is a toy example of the MapReduce process. The input consists of 80 text files containing about 9073 words. So, the temporary file has 9073 lines, with one word per line. Only 2149 of those words are distinct, so the hash table has 2149 entries and the output file has 2149 lines, with one word per line.
So, this is the main idea of the MapReduce meta-algorithm: provide a framework for processing massive datasets, a framework that allows the independent programmer to plug in specialized map()
and reduce()
methods that actually implement the required particular algorithm. If that particular algorithm is to count words, then write the map()
method to extract each individual word from a specified file and write the key-value pair (word, 1)
to wherever the specified writer
will put them, and write the reduce()
method to take a key-value pair such as
(word, 1 1 1 1)
and return the corresponding key-value pair as (word, 4)
to wherever its specified writer
will put it. These two methods are completely localized—they simply operate on key-value pairs. And, they are completely independent of the size of the dataset.
The diagram below illustrates the general flow of data through an application of the MapReduce framework:
The original dataset could be in various forms and locations: a few files in a local directory, a large collection of files distributed over several nodes on the same cluster, a database on a database system (relational or NoSQL), or data sources available on the World Wide Web. The MapReduce controller then carries out these five tasks:
map()
method, one on each dataset, producing a set of key-value pairs in a temporary file on that local machine.reduce()
method, one on each of the temporary files, producing one output file on each machine. reduce()
method also sorts its output, then this last step could also include merging those outputs.The genius of the MapReduce framework is that it separates the data management (moving, partitioning, grouping, sorting, and so on) from the data crunching (counting, averaging, maximizing, and so on). The former is done with no attention required by the user. The latter is done in parallel, separately in each node, by invoking the two user-supplied methods map()
and reduce()
. Essentially, the only obligation of the user is to devise the correct implementations of these two methods that will solve the given problem.
As we mentioned earlier, these examples are presented mainly to elucidate how the MapReduce algorithm works. Real-world implementations would, however, use MongoDB or Hadoop frameworks.
If you enjoyed this excerpt, check out the book Java Data Analysis to get an understanding of the various data analysis techniques, and how to implement them using Java.