Chapter 4. Developing MapReduce Programs
Now that we have explored the technology of MapReduce, we will spend this chapter looking at how to put it to use. In particular, we will take a more substantial dataset and look at ways to approach its analysis by using the tools provided by MapReduce.
In this chapter we will cover the following topics:
Hadoop Streaming and its uses
The UFO sighting dataset
Using Streaming as a development/debugging tool
Using multiple mappers in a single job
Efficiently sharing utility files and data across the cluster
Reporting job and task status and log information useful for debugging
Throughout this chapter, the goal is to introduce both concrete tools and ideas about how to approach the analysis of a new dataset. We shall start by looking at how to use scripting languages to aid MapReduce prototyping and initial analysis. Though it may seem strange to learn the Java API in the previous chapter and immediately move to different languages, our goal here...