Writing MapReduce programs
In this chapter, we will focus on batch workloads: given a set of historical data, we will examine the properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.
Getting started
In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:
$ python stream.py -t -n 1000 > tweets.txt
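Before moving the file into HDFS, we can sanity-check it; assuming stream.py writes one tweet per line, a quick line count should report roughly 1,000 lines:
$ wc -l tweets.txt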
We can then copy the dataset into HDFS with:
$ hdfs dfs -put tweets.txt <destination>
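To confirm the upload succeeded, we can list the destination directory (the path being whatever you chose above):
$ hdfs dfs -ls <destination>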
Tip
Note that until now we have been working only with the text of tweets. In the remainder of this book, we'll extend stream.py to output additional tweet metadata in JSON format. Keep this in mind before dumping terabytes of messages with stream.py.
Our first MapReduce program will be the canonical WordCount example...
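As a preview, the following is a minimal sketch of such a job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It mirrors the structure of the canonical example from the Hadoop documentation; the class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative rather than taken from this chapter:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in a line of input
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once packaged into a JAR, a job like this would be submitted with the hadoop jar command, for example (the JAR name here is an assumption):
$ hadoop jar wordcount.jar WordCount tweets.txt counts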