A word count program in Hadoop
Perhaps the simplest way to get started with programming for Hadoop is a word count over a fairly large electronic book. The map program reads each line of the text, splits it into words on spaces or tabs, and emits a key-value pair for every word, with the word as the key and a count of 1 as the value. The reduce program reads all the key-value pairs from the map program and sums the counts for identical words. Hadoop then produces an output file that lists each word in the book along with the number of times it appears.
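To make the two steps concrete, here is a minimal Python sketch of the same map and reduce logic, run locally rather than on a Hadoop cluster. The function names map_words and reduce_counts and the sample lines are illustrative assumptions, not part of Hadoop, and the call to sorted() only stands in for the shuffle-and-sort phase that Hadoop performs between the map and reduce programs.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        # str.split() with no argument splits on spaces and tabs alike.
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts of identical words.

    The pairs are sorted by key first (an assumption that mimics the
    ordering Hadoop guarantees for the input of a reduce program).
    """
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Hypothetical sample input standing in for lines of the e-book.
    sample = ["the quick brown fox", "jumps over\tthe lazy dog"]
    for word, total in reduce_counts(map_words(sample)):
        print(f"{word}\t{total}")
```

Note that this sketch treats words as case-sensitive and leaves punctuation attached, so "The" and "the," would be counted as different words from "the".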
Downloading sample data
Project Gutenberg hosts over 100,000 free e-books in HTML, EPUB, Kindle, and plain-text UTF-8 formats. As a sample e-book for our tests, let's use Ulysses by James Joyce. The link for the plain-text UTF-8 file is http://www.gutenberg.org/ebooks/4300.txt.utf-8. Using Firefox or any other web browser available in the CentOS virtual machine, you can download the file from this URL and save it as pg4300...