Time for action – summarizing the UFO data
Now that we have the data, let's get an initial summary of its size and of how many records may be incomplete:
With the UFO tab-separated value (TSV) file on HDFS saved as ufo.tsv, save the following file to summarymapper.rb:
#!/usr/bin/env ruby
while line = gets
  puts "total\t1"
  parts = line.split("\t")
  puts "badline\t1" if parts.size != 6
  puts "sighted\t1" if !parts[0].empty?
  puts "recorded\t1" if !parts[1].empty?
  puts "location\t1" if !parts[2].empty?
  puts "shape\t1" if !parts[3].empty?
  puts "duration\t1" if !parts[4].empty?
  puts "description\t1" if !parts[5].empty?
end
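Before submitting the job, it can help to see which key/value pairs the mapper logic emits for a single record. The following is a minimal sketch of that logic wrapped in a function so it can be tried locally. The names `FIELD_NAMES` and `summarize` and the sample records are invented for illustration and are not part of the original script. The sketch also adds a `to_s` guard, a deviation from the listing, so that a line with fewer than six fields is still counted rather than hitting a nil field.

```ruby
# Field names in the order used by the mapper (hypothetical constant name).
FIELD_NAMES = %w[sighted recorded location shape duration description].freeze

# Hypothetical helper: returns the "key\t1" lines emitted for one input line.
def summarize(line)
  out = ["total\t1"]
  parts = line.split("\t")
  out << "badline\t1" if parts.size != 6
  FIELD_NAMES.each_with_index do |name, i|
    # to_s turns a missing (nil) field into "", so short lines do not raise.
    out << "#{name}\t1" unless parts[i].to_s.empty?
  end
  out
end

# Invented sample records: one with empty shape and duration, one truncated.
puts summarize("20000101\t20000102\tSpringfield, XX\t\t\tBright light\n")
puts "--"
puts summarize("20000101\t20000102\n")
```

The first record produces a count for every field except shape and duration; the second is flagged as a bad line because it does not split into exactly six fields.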
Make the file executable with the following command:
$ chmod +x summarymapper.rb
Run the job using Hadoop Streaming as follows:
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file summarymapper.rb -mapper summarymapper.rb -file wcreducer.rb -reducer wcreducer.rb -input ufo.tsv -output ufosummary
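The command reuses wcreducer.rb, which is not listed in this section. As a rough illustration of what a reducer has to do with the mapper's output, here is a minimal sketch that sums the counts for each key. It is not the wcreducer.rb file itself; the function name `sum_counts` and the sample input are assumptions made for the example. It relies on the fact that Streaming delivers reducer input sorted by key, so all lines for a given key arrive together.

```ruby
# Hypothetical helper: sums the counts in key-sorted "key\tcount" lines
# and returns one "key\ttotal" string per distinct key.
def sum_counts(lines)
  results = []
  current = nil
  total = 0
  lines.each do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current
      # A new key starts: emit the finished key, then reset the running total.
      results << "#{current}\t#{total}" if current
      current = key
      total = 0
    end
    total += value.to_i
  end
  # Emit the last key, if any input was seen at all.
  results << "#{current}\t#{total}" if current
  results
end

# Invented, already-sorted sample of mapper output.
sample = ["badline\t1", "sighted\t1", "sighted\t1", "total\t1", "total\t1"]
puts sum_counts(sample)
```

In an actual Streaming script, the same function could be fed from standard input, for example with `puts sum_counts(STDIN.each_line)`.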
Retrieve the summary data...