Time for action – summarizing the shape data
Just as we provided a summarization for the overall UFO data set earlier, let's now do a more focused summarization on the data provided for UFO shapes:
Save the following to shapemapper.rb:
#!/usr/bin/env ruby

while line = gets
  parts = line.split("\t")
  if parts.size == 6
    shape = parts[3].strip
    puts shape+"\t1" if !shape.empty?
  end
end
Make the file executable:
$ chmod +x shapemapper.rb
Execute the job once again using the WordCount reducer:
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file shapemapper.rb -mapper shapemapper.rb -file wcreducer.rb -reducer wcreducer.rb -input ufo.tsv -output shapes
Retrieve the shape info:
$ hadoop fs -cat shapes/part-00000
What just happened?
Our mapper here is pretty simple. It breaks each record into its constituent fields, discards any record without exactly six fields, and for each non-empty shape value outputs the shape followed by a tab and a count of 1...
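To see this behaviour without submitting a job, the mapper's logic can be tried on a few lines locally. The following is a minimal sketch, not part of the original scripts: map_shape and sum_counts are hypothetical helper names, the sample records are made up rather than taken from ufo.tsv, and sum_counts only stands in for the summing that a WordCount-style reducer is assumed to perform.

```ruby
#!/usr/bin/env ruby

# Hypothetical helper: the same filtering logic as shapemapper.rb,
# wrapped in a method so it can be exercised without Hadoop.
# Returns "shape\t1", or nil when the record is skipped.
def map_shape(line)
  parts = line.split("\t")
  return nil unless parts.size == 6
  shape = parts[3].strip
  shape.empty? ? nil : "#{shape}\t1"
end

# Hypothetical stand-in for the reduce step: sum the counts per key.
def sum_counts(pairs)
  totals = Hash.new(0)
  pairs.each do |pair|
    key, count = pair.split("\t")
    totals[key] += count.to_i
  end
  totals
end

# Made-up records; the fourth tab-separated field is the shape.
sample = [
  "a\tb\tc\tdisk\te\tf",
  "a\tb\tc\t light \te\tf",  # surrounding whitespace is stripped
  "a\tb\tc\t\te\tf",         # empty shape: skipped
  "a\tb\tc\tdisk",           # only four fields: skipped
  "a\tb\tc\tdisk\te\tf"
]

pairs = sample.map { |line| map_shape(line) }.compact
sum_counts(pairs).each { |shape, total| puts "#{shape}\t#{total}" }
```

Only three of the five sample records survive the mapper, which illustrates why the shape totals from the job need not add up to the number of lines in the input file.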