Analyzing a large dataset
Now that we can write MapReduce jobs in both Java and Streaming, we'll explore a more substantial dataset than any we've looked at so far. In the following sections, we will show how to approach such an analysis and the sorts of questions Hadoop allows you to ask of a large dataset.
Getting the UFO sighting dataset
We will use a public-domain dataset of over 60,000 UFO sightings. The dataset is hosted by InfoChimps at http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada.
You will need to register for a free InfoChimps account to download a copy of the data.
The data comprises a series of UFO sighting records with the following fields:
Sighting date: This field gives the date when the UFO sighting occurred.
Recorded date: This field gives the date when the sighting was reported, which often differs from the sighting date.
Location: This field gives the location where the sighting occurred.
Shape: This field gives...