Amazon Elastic MapReduce
Amazon Web Services (AWS) is one of the most widely used public clouds. It allows users to quickly provision virtual servers on demand and discard them when they are no longer required. While Hadoop was not originally designed to run in such environments, the ability to create large clusters for specific tasks is very appealing in many use cases.
Imagine you need to process application logfiles and prepare the data for loading into a relational database. If this task takes a couple of hours and runs only once a day, there is little reason to keep a Hadoop cluster running around the clock, because it would sit idle for most of the day. In this case, a more practical solution is to provision a virtual cluster using Elastic MapReduce (EMR) and destroy it after the work is done.
EMR clusters don't have to be destroyed and recreated from scratch every time, however. You can also keep a cluster running and use it for interactive work such as Hive queries.
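The two usage patterns described above, a cluster that is discarded once its job finishes and a cluster that stays up for interactive queries, can be sketched in Python. This is only an illustration, not the procedure this chapter follows: it assumes the boto3 library and its EMR run_job_flow call, and the cluster name, S3 locations, release label, instance types, and default role names are all hypothetical placeholders that would have to match your own account.

```python
from typing import Any, Dict


def build_emr_request(name: str, log_uri: str, script_uri: str,
                      keep_alive: bool = False,
                      node_count: int = 3) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    keep_alive=False describes a transient cluster that shuts down when
    its steps finish; keep_alive=True leaves it running for interactive use.
    """
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",           # assumed release label
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,
            # This flag decides whether the cluster outlives its steps.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "Steps": [{
            "Name": "process-logs",
            # A long-running cluster should survive a failed step.
            "ActionOnFailure": "CANCEL_AND_WAIT" if keep_alive
                               else "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", script_uri],
            },
        }],
        # Assumed default roles; they must already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request: Dict[str, Any]) -> str:
    """Submit the request to AWS (needs credentials; incurs charges)."""
    import boto3  # imported lazily so the builder works without boto3
    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    nightly = build_emr_request(
        "nightly-logs",                          # hypothetical names
        "s3://example-bucket/emr-logs/",
        "s3://example-bucket/scripts/prepare_logs.q",
    )
    print(nightly["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

Building the request separately from submitting it makes it possible to inspect or log the cluster definition before anything is provisioned. Passing keep_alive=True to the same builder gives the long-running variant, and calling launch on either request is the only step that actually contacts AWS.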
We will now take you through the steps...