Building a cluster on EMR
Elastic MapReduce is a flexible solution that, depending on requirements and workloads, can sit next to, or replace, a physical Hadoop cluster. As we've seen so far, EMR provides clusters preloaded and configured with Hive, Streaming, and Pig, as well as custom JAR clusters that allow the execution of MapReduce applications.
A second distinction to make is between transient and long-running life cycles. A transient EMR cluster is created on demand; data is loaded into S3 or HDFS, a processing workflow is executed, the output is stored, and the cluster is automatically shut down. A long-running cluster is kept alive after the workflow terminates, so it remains available for new data to be copied over and new workflows to be executed. Long-running clusters are typically well suited to data warehousing, or to datasets large enough that repeatedly loading and processing them on transient clusters would be inefficient.
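As a rough illustration of the two life cycles, the following Python sketch builds a request shaped like the input of the EMR RunJobFlow API, where the `KeepJobFlowAliveWhenNoSteps` flag decides whether the cluster shuts down once its steps finish. The helper name `build_cluster_request`, the bucket and JAR paths, the instance types, and the release label are illustrative assumptions, not values taken from this chapter; the sketch only constructs the request and does not contact AWS.

```python
def build_cluster_request(name, jar_uri, jar_args, long_running):
    """Build a dict shaped like the EMR RunJobFlow request (sketch only)."""
    # A transient cluster terminates on step failure; a long-running one
    # cancels pending steps and waits for new work instead.
    on_failure = "CANCEL_AND_WAIT" if long_running else "TERMINATE_CLUSTER"
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",  # assumed release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False: shut down when no steps remain (transient).
            # True: stay alive for further workflows (long-running).
            "KeepJobFlowAliveWhenNoSteps": long_running,
        },
        "Steps": [
            {
                "Name": "custom-jar-step",
                "ActionOnFailure": on_failure,
                "HadoopJarStep": {"Jar": jar_uri, "Args": list(jar_args)},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical S3 locations, used purely for illustration.
transient = build_cluster_request(
    "nightly-batch",
    "s3://example-bucket/jobs/wordcount.jar",
    ["s3://example-bucket/input/", "s3://example-bucket/output/"],
    long_running=False,
)
warehouse = build_cluster_request(
    "warehouse",
    "s3://example-bucket/jobs/wordcount.jar",
    ["s3://example-bucket/input/", "s3://example-bucket/output/"],
    long_running=True,
)
```

Assuming the boto3 library is installed and AWS credentials are configured, a dictionary like `transient` could be submitted with `boto3.client("emr").run_job_flow(**transient)`. Only the single flag and the failure policy differ between the two requests, which is why the same workflow can usually be moved from one life cycle to the other with little effort.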
In a must-read white...