You're reading from Learning Hadoop 2 Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2

Product type Paperback

Published in Feb 2015

Publisher Packt

ISBN-13 9781783285518

Length 382 pages

Edition 1st Edition

Table of Contents (13) Chapters

Preface

1. Introduction

2. Storage

3. Processing – MapReduce and Beyond

4. Real-time Computation with Samza

5. Iterative Computation with Spark

6. Data Analysis with Apache Pig

7. Hadoop and SQL

8. Data Lifecycle Management

9. Making Development Easier

10. Running a Hadoop Cluster

11. Where to Go Next

Index

AWS – infrastructure on demand from Amazon

Amazon Web Services (AWS) is a set of cloud-computing services offered by Amazon. We will use several of these services in this book.

Simple Storage Service (S3)

Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a storage service that provides a simple key-value storage model. You create, store, and retrieve objects, which can be anything from text files to images to MP3s, through web, command-line, or programmatic interfaces. The model is a two-level hierarchy: you create buckets, and buckets contain objects. Each bucket has a unique identifier, and within each bucket, every object is uniquely named. This simple strategy enables an extremely powerful service for which Amazon takes complete responsibility, covering service scaling as well as the reliability and availability of data.
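To make the bucket-and-object model concrete, here is a minimal in-memory sketch in Python. It is not the AWS API: the class ToyObjectStore, its method names, and the example bucket and key names are all hypothetical and exist only to illustrate how a bucket identifier plus an object name locates a piece of data.

```python
class ToyObjectStore:
    """In-memory stand-in for the bucket/object model (hypothetical, not AWS)."""

    def __init__(self):
        # bucket name -> {object name -> object contents}
        self._buckets = {}

    def create_bucket(self, name):
        # Each bucket has a unique identifier, so reject duplicates.
        if name in self._buckets:
            raise ValueError(f"bucket name already taken: {name}")
        self._buckets[name] = {}

    def put_object(self, bucket, key, data):
        # Within a bucket every object is uniquely named, so writing to an
        # existing name replaces the earlier contents.
        self._buckets[bucket][key] = data

    def get_object(self, bucket, key):
        # Raises KeyError if the bucket or the object does not exist.
        return self._buckets[bucket][key]

    def list_objects(self, bucket):
        return sorted(self._buckets[bucket])


# Example usage with made-up names.
store = ToyObjectStore()
store.create_bucket("example-bucket")
store.put_object("example-bucket", "notes.txt", b"first version")
store.put_object("example-bucket", "song.mp3", b"\x00\x01")
store.put_object("example-bucket", "notes.txt", b"second version")

names = store.list_objects("example-bucket")
latest = store.get_object("example-bucket", "notes.txt")
```

The nested dictionary mirrors the hierarchy described above: the outer key plays the role of the bucket identifier and the inner key the object name. A real service adds everything this sketch leaves out, such as durability, access control, and scaling.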

Elastic MapReduce (EMR)

Amazon's Elastic MapReduce (EMR), found at http://aws.amazon.com/elasticmapreduce/, is essentially Hadoop in the cloud. Using any of its interfaces (web console, CLI, or API), you define a Hadoop workflow with attributes such as the number of Hadoop hosts required and the location of the source data. You then provide the Hadoop code that implements the MapReduce jobs and press the virtual Go button.

In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster that it creates on EC2 (Amazon's on-demand virtual host service), push the results back into S3, and then terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually based on the number of GB stored and the server time used), but the ability to access such data-processing capabilities with no need for dedicated hardware is a compelling one.
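The lifecycle described above (start a cluster, pull input, process, push output, terminate) can be sketched as a small local simulation in Python. Nothing here calls AWS: the JobFlow class, its field names, and the run_transient_job function are hypothetical, a plain dictionary stands in for S3, and a word count stands in for the MapReduce job.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JobFlow:
    """Hypothetical workflow description; fields are illustrative, not EMR parameters."""
    num_hosts: int
    source_key: str
    output_key: str


def run_transient_job(storage, flow):
    """Simulate the lifecycle locally and return the list of stages performed."""
    events = [f"started cluster with {flow.num_hosts} hosts"]

    # 1. Pull the source data from storage.
    text = storage[flow.source_key]
    events.append(f"read {flow.source_key}")

    # 2. Process it; a word count stands in for the real Hadoop job.
    counts = Counter(text.split())
    events.append("ran word count")

    # 3. Push the results back into storage.
    storage[flow.output_key] = dict(counts)
    events.append(f"wrote {flow.output_key}")

    # 4. Terminate the cluster; only the stored results remain.
    events.append("terminated cluster")
    return events


# Example usage with made-up keys and data.
storage = {"input/words.txt": "hadoop s3 hadoop emr"}
flow = JobFlow(num_hosts=3, source_key="input/words.txt", output_key="output/counts")
events = run_transient_job(storage, flow)
```

The point of the sketch is the shape of the workflow rather than the computation: the cluster exists only between the first and last events, while the input and output outlive it in storage.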
