Packt+ | Advance your knowledge in tech

You're reading from Optimizing Hadoop for MapReduce This book is the perfect introduction to sophisticated concepts in MapReduce and will ensure you have the knowledge to optimize job performance. This is not an academic treatise; it's an example-driven tutorial for the real world.

Product type Paperback

Published in Feb 2014

Publisher

ISBN-13 9781783285655

Length 120 pages

Edition Edition

Tools

Hadoop

Concepts

Data Processing

Author (1):

Khaled Tannir

View More author details

Table of Contents (15) Chapters

Optimizing Hadoop for MapReduce

Credits

About the Author

Acknowledgments

About the Reviewers

www.PacktPub.com

Preface

1. Understanding Hadoop MapReduce FREE CHAPTER

2. An Overview of the Hadoop Parameters

3. Detecting System Bottlenecks

4. Identifying Resource Weaknesses

5. Enhancing Map and Reduce Tasks

6. Optimizing MapReduce Tasks

7. Best Practices and Recommendations

Index

An overview of Hadoop MapReduce

Hadoop is the most popular open source Java implementation of the MapReduce programming model proposed by Google. There are many other implementations of MapReduce (such as Sphere, Starfish, Riak , and so on), which implement all the features described in the Google documentation or only a subset of these features.

Hadoop consists of distributed data storage engine and MapReduce execution engine. It has been successfully used for processing highly distributable problems across a large amount of datasets using a large number of nodes. These nodes collectively form a Hadoop cluster, which in turn consists of a single master node called JobTracker, and multiple worker (or slave) nodes; each worker node is called a TaskTracker. In this framework, a user program is called a job and is divided into two steps: map and reduce.

Like in the MapReduce programing model, the user has to only define the map and reduce functions when using the Hadoop MapReduce implementation. The Hadoop MapReduce system automatically parallelizes the execution of these functions and ensures fault tolerance.

Tip

To learn more about the Hadoop MapReduce implementation, you can browse Hadoop's official website at http://hadoop.apache.org/.

Basically, the Hadoop MapReduce framework utilizes a distributed filesystem to read and write its data. This distributed filesystem is called Hadoop Distributed File System (HDFS), which is the open source counterpart of the Google File System (GFS). Therefore, the I/O performance of a Hadoop MapReduce job strongly depends on HDFS.

HDFS consists of a master node called NameNode, and slave nodes called DataNodes. Within the HDFS, data is divided into fixed-size blocks (chunks) and spread across all DataNodes in the cluster. Each data block is typically replicated with two replicas: one placed within the same rack and the other placed outside it. NameNode keeps track of which DataNodes hold replicas of which block.

You're reading from Optimizing Hadoop for MapReduce This book is the perfect introduction to sophisticated concepts in MapReduce and will ensure you have the knowledge to optimize job performance. This is not an academic treatise; it's an example-driven tutorial for the real world.

Table of Contents (15) Chapters

An overview of Hadoop MapReduce

Tip

Authors (1)

Personalised recommendations for you