Packt+ | Advance your knowledge in tech

You're reading from Optimizing Hadoop for MapReduce This book is the perfect introduction to sophisticated concepts in MapReduce and will ensure you have the knowledge to optimize job performance. This is not an academic treatise; it's an example-driven tutorial for the real world.

Product type Paperback

Published in Feb 2014

Publisher

ISBN-13 9781783285655

Length 120 pages

Edition Edition

Tools

Hadoop

Concepts

Data Processing

Author (1):

Khaled Tannir

View More author details

Table of Contents (15) Chapters

Optimizing Hadoop for MapReduce

Credits

About the Author

Acknowledgments

About the Reviewers

www.PacktPub.com

Preface

1. Understanding Hadoop MapReduce

2. An Overview of the Hadoop Parameters FREE CHAPTER

3. Detecting System Bottlenecks

4. Identifying Resource Weaknesses

5. Enhancing Map and Reduce Tasks

6. Optimizing MapReduce Tasks

7. Best Practices and Recommendations

Index

The MapReduce model

MapReduce is a programming model designed for processing unstructured data by large clusters of commodity hardware and generating large datasets. It is capable of processing many terabytes of data on thousands of computing nodes in a cluster, handling failures, duplicating tasks, and aggregating results.

The MapReduce model is simple to understand. It was designed in the early 2000s by the engineers at Google Research (http://research.google.com/archive/mapreduce.html). It consists of two functions, a map function and a reduce function that can be executed in parallel on multiple machines.

To use MapReduce, the programmer writes a user-defined map function and a user-defined reduce function that expresses their desired computation. The map function reads a key/value pair, applies the user specific code, and produces results called intermediate results. Then, these intermediate results are aggregated by the reduce user-specific code that outputs the final results.

Input to a MapReduce application is organized in the records as per the input specification that will yield key/value pairs, each of which is a <k1, v1> pair.

Therefore, the MapReduce process consists of two main phases:

map(): The user-defined map function is applied to all input records one by one, and for each record it outputs a list of zero or more intermediate key/value pairs, that is, <k2, v2> records. Then all <k2, v2> records are collected and reorganized so that records with the same keys (k2) are put together into a <k2, list(v2)> record.
reduce(): The user-defined reduce function is called once for each distinct key in the map output, <k2, list(v2)> records, and for each record the reduce function outputs zero or more <k2, v3> pairs. All <k2, v3> pairs together coalesce into the final result.
Tip
The signatures of the map and reduce functions are as follows:
- map(<k1, v1>) list(<k2, v2>)
- reduce(<k2, list(v2)>) <k2, v3>

The MapReduce programming model is designed to be independent of storage systems. MapReduce reads key/value pairs from the underlying storage system through a reader. The reader retrieves each record from the storage system and wraps the record into a key/value pair for further processing. Users can add support for a new storage system by implementing a corresponding reader. This storage-independent design is considered to be beneficial for heterogeneous systems since it enables MapReduce to analyze data stored in different storage systems.

To understand the MapReduce programming model, let's assume you want to count the number of occurrences of each word in a given input file. Translated into a MapReduce job, the word-count job is defined by the following steps:

The input data is split into records.
Map functions process these records and produce key/value pairs for each word.
All key/value pairs that are output by the map function are merged together, grouped by a key, and sorted.
The intermediate results are transmitted to the reduce function, which will produce the final output.

The overall steps of this MapReduce application are represented in the following diagram:

While aggregating key/value pairs, a massive amount of I/O and network traffic I/O can be observed. To reduce the amount of network traffic required between the map and reduce steps, the programmer can optionally perform a map-side pre-aggregation by supplying a Combiner function. Combiner functions are similar to the reduce function, except that they are not passed all the values for a given key; instead, a Combiner function emits an output value that summarizes the input values it was passed.

The rest of the chapter is locked

You're reading from Optimizing Hadoop for MapReduce This book is the perfect introduction to sophisticated concepts in MapReduce and will ensure you have the knowledge to optimize job performance. This is not an academic treatise; it's an example-driven tutorial for the real world.

Table of Contents (15) Chapters

The MapReduce model

Tip

Authors (1)

Personalised recommendations for you

You're reading from Optimizing Hadoop for MapReduce This book is the perfect introduction to sophisticated concepts in MapReduce and will ensure you have the knowledge to optimize job performance. This is not an academic treatise; it's an example-driven tutorial for the real world.

Table of Contents (15) Chapters

The MapReduce model

Tip

Unlock this book and the full library FREE for 7 days

Authors (1)

Personalised recommendations for you