Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Q1
5. Though some general guidelines are possible, and you may need to generalize if your cluster will be running a variety of jobs, the best fit depends on the anticipated workload.
Q2
4. Network storage comes in many flavors, but in many cases you may find a large Hadoop cluster of hundreds of hosts reliant on a single storage device (or, more usually, a pair). This adds a new failure scenario to the cluster, and one more likely to occur than many others. Where storage technology does address failure mitigation, it is usually through disk-level redundancy such as RAID; these disk arrays can be highly performant, but usually at the cost of a penalty on either reads or writes. Giving Hadoop control of its own failure handling, and allowing it full parallel access to the same number of disks, is likely to give higher overall performance.
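To see why the shared storage pair tends to become the bottleneck, here is a back-of-the-envelope sketch comparing aggregate streaming bandwidth in the two layouts. Every figure in it (host count, disks per host, per-disk throughput, link speed) is an illustrative assumption for the sake of the comparison, not a measurement from the book:

# Rough model: local JBOD disks versus the same disks behind a
# shared pair of storage devices. All numbers are assumptions.
HOSTS = 100                # hosts in the cluster
DISKS_PER_HOST = 4         # local disks per host (JBOD)
DISK_MB_S = 100            # sequential throughput of one disk, MB/s
LINK_MB_S = 1250           # ~10 GbE link into each storage device, MB/s
STORAGE_DEVICES = 2        # the single pair of shared devices

# Local disks: every disk streams in parallel, bounded only by itself.
local_aggregate = HOSTS * DISKS_PER_HOST * DISK_MB_S

# Shared storage: the disks may be just as fast, but every byte must
# cross the links into the storage pair, which caps the total.
shared_aggregate = min(HOSTS * DISKS_PER_HOST * DISK_MB_S,
                       STORAGE_DEVICES * LINK_MB_S)

print("Local JBOD aggregate:     ", local_aggregate, "MB/s")   # 40,000
print("Shared storage aggregate: ", shared_aggregate, "MB/s")  # 2,500

Under these assumed figures, the same number of spindles delivers an order of magnitude more aggregate bandwidth when Hadoop can read them all locally in parallel.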
Q3
3. Probably! We would suggest avoiding configuration A: though it has just enough raw storage and is far from underpowered, there is a good chance it will provide little room for growth. An increase in data volumes would immediately require new hosts, and greater complexity in the MapReduce jobs could demand more processor power or memory. Configurations B and C both look good, as they have surplus storage for growth and provide similar headroom in both processor and memory. B will have the higher disk I/O and C the better CPU performance. Since the primary job involves financial modeling and forecasting, we expect each task to be reasonably heavyweight in terms of CPU and memory needs. Configuration B may have higher I/O, but if the processors are running at 100 percent utilization, it is likely the extra disk throughput will not be used. So the hosts with greater processor power are likely the better fit. Configuration D is more than adequate for the task, and we don't choose it for that very reason; why buy more capacity than we know we need?
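As a rough guide to why "just enough raw storage" is a warning sign, the sketch below estimates the raw disk a cluster actually needs once replication, temporary MapReduce space, and growth room are accounted for. The replication factor of 3 is the HDFS default; the temporary-space and growth figures are assumptions chosen purely for illustration:

def raw_storage_needed(data_tb, replication=3,
                       temp_overhead=0.25, growth_factor=1.5):
    """Estimate raw TB needed for a given volume of source data."""
    replicated = data_tb * replication            # HDFS stores each block 3 times by default
    with_temp = replicated * (1 + temp_overhead)  # space for intermediate map output and the like
    return with_temp * growth_factor              # headroom before new hosts are needed

# Example: 10 TB of source data needs roughly 56 TB of raw disk.
print(round(raw_storage_needed(10)), "TB raw")

A configuration sized only for the replicated data, as A appears to be, leaves nothing for intermediate output or growth, which is exactly the trap described above.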