Chapter 6. When Things Break
One of Hadoop's main promises is resilience: the ability to survive failures when they do happen. Fault tolerance is the focus of this chapter.
In particular, we will cover the following topics:
How Hadoop handles failures of DataNodes and TaskTrackers
How Hadoop handles failures of the NameNode and JobTracker
The impact of hardware failure on Hadoop
How to deal with task failures caused by software bugs
How dirty data can cause tasks to fail and what to do about it
Along the way, we will deepen our understanding of how the various components of Hadoop fit together and identify some best practices.