Monitoring strategy overview
A Hadoop monitoring strategy differs from what you might use for traditional databases. When you have a cluster of hundreds of servers, failure of individual components becomes the norm. If you treat the failure of a single DataNode as an emergency, there is a good chance that your monitoring system will be flooded with false alerts.
Instead, it is important to decide which components are critical and which components can fail, up to a certain point, without harm. For critical components, you will need to define rules that alert on-call personnel right away. For non-critical components, regular reports on the overall system status should be enough.
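To make the split between immediate alerts and regular reports concrete, here is a minimal Python sketch of such a rule. It is an illustration under stated assumptions, not the configuration of any particular monitoring tool: the `Status` record, the `triage` function, the `CRITICAL` set, and the 10% threshold in `MAX_DEAD_FRACTION` are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass

# Assumption: components whose failure should page on-call staff immediately.
CRITICAL = {"NameNode", "JobTracker"}

# Assumption: non-critical failures are tolerated until more than 10%
# of the instances of a component are down.
MAX_DEAD_FRACTION = 0.10


@dataclass
class Status:
    component: str  # e.g. "NameNode" or "DataNode"
    host: str
    alive: bool


def triage(statuses):
    """Split failures into immediate pages and entries for the regular report."""
    pages, report = [], []
    counts = {}  # non-critical component -> (dead, total)
    for s in statuses:
        if s.component in CRITICAL:
            if not s.alive:
                pages.append(f"{s.component} down on {s.host}")
            continue
        dead, total = counts.get(s.component, (0, 0))
        counts[s.component] = (dead + (not s.alive), total + 1)
        if not s.alive:
            report.append(f"{s.component} down on {s.host}")
    # Escalate a non-critical component once too many instances have failed.
    for component, (dead, total) in counts.items():
        if dead / total > MAX_DEAD_FRACTION:
            pages.append(f"{dead}/{total} {component} instances down")
    return pages, report
```

With 20 DataNodes and one of them down, `triage` returns no pages and a single report entry. Once three of the 20 are down, the 10% threshold is exceeded and a page is raised as well. The right threshold depends on cluster size and replication settings, so treat the value here as a placeholder.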
You should already have an idea of which Hadoop components' failure should be treated as an emergency. Failure of the NameNode or the JobTracker will make the cluster unusable and should be investigated right away. Even if you have High Availability configured for these components, it is still important to find out...