Handling failures in YARN
Successful execution of a YARN application depends on robust coordination among all the YARN components, including the containers, ApplicationMaster, NodeManager, and ResourceManager. Any fault in that coordination, or a shortage of cluster resources, can cause the application to fail. The YARN framework is designed to handle failures at different stages of the application execution path. How an application tolerates and recovers from a fault depends on its current stage of execution and on the component in which the problem occurs. The following section explains the recovery mechanism applied by YARN at the component level.
Container failure
Containers are instantiated to execute map or reduce tasks. As mentioned in the previous section, these containers in Hadoop-YARN are Java processes that run as YarnChild processes. A task may throw an exception during execution, or the JVM may terminate abnormally because of insufficient resources. The failure...