NodeManager failures
Almost every node in the cluster runs a NodeManager service daemon. The NodeManager executes the part of a YARN job assigned to its own machine, while other parts are executed on other nodes. In a 1000-node YARN cluster, there are probably around 999 NodeManagers running. The NodeManager is therefore a per-node agent: each one takes care of a single node in the cluster.
If a NodeManager fails, the ResourceManager detects the failure through a time-out (that is, it stops receiving heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers. It also kills all the containers running on that node and reports the failure to all running ApplicationMasters (AMs). The AMs are then responsible for reacting to the node failure by redoing the work of any containers that were running on that node when the fault occurred.
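The detection sequence described above can be sketched as a small model. This is an illustrative toy, not YARN's actual code: the class LivenessMonitor, its methods, and the node, container, and application names are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow. In stock Hadoop the time-out itself is, as far as we know, controlled by the yarn.nm.liveness-monitor.expiry-interval-ms property (10 minutes by default); the 600-second value below only mirrors that.

```python
from dataclasses import dataclass, field


@dataclass
class LivenessMonitor:
    """Toy model of how a ResourceManager might track NodeManager liveness.

    All names here are hypothetical; this is not the YARN implementation.
    """
    expiry_interval: float                               # seconds of silence before a node is declared lost
    last_heartbeat: dict = field(default_factory=dict)   # node -> time of last heartbeat
    containers: dict = field(default_factory=dict)       # node -> {container_id: app_id}
    apps: set = field(default_factory=set)               # applications with a running AM
    notifications: list = field(default_factory=list)    # (app_id, lost node, that app's lost containers)

    def heartbeat(self, node, now):
        """Record a heartbeat received from a NodeManager."""
        self.last_heartbeat[node] = now

    def launch(self, node, container_id, app_id):
        """Record that a container for app_id is running on node."""
        self.apps.add(app_id)
        self.containers.setdefault(node, {})[container_id] = app_id

    def expire(self, now):
        """Remove nodes whose heartbeats have timed out and notify every AM."""
        lost = [n for n, t in self.last_heartbeat.items()
                if now - t > self.expiry_interval]
        for node in lost:
            # 1. Drop the node from the pool of available NodeManagers.
            del self.last_heartbeat[node]
            # 2. Treat every container on that node as killed.
            killed = self.containers.pop(node, {})
            # 3. Report the failure to all running AMs; each AM learns which
            #    of its own containers were lost and must redo that work.
            for app in sorted(self.apps):
                mine = sorted(c for c, a in killed.items() if a == app)
                self.notifications.append((app, node, mine))
        return lost


rm = LivenessMonitor(expiry_interval=600.0)
rm.heartbeat("node-1", now=0.0)
rm.heartbeat("node-2", now=0.0)
rm.launch("node-1", "container-1", "app-A")
rm.launch("node-2", "container-2", "app-B")
rm.heartbeat("node-2", now=500.0)    # node-1 stays silent
print(rm.expire(now=601.0))          # ['node-1']
print(rm.notifications)
```

In this run only node-1 exceeds the time-out. Both AMs are told about the lost node, but only app-A has a container to rerun.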
If the fault causing the time-out is transient, the NodeManager will resynchronize with the ResourceManager...