Fault-tolerance components and mechanism in Apex
In Chapter 2, Getting Started with Application Development, we looked at the deployment of an Apex application when it is executing on a YARN cluster. Let's revisit the diagram to see which type of failures may occur and how they are handled by the system:
The client is only required for launching the application; it is not involved in the execution of the DAG on the cluster, and failure of the client node does not affect the pipeline. Since Apex is running on YARN, let's first see how YARN supports resilient applications (from a user's perspective).
YARN consists of a resource manager (RM) and node managers (NM). Each YARN cluster node has a node manager service running, which communicates with the resource manager.
- Failure of an NM: When a node manager fails (regardless of software or hardware failure), the RM will detect this and ensure that the affected containers can be allocated on different machines. The RM itself cannot recover those...