As discussed in Chapter 2, Getting Started with Application Development, Apex applications run on a cluster as a distributed set of processes. Distribution enables scalability, and with the correct architecture, adding more resources to a compute cluster (such as YARN) allows applications to scale horizontally (refer to Chapter 4, Scalability, Low Latency, and Performance). At the same time, a growing number of processes and machines also increases the likelihood of failure. Hardware or software failures cannot be avoided.
To prevent failures from resulting in downtime or incorrect results, the system has to be resilient. Fault-tolerance mechanisms should cover high availability (HA) and also provide the user with processing correctness guarantees. For a production-quality, business-critical system, these aspects are important...