Failure in distributed systems is inevitable. Instead of trying to prevent failure entirely, we want to design systems that are capable of self-repair. To accomplish this, it is essential to have a good strategy for clients to follow when initiating retries. A service may become temporarily unavailable or experience a problem that requires manual response from an on-call engineer. In either scenario, clients should be able to queue and then retry requests to be given the best chance of success.Â
Retrying endlessly in the event of an error is not an effective tactic. Imagine a service starts to experience a higher-than-normal failure rate, perhaps even failing 100% of requests. If clients all continuously enqueue retries without ever giving up, you'll end up with a thundering-herd problem—clients continuously retrying...