Using circuit breakers
Failures in distributed systems can be difficult to debug. A symptom (a spike in latency or a high error rate) can appear far away from its underlying cause (a slow database query, or garbage collection cycles that slow down a service's processing of requests). Sometimes a failure in a small part of the system results in a complete outage, especially when components of the system are having difficulty handling increases in load.
Whenever possible, we want to prevent failures in one part of a system from cascading to other parts and causing widespread, hard-to-debug production issues. Furthermore, if a failure is temporary, we'd like our system to be able to repair itself once the failure is over. If a specific service is experiencing problems because of a temporary spike in load, we should design our system so that it stops sending requests to the unhealthy service, giving the service time to recover before traffic to it resumes.
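To make the idea of holding back requests from an unhealthy service concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production implementation: the names CircuitBreaker, CircuitOpenError, max_failures, and reset_timeout are hypothetical, the thresholds are arbitrary, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the service is marked unhealthy."""


class CircuitBreaker:
    # All names and default values here are hypothetical, chosen for illustration.
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before tripping
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means requests flow normally

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still inside the recovery window: do not contact the service.
                raise CircuitOpenError("service marked unhealthy; request skipped")
            # Recovery window elapsed: let this one call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip on reaching the threshold, or re-trip if the trial call failed.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: the service looks healthy again, so resume normal traffic.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outgoing request, for example breaker.call(fetch_profile, user_id) where fetch_profile is a placeholder for whatever function talks to the remote service, and treat CircuitOpenError as a signal to fall back or fail fast rather than wait on a struggling dependency.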
Circuit breakers...