Chaos engineering
Chaos engineering is a methodology for building resilient systems by deliberately trying to break them and expose their weaknesses. It is far better to deal with a problem when we expect it than to be surprised by it. Any system can fail, so a well-thought-out plan for managing failure needs to be in place. This plan should let the system recover in a timely manner so that our customers and our leadership can continue to have confidence in our production systems.
A common refrain is that "we learn more from failure than we learn from success". Chaos engineering applies this refrain to computing infrastructure. Instead of waiting for failure to occur, however, it creates failure conditions in a controlled manner to test the resilience of our systems.
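To make "creating failure conditions in a controlled manner" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular chaos engineering tool: the names `with_fault_injection`, `fetch_price`, `fetch_price_with_fallback`, and `run_experiment` are hypothetical, and the "remote dependency" is simulated by a local function. The experiment makes a fixed fraction of calls fail on purpose and counts how many requests the caller still handles gracefully.

```python
import random


class InjectedFault(Exception):
    """Raised by the experiment to simulate a dependency failure."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Hypothetical stand-in for a call to a remote dependency."""
    return {"widget": 9.99}[item]


def fetch_price_with_fallback(fetch, item, default=None):
    """Caller under test: it should degrade gracefully when fetch fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default


def run_experiment(calls=1000, failure_rate=0.2, seed=42):
    """Run a bounded, repeatable experiment and summarize the outcome."""
    rng = random.Random(seed)  # seeded so the experiment can be replayed
    flaky_fetch = with_fault_injection(fetch_price, failure_rate, rng)
    results = [
        fetch_price_with_fallback(flaky_fetch, "widget") for _ in range(calls)
    ]
    degraded = results.count(None)
    return {"calls": calls, "served": calls - degraded, "degraded": degraded}


if __name__ == "__main__":
    print(run_experiment())
```

The failure rate, the number of calls, and the seed bound the experiment, which is what keeps it controlled: the same run can be repeated after a fix to check whether the system's behavior under failure has improved.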
Systemic weaknesses can take many forms. Here are some examples:
- Insufficient or non-existent fallback mechanisms for when a service fails.
- Retry storms result from an outage and...