Chaos engineering is a technique for evaluating systems for fragility and for building constructs that help a system survive chaos. Instead of waiting for things to break at the worst possible time, chaos engineering proactively injects failures in order to gauge how the system behaves in these scenarios. Disaster striking is thus not a once-in-a-blue-moon event; it happens every day! The aim is to identify weaknesses before they manifest as surprising, aberrant behavior. These weaknesses could be things such as the following:
- Improper fallback settings (see the Dependency resilience section)
- Retry thundering herds from incorrectly set timeouts
- Dependencies that are not resilient
- Single points of failure
- Cascading failures
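
To make the idea of proactively injecting failures concrete, here is a minimal sketch in Python. It is not taken from any particular chaos-engineering tool: `with_chaos`, `fetch_price`, `price_with_fallback`, and `run_experiment` are hypothetical names invented for this illustration, and the 20% failure rate is an arbitrary assumption. The wrapper makes a fraction of calls to a dependency fail on purpose, and the experiment counts how often the caller's fallback absorbed the failure versus how often an error escaped.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper in place of a real dependency failure."""


def with_chaos(call, failure_rate, rng):
    """Wrap `call` so that roughly `failure_rate` of invocations fail on purpose."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_price(item):
    """Hypothetical dependency; stands in for a remote service call."""
    return {"book": 12.5}[item]


def price_with_fallback(fetch, item, default=0.0):
    """Caller under test: returns (value, used_fallback).

    A real caller would catch its client's own error types; catching
    InjectedFault here is an assumption made to keep the sketch small.
    """
    try:
        return fetch(item), False
    except InjectedFault:
        return default, True


def run_experiment(trials=1000, failure_rate=0.2, seed=42):
    """Call the dependency many times under injected faults and tally outcomes."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    flaky_fetch = with_chaos(fetch_price, failure_rate, rng)
    fallbacks = 0
    unhandled = 0
    for _ in range(trials):
        try:
            _, used_fallback = price_with_fallback(flaky_fetch, "book")
            fallbacks += used_fallback
        except Exception:
            # An error escaped the caller: a weakness worth investigating.
            unhandled += 1
    return {"trials": trials, "fallbacks": fallbacks, "unhandled": unhandled}


if __name__ == "__main__":
    print(run_experiment())
```

In this sketch, a non-zero `unhandled` count would point to a missing or improper fallback, while the `fallbacks` count shows how much traffic was served in degraded mode; both are the kind of signal that telemetry would surface in a real experiment.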
Once identified, with proper telemetry in place, these weaknesses can be fixed before they bring customers...