Responding and recovering when disaster strikes
The ability to recover quickly is a core feature of a DevOps approach. One of the key DORA metrics is Time to Restore Service, and elite-performing DevOps organizations are able to restore service in minutes. Preparing for recovery is a major part of the CALMR approach.
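To make the Time to Restore Service metric concrete, the following is a minimal sketch of how it might be computed from an incident log. The incident records, the function name time_to_restore, and the choice of the median as the summary statistic are all assumptions made for illustration; they are not taken from DORA's tooling or from any particular monitoring product.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical incident log: (detected_at, restored_at) pairs.
# Real data would come from an incident management or monitoring system.
INCIDENTS = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 20)),
    (datetime(2024, 3, 8, 14, 5), datetime(2024, 3, 8, 14, 50)),
    (datetime(2024, 3, 20, 22, 30), datetime(2024, 3, 20, 23, 40)),
]


def time_to_restore(incidents):
    """Return the median time between detection and restoration.

    The median is an assumption of this sketch; a team could equally
    report the mean or a percentile.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    seconds = [
        (restored - detected).total_seconds()
        for detected, restored in incidents
    ]
    return timedelta(seconds=median(seconds))


if __name__ == "__main__":
    print("Time to restore service:", time_to_restore(INCIDENTS))
```

Tracking this value over time shows whether the recovery practices a team adopts are actually shortening outages.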
To facilitate recovery, we look at the following practices:
- Proactive detection
- Cross-team collaboration
- Chaos engineering
- Session replay
- Rollback and fix forward
- Immutable infrastructure
A proactive response is important in production because this is the environment our end users work in. Problems here are visible and affect our customers, and problems that are not handled immediately can disrupt work in other parts of the Continuous Delivery Pipeline.
Let’s examine the practices that allow us to be proactive in the production environment.
Proactive detection
Because we are using feature flags to separate deployment from release, we...