Adapting to Failure
After you have designed your solution to withstand failure, you need to test that it’s actually as resilient as you expect. The following subsections discuss the various ways in which you can test your solutions.
Using Playbooks to Investigate Failures
Whenever a failure occurs, you want to react consistently and remediate promptly. Playbooks make this possible: each one lists the steps to follow to identify the issue fully and address it effectively.
When an undocumented failure scenario occurs, focus on addressing the issue first. After the fire has been put out, come back to the playbook and update it by adding the new scenario along with the exact steps that you took to address the issue.
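The lookup-then-update workflow described above can be sketched in code. The following is a minimal illustration, not a reference to any particular tool: the Playbook and Scenario classes, the scenario names, and the remediation steps are all hypothetical and exist only to show how documented scenarios are consulted and how a new one is recorded after the incident has been resolved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Scenario:
    """One documented failure scenario (hypothetical structure)."""
    symptom: str
    steps: List[str] = field(default_factory=list)


class Playbook:
    """In-memory playbook: maps a scenario name to its remediation steps."""

    def __init__(self) -> None:
        self._scenarios: Dict[str, Scenario] = {}

    def lookup(self, name: str) -> Optional[Scenario]:
        # Returns None when the failure scenario is not yet documented.
        return self._scenarios.get(name)

    def record(self, name: str, symptom: str, steps: List[str]) -> None:
        # Adds a new scenario, or overwrites an existing one with the
        # exact steps that were taken to resolve the incident.
        self._scenarios[name] = Scenario(symptom, list(steps))


playbook = Playbook()
playbook.record(
    "db-connections-exhausted",  # hypothetical scenario name
    "Application returns errors and the database rejects new connections",
    ["Check the connection count", "Restart the stuck workers"],
)

# An undocumented failure occurs: there is nothing to follow yet,
# so the issue is addressed first and the steps are noted along the way.
incident = "disk-full"  # hypothetical scenario name
steps_taken: List[str] = []
if playbook.lookup(incident) is None:
    steps_taken = ["Identify the largest directories", "Rotate and compress old logs"]

# After the incident is resolved, the playbook is updated.
playbook.record(incident, "Writes fail because the volume is full", steps_taken)
```

In practice a playbook usually lives in a wiki or a version-controlled document rather than in code; the sketch only makes the order of operations explicit: consult the playbook, resolve the issue, then record what worked.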
Performing Post-Incident Analysis
Post-incident analysis, also known as post-mortem analysis, is an essential part of failure management. It can help you prevent a recurrence of the failure. The...