Recovering from Production Failures
We live in an imperfect world. First, bugs escape into our production environment. Then, as we start moving to DevOps practices, we may find gaps in our understanding that affect how we deliver to production. Even after we close those gaps, we may encounter other problems that are outside our control. What can we possibly do?
In this chapter, we will examine how to mitigate and deal with failures that happen in production environments. We will look at the following topics:
- The costs of errors in production environments
- Preventing as many errors as we can
- Practicing for failures using chaos engineering
- Resolving incidents in production with an incident management process
- Fixing production failures by rolling back or fixing forward