Handling production incidents better
Every on-call engineer’s nightmare is getting a call in the middle of the night about production systems being down. Production incidents are common at every company, be it a two-engineer start-up or a 20,000-engineer big tech organization such as Google. An incident is an event that disrupts a service or degrades its performance for end users. In the Integration APM tools section, we learned how to use APM tools to identify degraded performance and how to use uptime monitoring tools to identify any downtime and alert stakeholders. Now let us learn how to work during an incident and how to manage it better.
The first job of an on-call engineer is to make sure they don’t panic. Contrary to common opinion, I have observed that engineers who do not treat incidents as a “do or die” situation handle them much better. It is very difficult to have a generic approach to solving production...