Analyzing past postmortems
Once everything is said and done, it is worth going back to review past postmortems. Once a quarter, or once a year, collect all of the postmortems and try to pull together some metrics. These metrics can give you insight into how your team responds to incidents:
- Time to recovery
- Time between failures
- Number of alerts fired versus postmortems generated
- Number of alerts fired per on-call rotation
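As a rough illustration, the metrics listed above could be computed from postmortem records along the following lines. This is only a sketch under stated assumptions: the `Incident` record, its field names, the rotation labels, and the sample data are all hypothetical, and the time between failures is measured here from the recovery of one incident to the start of the next, which is one of several conventions in use.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical record extracted from one postmortem."""
    started: datetime    # when the outage began
    recovered: datetime  # when service was restored, not when the cause was fixed


def mean_time_to_recovery(incidents):
    """Average of (recovered - started) across all incidents."""
    seconds = [(i.recovered - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_between_failures(incidents):
    """Average gap from one recovery to the start of the next incident."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [
        (later.started - earlier.recovered).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return timedelta(seconds=mean(gaps))


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2023, 1, 10, 9, 0), datetime(2023, 1, 10, 9, 30)),
    Incident(datetime(2023, 2, 9, 9, 30), datetime(2023, 2, 9, 10, 30)),
    Incident(datetime(2023, 3, 11, 10, 30), datetime(2023, 3, 11, 12, 0)),
]

# One hypothetical rotation label per alert that fired.
alert_rotations = ["week-01", "week-01", "week-02", "week-03", "week-03", "week-03"]

mttr = mean_time_to_recovery(incidents)
mtbf = mean_time_between_failures(incidents)
alerts_per_postmortem = len(alert_rotations) / len(incidents)
alerts_per_rotation = Counter(alert_rotations)

print("MTTR:", mttr)
print("MTBF:", mtbf)
print("Alerts per postmortem:", alerts_per_postmortem)
print("Alerts per rotation:", dict(alerts_per_rotation))
```

Keeping the recovery timestamp separate from any later "cause fixed" date matters here, because time to recovery is measured up to the point service was restored.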
MTTR and MTBF
Outside of individual incidents, two metrics that are often discussed are mean time to recovery (MTTR) and mean time between failures (MTBF). Looking at these numbers across a year can show how your ability to respond to incidents is improving or changing. Note that the goal is to minimize the time until recovery, not necessarily the time until the cause of the outage is fixed. If MTBF is low, failures are happening frequently, which might mean that your team is not investing enough in testing; frequent incidents are probably also draining your team. If MTTR is high, it probably means...