We have accepted that to err is human and that our bounded isolated components will inevitably experience failures. Rather than trying to prevent every failure, we focus our energies on minimizing the mean time to recovery. We have instrumented our components to be highly observable, and we have strategically created synthetic transactions that continuously generate traffic through the system so that we can observe the behavior of its key performance indicators. We have created alerts on those indicators so that we can jump into action as soon as a problem is detected. From here, we need a method for investigating a problem and diagnosing its root cause that focuses our attention and lets us recover as quickly as possible.
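To make the mean time to recovery concrete, here is a minimal sketch of how it might be computed from incident records. The function name `mean_time_to_recovery`, the record shape (pairs of detection and recovery timestamps), and the sample incidents are all hypothetical and chosen only for illustration; a real team would pull these timestamps from its alerting or incident-tracking tool.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average the time between detection and recovery.

    `incidents` is a list of (detected_at, recovered_at) datetime pairs.
    This record shape is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: recovery took 30, 10, and 20 minutes respectively.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 10)),
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 14, 20)),
]

print(mean_time_to_recovery(incidents))  # 0:20:00
```

Measuring from the moment of detection, as this sketch does, reflects only the time spent investigating and recovering; measuring from the moment the failure began would also capture how quickly the alerts fired.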
Teams should create a dashboard for each component in advance. A dashboard should display all the work metrics for a component and the metrics for...