Tracking system events
Collecting and organizing our metrics is just the first step. Now we need to start using this data to see what we can learn about our services and how they perform in the real world.
We naturally find ourselves creating dashboards to visualize this data. We should have a high-level dashboard that gives us the status of the subsystem at a glance and a low-level dashboard that allows us to drill into the details. We should be able to filter these dashboards by tags such as account, region, stage, service, and function. These dashboards will be invaluable when we need to do root cause analysis.
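To make the idea of filtering by tags concrete, here is a minimal sketch of what a dashboard query does behind the scenes. It is only an illustration: the DataPoint type, the filter_by_tags and summarize helpers, the duration_ms metric, and the sample values are all hypothetical and do not come from any particular monitoring product. In practice, our monitoring tool performs this filtering and aggregation for us.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DataPoint:
    """One metric sample plus the tags used to slice it (hypothetical shape)."""
    name: str
    value: float
    tags: dict


def filter_by_tags(points, **wanted):
    """Keep only the points whose tags match every requested key/value pair."""
    return [
        p for p in points
        if all(p.tags.get(key) == val for key, val in wanted.items())
    ]


def summarize(points, group_by):
    """Average the values for each distinct value of a single tag."""
    groups = {}
    for p in points:
        groups.setdefault(p.tags.get(group_by), []).append(p.value)
    return {key: mean(values) for key, values in groups.items()}


# Made-up samples for illustration only.
points = [
    DataPoint("duration_ms", 120.0, {"region": "us-east-1", "stage": "prod", "service": "orders"}),
    DataPoint("duration_ms", 80.0, {"region": "us-east-1", "stage": "prod", "service": "orders"}),
    DataPoint("duration_ms", 300.0, {"region": "us-west-2", "stage": "prod", "service": "billing"}),
    DataPoint("duration_ms", 50.0, {"region": "us-east-1", "stage": "dev", "service": "orders"}),
]

# High-level view: one averaged number per service in production.
prod_by_service = summarize(filter_by_tags(points, stage="prod"), "service")

# Low-level view: drill into the raw samples for one region and stage.
prod_east = filter_by_tags(points, stage="prod", region="us-east-1")
```

The same tagged data feeds both views. The high-level dashboard groups and averages, while the low-level dashboard narrows the filter until we are looking at individual samples.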
But we cannot keep our eyes on these dashboards all the time. We need the system to watch the metrics for us, record anything interesting, send us early warnings, and page us when it matters. In other words, to fail forward fast, we need to monitor our subsystem’s resource metrics for various conditions. Here are several examples of...