Proactively working with metrics
As we've seen, metrics give an aggregated view of the status of the whole cluster. They allow you to detect trending problems, but they make it difficult to pinpoint a single spurious error.
This shouldn't stop us from considering them a critical tool for successful monitoring, because they can tell us whether the whole system is healthy. In some companies, the most critical metrics are on permanent display on screens so the operations team can see them and react quickly to any sudden problem.
Deciding which metrics are the key ones for a service is not as straightforward as it seems; it requires time and experience, perhaps even trial and error.
There are, though, four metrics for online services that are always considered important. They are:
- Latency: How many milliseconds the system takes to respond to a request. Depending on the service, seconds are sometimes used instead...