Metrics give an aggregated view of the status of the whole cluster. They allow us to detect trending problems, but they make it difficult to find a single spurious error.
Don't underestimate them, though. They are critical for successful monitoring because they tell us whether the system is healthy. In some companies, the most critical metrics are prominently displayed on wall-mounted screens so that the operations team can see them at all times and react quickly.
Finding the proper balance of metrics for a system is not a straightforward task; it takes time and some trial and error. There are four metrics for online services that are always important, though. These are as follows:
- Latency: How many milliseconds the system takes to respond to a request.
Depending on the typical response times, a different time unit, such as seconds or microseconds, can be used. From...