Achieving high availability (HA) in Prometheus
Your monitoring environment needs to be one of your most resilient services. People joke that there’s no such thing as 100% uptime, but your monitoring environment should come pretty darn close. After all, it’s what you depend on to tell you when your other services aren’t meeting their 99.9% uptime goal.
Thus far, we’ve only run Prometheus as a single point of failure. If that one Prometheus instance goes down, all of its metrics and alerts go down with it. This gap in visibility and alerting is unacceptable. So, what can we do about it, given that Prometheus doesn’t have built-in HA the way Alertmanager does? The answer? Duplicate it.
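As a rough illustration of what duplication can look like, here is a minimal sketch of a prometheus.yml that could be deployed unchanged to two separate Prometheus servers. The job name, target hostnames, and Alertmanager address are all hypothetical placeholders, and the scrape interval is an arbitrary choice. Because both instances scrape the same targets on their own, either one can go down without losing visibility.

```yaml
# prometheus.yml: the same file is deployed to both Prometheus instances.
# All hostnames and job names below are hypothetical placeholders.
global:
  scrape_interval: 30s          # arbitrary value chosen for illustration

scrape_configs:
  - job_name: node              # hypothetical job
    static_configs:
      - targets:
          - app-server-1:9100   # hypothetical targets
          - app-server-2:9100

# Both instances evaluate the same rules and send alerts to the same place.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093   # hypothetical Alertmanager address
```

Since both instances would fire the same alerts with the same labels, Alertmanager can deduplicate them, so you should not receive every notification twice.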
Who watches the watchmen?
With an HA Prometheus setup, you can (and should) configure your Prometheus instances so that they monitor each other. Presuming they’re not running on the same physical hardware, unexpected failures should be isolated and you can be alerted to...