Preface
Since the Prometheus project was first announced to the world in January 2015, it has rapidly become the de facto modern monitoring solution. Open source projects such as Kubernetes expose Prometheus metrics by default, cloud providers sell “managed” Prometheus services, and it even has its own yearly conference. However, in my personal journey to learn and understand Prometheus more deeply, I came to a saddening realization: there is a plethora of blog posts, books, and tutorials focused on the basics of Prometheus, but few, if any, readily available resources that cover running Prometheus at scale.
For that information, I found myself having to turn to conference talks or to extrapolate how others do it by reading through GitHub issues on the Prometheus repository, questions on the Prometheus mailing list, and conversations in the official Prometheus Slack channel. Those learnings – coupled with years of personal experience – have gone into this book as an endeavor to begin filling that void.
Perhaps one of the limiting factors of existing content is that it tends to focus purely on Prometheus itself, omitting the larger ecosystem surrounding it. However, to run Prometheus at scale, it quickly becomes necessary to build on top of Prometheus and extend it. Rather than being the destination, Prometheus provides a foundation.
We will still cover Prometheus itself in depth. We will see how to get the most out of it, implement best practices, and – perhaps most critically – develop a deeper understanding of how Prometheus’s internals work.
In addition to Prometheus itself, though, we’ll also look at how to operate and scale Prometheus. We’ll learn how to debug Prometheus using Go’s developer tools, how to manage hundreds (or thousands) of Prometheus rules without losing your mind, how to connect Prometheus to remote storage solutions, how to run dozens of highly available Prometheus instances while maintaining a global query view, and much, much more.