Introducing Observability and the Grafana Stack
The modern computer systems we work with have moved from the realm of the complicated into the realm of the complex, where the number of interacting variables makes them ultimately unknowable and uncontrollable. We use the terms complicated and complex as they are defined in systems theory. A complicated system, like an engine, has clear causal relationships between components. A complex system, such as the flow of traffic in a city, shows emergent behavior that arises from the interactions of its components.
With the average cost of downtime estimated at $9,000 per minute by the Ponemon Institute in 2016, this complexity can cause significant financial loss if organizations do not take steps to manage the risk. Observability offers a way to mitigate these risks, but making systems observable carries its own financial risks if it is implemented poorly or without a clear business goal.
In this book, we will give you a good understanding of what observability is and who its customers are. We will explore how to use the tools available from Grafana Labs to gain visibility into your organization. These tools include Loki, Prometheus, Mimir, Tempo, Frontend Observability, Pyroscope, and k6. You will learn how to use Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to obtain clear, transparent signals of when a service is operating correctly, and how to use the Grafana incident response tools to handle incidents. Finally, you will learn how to manage your observability platform using automation tools such as Ansible, Terraform, and Helm.
This chapter aims to introduce observability to all audiences, using examples from outside the computing world. We’ll introduce the types of telemetry used by observability tools and give you an overview of how to use them to quickly understand the state of your services. We’ll outline the various personas who might use observability systems, so that you can explore complex ideas later with a clear grounding in who will benefit from their correct implementation. Finally, we’ll investigate Grafana’s Loki, Grafana, Tempo, Mimir (LGTM) stack, how to deploy it, and what alternatives exist.
In this chapter, we’re going to cover the following main topics:
- Observability in a nutshell
- Telemetry types and technologies
- Understanding the customers of observability
- Introducing the Grafana stack
- Alternatives to the Grafana stack
- Deploying the Grafana stack