Observability with Grafana: Monitor, control, and visualize your Kubernetes and cloud platforms using the LGTM stack

Rob Chapman, Peter Holmes
Paperback, Jan 2024, 356 pages, 1st Edition

Introducing Observability and the Grafana Stack

The modern computer systems we work with have moved from the realm of complicated into the realm of complex, where the number of interacting variables makes them ultimately unknowable and uncontrollable. We are using the terms complicated and complex as they are used in systems theory. A complicated system, like an engine, has clear causal relationships between components. A complex system, such as the flow of traffic in a city, shows emergent behavior from the interactions of its components.

With the average cost of downtime estimated to be $9,000 per minute by Ponemon Institute in 2016, this complexity can cause significant financial loss if organizations do not take steps to manage this risk. Observability offers a way to mitigate these risks, but making systems observable comes with its own financial risks if implemented poorly or without a clear business goal.

In this book, we will give you a good understanding of what observability is and who its customers are. We will explore how to use the tools available from Grafana Labs to gain visibility of your organization. These tools include Loki, Prometheus, Mimir, Tempo, Frontend Observability, Pyroscope, and k6. You will learn how to use Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to obtain clear, transparent signals of when a service is operating correctly, and how to use the Grafana incident response tools to handle incidents. Finally, you will learn about managing your observability platform using automation tools such as Ansible, Terraform, and Helm.

This chapter aims to introduce observability to all audiences, using examples outside of the computing world. We’ll introduce the types of telemetry used by observability tools, which will give you an overview of how to use them to quickly understand the state of your services. The various personas who might use observability systems will be outlined so that you can explore complex ideas later with a clear grounding on who will benefit from their correct implementation. Finally, we’ll investigate Grafana’s Loki, Grafana, Tempo, Mimir (LGTM) stack, how to deploy it, and what alternatives exist.

In this chapter, we’re going to cover the following main topics:

  • Observability in a nutshell
  • Telemetry types and technologies
  • Understanding the customers of observability
  • Introducing the Grafana stack
  • Alternatives to the Grafana stack
  • Deploying the Grafana stack

Observability in a nutshell

The term observability is borrowed from control theory. It’s common to use the term interchangeably with the term monitoring in IT systems, as the concepts are closely related. Monitoring is the ability to raise an alarm when something is wrong, while observability is the ability to understand a system and determine whether something is wrong, and why.

Control theory was formalized in the late 1800s through the study of centrifugal governors in steam engines. This diagram shows a simplified view of such a system:

Figure 1.1 – James Watt’s steam engine flyweight governor (source: https://www.mpoweruk.com)

Steam engines use a boiler to boil water in a pressure vessel. The steam pushes a piston backward and forward, which converts heat energy into reciprocating motion. In steam engines that use centrifugal governors, this reciprocating motion is converted into rotational motion via a wheel connected to the piston. The centrifugal governor provides a physical link backward through the system to the throttle. This means that the speed of rotation controls the throttle, which, in turn, controls the speed of rotation. Physically, this is observed by the balls on the governor flying outward and dropping inward until the system reaches equilibrium.
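To make the feedback loop concrete, here is a minimal Python sketch of a governor settling into equilibrium. It is a toy model, not a physical simulation: the function name `simulate_governor`, the linear throttle response, and every constant (`drive`, `friction`, `full_close_speed`) are assumptions chosen only to show the speed and the throttle regulating each other.

```python
def simulate_governor(steps=100, dt=0.1, drive=100.0, friction=1.0,
                      full_close_speed=100.0):
    """Return the shaft speed after each step of a toy governor feedback loop."""
    speed = 0.0
    history = []
    for _ in range(steps):
        # The faster the shaft spins, the further the governor closes the throttle.
        throttle = min(1.0, max(0.0, 1.0 - speed / full_close_speed))
        # Steam through the throttle speeds the shaft up; friction slows it down.
        speed += dt * (drive * throttle - friction * speed)
        history.append(speed)
    return history


history = simulate_governor()
print(round(history[0], 2), round(history[9], 2), round(history[-1], 2))
```

With these made-up constants, the speed climbs quickly at first and then levels off at the point where the steam admitted by the throttle exactly balances friction.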

Monitoring defines the metrics or events that are of interest in advance. For instance, the governor measures the pre-defined metric of drive shaft revolutions. The controllability of the throttle is then provided by the pivot and actuator rod assembly. Assuming the actuator rod is adjusted correctly, the governor should control the throttle from fully open to fully closed.

In contrast, observability is achieved by allowing the internal state of the system to be inferred from its external outputs. If the operating point adjustment is incorrectly set, the governor may spin too fast or too slowly, rendering the throttle control ineffective. A governor spinning too fast or too slowly could also indicate that the sliding ring is stuck in place and needs oiling. Importantly, this insight can be gained without defining in advance what too fast or too slow means. The insight that the governor is spinning too fast or too slowly also needs very little knowledge of the full steam engine.

Fundamentally, both monitoring and observability are used to improve the reliability and performance of the system in question.

Now that we have introduced the high-level concepts, let’s explore a practical example outside of the world of software services.

Case study – A ship passing through the Panama Canal

Let’s imagine a ship traversing the Agua Clara locks on the Panama Canal. This can be illustrated using the following figure:

Figure 1.2 – The Agua Clara locks on the Panama Canal

There are a few aspects of these locks that we might want to monitor:

  • The successful opening and closing of each gate
  • The water level inside each lock
  • How long it takes for a ship to traverse the locks

Monitoring these aspects may highlight situations that we need to be alerted about:

  • A gate is stuck open because of a mechanical failure
  • The water level is rapidly descending because of a leak
  • A ship is taking too long to exit the locks because it is stuck

There may be situations where the data we are monitoring are within acceptable limits, but we can still observe a deviation from what is considered normal, which should prompt further action:

  • A small leak has formed near the top of the lock wall:
    • We would see the water level drop but only when it is above the leak
    • This could prompt maintenance work on the lock wall
  • A gate in one lock is opening more slowly because it needs maintenance:
    • We would see the time between opening and closing the gate increase
    • This could prompt maintenance on the lock gate
  • Ships take longer to traverse the locks when the wind is coming from a particular direction:
    • We could compare hourly average traversal rates
    • This could prompt work to reduce the impact of wind from one direction
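The difference between a fixed alert and an observed deviation can be sketched in a few lines of Python. This is a minimal illustration using the slow gate example: the gate timings, the `ALERT_THRESHOLD_SECONDS` limit, the `drift_ratio`, and the function name `check_gate` are all hypothetical values invented for this sketch.

```python
from statistics import mean

ALERT_THRESHOLD_SECONDS = 300  # hypothetical hard limit that would raise an alert


def check_gate(durations, baseline_count=5, recent_count=5, drift_ratio=1.2):
    """Compare recent gate-opening times with an earlier baseline."""
    # Monitoring: anything over the pre-defined limit raises an alert.
    alerts = [d for d in durations if d > ALERT_THRESHOLD_SECONDS]
    # Observing: is the recent average drifting away from what was normal?
    baseline = mean(durations[:baseline_count])
    recent = mean(durations[-recent_count:])
    return {
        "alerts": alerts,
        "baseline": baseline,
        "recent": recent,
        "drifting": recent > baseline * drift_ratio,
    }


# Invented gate-opening times in seconds, oldest first.
durations = [205, 210, 208, 207, 210, 230, 245, 250, 262, 270]
result = check_gate(durations)
print(result)
```

No single reading crosses the alert limit, yet the recent average has moved far enough from the baseline to justify scheduling maintenance.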

Now that we’ve seen an example of measuring a real-world system, we can group these types of measurements into different data types to best suit the application. Let’s introduce those now.

Telemetry types and technologies

The boring but important part of observability tools is telemetry – capturing data that is useful, shipping it from place to place, and producing visualizations, alerts, and reports that offer value to the organization.

Three main types of telemetry are used to build monitoring and observability systems – metrics, logs, and distributed traces. Other telemetry types may be used by some vendors and in particular circumstances. We will touch on these here, but they will be explored in more detail in Chapters 12 and 13 of this book.

Metrics

Metrics can be thought of as numeric data that is recorded at a point in time and enriched with labels or dimensions to enable analysis. Metrics are frequently generated and are easy to search, making them ideal for determining whether something is wrong or unusual. Let’s look at an example of metrics showing temporal changes:

Figure 1.3 – Metrics showing changes over time

Taking our example of the Panama Canal, we could represent the water level in each lock as a metric, to be measured at regular intervals. To be able to use the data effectively, we might add some of these labels:

  • The lock name: Agua Clara
  • The lock chamber: Lower lock
  • The canal: Panama Canal
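As a rough illustration, a single water level sample with these labels could be modelled as follows. This is a minimal Python sketch: the `MetricSample` class, the metric name `lock_water_level_metres`, the label keys, and the value and timestamp are hypothetical, and the rendered text only loosely resembles the Prometheus text exposition format.

```python
from dataclasses import dataclass, field


@dataclass
class MetricSample:
    """One numeric measurement taken at a point in time, plus its labels."""
    name: str
    value: float
    timestamp: int  # Unix time in seconds
    labels: dict = field(default_factory=dict)

    def render(self) -> str:
        # Sort the labels so the same sample always renders the same way.
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{label_text}}} {self.value} {self.timestamp}"


# The value and timestamp are invented for illustration.
sample = MetricSample(
    name="lock_water_level_metres",
    value=26.5,
    timestamp=1466973061,
    labels={"canal": "Panama Canal", "lock": "Agua Clara", "chamber": "Lower lock"},
)
print(sample.render())
```

Because the labels travel with every sample, the same metric name can be filtered or grouped by lock, chamber, or canal at query time.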

Logs

Logs are considered to be unstructured string data types. They are recorded at a point in time and usually contain a huge amount of information about what is happening. While logs can be structured, there is no guarantee of that structure persisting, because the log producer has control over the structure of the log. Let’s look at an example:

Jun 26 2016 20:31:01 pc-ac-g1 gate-events no obstructions seen
Jun 26 2016 20:32:01 pc-ac-g1 gate-events starting motors
Jun 26 2016 20:32:30 pc-ac-g1 gate-events motors engaged successfully
Jun 26 2016 20:35:30 pc-ac-g1 gate-events stopping motors
Jun 26 2016 20:35:30 pc-ac-g1 gate-events gate open complete

In our example, the various operations involved in opening or closing a lock gate could be represented as logs.
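Because these example lines follow a fixed layout, they can be split into fields and used to answer a simple question, such as how long the gate took to open. The following is a minimal Python sketch: the field names `host` and `source` are our own interpretation of the fifth and sixth tokens, and the helper names are hypothetical.

```python
from datetime import datetime

RAW_LOGS = """\
Jun 26 2016 20:31:01 pc-ac-g1 gate-events no obstructions seen
Jun 26 2016 20:32:01 pc-ac-g1 gate-events starting motors
Jun 26 2016 20:32:30 pc-ac-g1 gate-events motors engaged successfully
Jun 26 2016 20:35:30 pc-ac-g1 gate-events stopping motors
Jun 26 2016 20:35:30 pc-ac-g1 gate-events gate open complete
"""


def parse_line(line):
    """Split one log line into a timestamp, two identifiers, and a free-text message."""
    parts = line.split(maxsplit=6)
    timestamp = datetime.strptime(" ".join(parts[:4]), "%b %d %Y %H:%M:%S")
    return {"timestamp": timestamp, "host": parts[4],
            "source": parts[5], "message": parts[6]}


def first_time(entries, message):
    """Return the timestamp of the first entry with exactly this message."""
    return next(e["timestamp"] for e in entries if e["message"] == message)


entries = [parse_line(line) for line in RAW_LOGS.splitlines() if line.strip()]
opening_seconds = (first_time(entries, "gate open complete")
                   - first_time(entries, "starting motors")).total_seconds()
print(opening_seconds)
```

Note that the parser only works while the producer keeps this exact layout, which is the fragility described above.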

Almost every system produces logs, and they often give very detailed information. This is great for understanding what happened. However, the volume of data presents two problems:

  • Searching can be inefficient and slow.
  • As the data is in text format, knowing what to search for can be difficult. For example, "error occurred", "process failed", and "action did not complete successfully" could all be used to describe a failure, but there are no shared strings to search for.
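The second problem is easy to demonstrate. In this small Python sketch, a naive search for one word finds only one of the three failure messages, and catching all three requires knowing every phrasing in advance. The message list and the pattern are illustrative assumptions, not a recommended search strategy.

```python
import re

messages = [
    "error occurred",
    "process failed",
    "action did not complete successfully",
    "gate open complete",
]

# A naive search for a single word misses two of the three failures.
naive_hits = [m for m in messages if "error" in m]

# Catching all of them means enumerating each phrasing by hand.
FAILURE_PATTERN = re.compile(r"error|failed|did not complete")
pattern_hits = [m for m in messages if FAILURE_PATTERN.search(m)]

print(naive_hits)
print(pattern_hits)
```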

Let’s consider a real log entry from a computer system to see how log data is usually represented:

Figure 1.4 – Logs showing discrete events in time

We can clearly see that we have a number of fields that have been extracted from the log entry by the system. These fields detail where the log entry originated from, what time it occurred, and various other items.

Distributed traces

Distributed traces show the end-to-end journey of an action. They are captured from every step that is taken to complete the action. Let’s imagine a trace that covers the passage of a ship through the lock system. We will be interested in the time a ship enters and leaves each lock, and we will want to be able to compare different ships using the system. A full passage can be given an identifier, usually called a trace ID. Traces are made up of spans. In our example, a span would cover the entry and exit for each individual lock. These spans are given a second identifier, called a span ID. To tie these two together, each span in a trace references the trace ID for the whole trace. The following screenshot shows an example of how a distributed trace is represented for a computer application:

Figure 1.5 – Traces showing the relationship of actions over time

Now that we have introduced metrics, logs, and traces, let’s consider a more detailed example of a ship passing through the locks, and how each telemetry type would be produced in this process:

  1. Ship enters the first lock:
    • Span ID created
    • Trace ID created
    • Contextual information is added to the span, for example, a ship identification
    • Key events are recorded in the span with time stamps, for example, gates are opened and closed
  2. Ship exits the first lock:
    • Span closed and submitted to the recording system
    • Second lock notified of trace ID and span ID
  3. Ship enters the second lock:
    • Span ID created
    • Trace ID added to span
    • Contextual information is added to the span
    • Key events recorded in the span with time stamps
  4. Ship exits the second lock:
    • Span closed and submitted to the recording system
    • Third lock notified of trace ID and span ID
  5. Ship enters the third lock:
    • Repeat step 3
  6. Ship exits the third lock:
    • Span closed and submitted to the recording system
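The steps above can be sketched in Python as follows. This is a simplified model rather than a real tracing library: the `Span` class, the `enter_lock` and `exit_lock` helpers, the `ship.id` attribute, and the ship identifier are hypothetical, and the ID lengths (16 and 8 random bytes) are an assumption that mirrors the sizes used by the W3C Trace Context format.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    previous_span_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def add_event(self, name):
        # Key events are recorded with a time stamp.
        self.events.append((time.time(), name))


def enter_lock(lock_name, ship_id, context=None):
    """Start a span; `context` is the (trace ID, span ID) sent by the previous lock."""
    if context is None:
        # First lock: no trace exists yet, so create a new trace ID.
        trace_id, previous = secrets.token_hex(16), None
    else:
        trace_id, previous = context
    return Span(trace_id, secrets.token_hex(8), lock_name, previous,
                attributes={"ship.id": ship_id})


def exit_lock(span, recorder):
    """Close the span, submit it, and return the context for the next lock."""
    recorder.append(span)
    return (span.trace_id, span.span_id)


recorder = []  # stands in for the recording system
context = None
for lock in ["first lock", "second lock", "third lock"]:
    span = enter_lock(lock, ship_id="example-ship-1", context=context)
    span.add_event("gates opened")
    span.add_event("gates closed")
    context = exit_lock(span, recorder)
```

After the loop, the recorder holds three spans with three different span IDs that all share one trace ID, which is what allows the full passage to be reassembled later.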

Now let’s look at some other telemetry types.

Other telemetry types

Metrics, logs, and traces are often called the three pillars or the golden triangle of observability. As we outlined earlier, observability is the ability to understand a system. While metrics, logs, and traces give us a very good ability to understand a system, they are not the only signals we might need, as this depends on the abstraction layer at which we need to observe the system. For instance, when looking at a very detailed level, we may be very interested in the stack trace of an application’s activity at the CPU and RAM level. Conversely, if we are interested in the execution of a CI/CD pipeline, we may just be interested in whether a deployment occurred and nothing more.

Profiling data (stack traces) can give us a very detailed technical view of the system’s use of resources such as CPU cycles or memory. With cloud services often charged per hour for these resources, this kind of detailed analysis can easily create cost savings.

Similarly, events can be consumed from a platform, such as a CI/CD system. These can offer a huge amount of insight that can reduce the Mean Time to Recovery (MTTR). Imagine responding to an out-of-hours alert and seeing that a new version of a service was deployed immediately before the issues started occurring. Even better, imagine not having to wake up because the deployment process could check for failures and roll back automatically. Events differ from logs only in that an event represents a whole action. In our earlier example in the Logs section, we created five logs, but all of these referred to stages of the same event (opening the lock gate). Because event is a relatively generic term, it is also used with other meanings elsewhere.
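The out-of-hours scenario can be sketched as a simple lookup of deployment events that landed shortly before an alert. This is a minimal Python illustration: the service names, versions, timestamps, the 30-minute window, and the function name `deployments_before` are all invented for this example.

```python
from datetime import datetime, timedelta

# Hypothetical deployment events consumed from a CI/CD platform.
deployments = [
    {"service": "payments", "version": "2.3.1", "at": datetime(2024, 1, 10, 14, 5)},
    {"service": "checkout", "version": "1.8.0", "at": datetime(2024, 1, 11, 1, 52)},
]


def deployments_before(alert_time, events, window=timedelta(minutes=30)):
    """Return deployment events that occurred within `window` before the alert."""
    return [e for e in events
            if timedelta(0) <= alert_time - e["at"] <= window]


alert_time = datetime(2024, 1, 11, 2, 0)
suspects = deployments_before(alert_time, deployments)
print(suspects)
```

A responder who sees the checkout deployment listed next to the alert has an obvious first thing to investigate or roll back.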

Now that we’ve introduced the fundamental concepts of the technology, let’s talk about the customers who will use observability data.

Introducing the user personas of observers

Observability deals with understanding a system, identifying whether something is wrong with that system, and understanding why it is wrong. But what do we mean by understanding a system? The simple answer would be knowing the state of a single application or infrastructure component.

In this section, we will introduce the user personas that we will use throughout this book. These personas will help to distinguish the different types of questions that people use observability systems to answer.

Let’s take a quick look at the user personas that will be used throughout the book as examples, and their roles:

Name and role | Description
Diego Developer | Frontend, backend, full stack, and so on
Ophelia Operator | SRE, DevOps, DevSecOps, customer success, and so on
Steven Service | Service manager and other tasks
Pelé Product | Product manager, product owner, and so on
Masha Manager | Manager, senior leadership, and so on

Table 1.1 – User persona introductions

Now let’s look at each of these users in greater detail.

Diego Developer

Diego Developer works on many types of systems, from frontend applications that customers directly interact with, to backend systems that let his organization store data in ways that delight its customers. You might even find him working on platforms that other developers use to get their applications integrated, built, delivered, and deployed safely and speedily.

Goals

He writes great software that is well tested and addresses customers’ actual needs.

Interactions

When he is not writing code, he works with Ophelia Operator to address any questions and issues that occur.

Pelé Product works in his team and provides insight into the customer’s needs. They work together closely, taking those needs and turning them into detailed plans on how to deliver software that addresses them.

Steven Service is keen to ensure that the changes Diego makes are not impacting customer commitments. He’s also the one who wakes Diego up if there is an incident that needs attention. The data provided to Masha Manager gives her a breakdown of costs. When Diego is working on developer platforms, he also collects data that helps her get investment from the business into teams that are not performing as expected.

Needs

Diego really needs easy-to-use libraries for the languages he uses to instrument the code he produces. He does not have time to become an expert. He wants to be able to add a few lines of code and get results quickly.

Having a clear standard for acceptable performance measures makes it easy for him to get the right results.

Pain points

When Diego’s systems produce too much data, he finds it difficult to sort signal from noise. He also gets frustrated having to change his code because of an upstream decision to change tooling.

Ophelia Operator

Ophelia Operator works in an operations-focused environment. You might find her in a customer-facing role or as part of a development team as a DevOps engineer. She could be part of a group dedicated to the reliability of an organization’s systems, or she could be working in security or finance to ensure the business runs securely and smoothly.

Goals

Ophelia wants to make sure a product is functioning as expected. She also likes it when she is not woken up early in the morning by an incident.

Interactions

Ophelia will work a lot with Diego Developer; sometimes it’s escalating customer tickets when she doesn’t have the data available to understand the problem; at other times it’s developing runbooks to keep the systems running. Sometimes she will need to give Diego clear information on acceptable performance measures so that her team can make sure systems perform well for customers.

Steven Service works closely with Ophelia. They work together to ensure there are not many incidents, and that they are quickly resolved. Steven makes sure that business data on changes and incidents is tracked, and tweaks processes when things aren’t working.

Pelé Product likes to have data showing the problematic areas of his products.

Needs

Good data is necessary to do the job effectively. Being able to see that a customer has encountered an error can make the difference between resolving a problem straight away or having them wait maybe weeks for a response.

During an incident, seeing that a new version of a service was deployed at the time a problem started can change an hours-long incident into a brief blip, and keep customers happy.

Pain points

Getting continuous alerts but not being empowered to fix the underlying issue is a big problem. Ophelia has seen colleagues burn out, and it makes her want to leave the organization when this happens.

Steven Service

Steven Service works in service delivery. He is interested in making sure the organization’s services are delivered smoothly. Jumping in on critical incidents and coordinating actions to get them resolved as quickly as possible is part of the job. So is ensuring that changes are made using processes that help others do it as safely as possible. Steven also works with third parties who provide services that are critical to the running of the organization.

Goals

He wants services to run as smoothly as possible so that the organization can spend more time focused on customers.

Interactions

Diego Developer and Ophelia Operator work a lot with the change management processes created by Steven and the support processes he manages. Having accurate data to hand during change management really helps to make the process as smooth as possible.

Steven works very closely with Masha Manager to make sure she has access to data showing where processes are working smoothly and where they need to spend time improving them.

Needs

He needs to be able to compare the delivery of different products and provide that data to Masha and the business.

During incidents, he needs to be able to get the right people on the call as quickly as possible and keep a record of what happened for the incident post-mortem.

Pain points

Identifying the right person to get on a call during an incident is a common problem he faces. Seeing incidents drag on while people compare different systems and argue about who can fix the problem is also a big concern to him.

Pelé Product

Pelé Product works in the product team. You’ll find him working with customers to understand their needs, keeping product roadmaps in order, and communicating requirements back to developers such as Diego Developer so they can build them. You might also find him understanding and shaping the product backlog for the internal platforms used by developers in the organization.

Goals

Pelé wants to understand customers, give them products that delight them, and keep them coming back.

Interactions

He spends a lot of time working with Diego, looking at the same information so that they can really understand what customers are doing and how they can help them do it better.

Ophelia Operator and Steven Service help Pelé keep products on track. If too many incidents occur, they ask everyone to refocus on getting stability right. There is no point in providing customers with lots of features on a system that they can’t trust.

Pelé works closely with Masha Manager to ensure the organization has the right skills in the teams that build products. The business depends on her leadership to make sure that these developers have the best tools to help them get their code live in front of customers where it can be used.

Needs

Pelé needs to be able to understand customers’ pain points even when they do not articulate them clearly during user research.

He needs data that gives him a common language with Diego and Ophelia. Sometimes they can get too focused on specific numbers such as shaving off a couple of milliseconds from a request, when improving a poor workflow would improve the customer experience more significantly.

Pain points

Pelé hates not being able to see at a high level what customers are doing. Understanding which bits of an application have the most usage, and which bits are not used at all, lets him know where to focus time and resources.

While customers never tell him they want stability, if it’s not there they will lose trust very quickly and start to look at alternatives.

Masha Manager

Masha works in management. You might find her leading a team and working closely with them daily. She also represents middle management, setting strategy and making tactical choices, and she is involved, to some extent, in senior leadership. Much of her role involves managing budgets and people. If something can make that process easier, then she is usually interested in hearing about it. What Masha does not want to do is waste the organization’s money, because that can directly impact jobs.

Goals

Her primary goals are to keep the organization running smoothly and ensure the budget is balanced.

Interactions

As a leader, Masha needs accurate data and needs to be able to trust the teams who provide that data. The data could be the end-to-end cycle time of feature concept to delivery from Pelé Product, the lead time for changes from Diego Developer, or even the MTTR from Steven Service. Having that data helps her to understand where focus and resources can have the biggest impact.

Masha works regularly with the financial operations staff and needs to make sure they have accurate information on the organization’s expenditure and the value that expenditure provides.

Needs

She needs good data in a place where she can view it and make good decisions. This usually means she consumes information from a business intelligence system. To use such tools effectively, she needs to be clear on what the organization’s goals are, so that the correct data can be collected to help her understand how her teams are tracking to that goal.

She also needs to know that the teams she is responsible for have the correct data and tools to excel in their given areas.

Pain points

High failure rates and long recovery time usually result in her having to speak with customers to apologize. Masha really hates these calls!

Poor visibility of cloud systems is a particular concern. Masha has too many horror stories of huge overspending caused by a lack of monitoring; she would rather spend that budget on something more useful.

You now know about the customers who use observability data, and the types of data you will be using to meet their needs. As the main focus of this book is on Grafana as the underlying technology, let’s now introduce the tools that make up the Grafana stack.

Introducing the Grafana stack

Grafana was born in 2013 when a developer was looking for a new user interface to display metrics from Graphite. Initially forked from Kibana, the Grafana project was developed to make it easy to build quick, interactive dashboards that were valuable to organizations. In 2014, Grafana Labs was formed with the core value of building a sustainable business with a strong commitment to open source projects. From that foundation, Grafana has grown into a strong company supporting more than 1 million active installations. Grafana Labs is a huge contributor to open source projects, from their own tools to widely adopted technologies such as Prometheus, and recent initiatives with a lot of traction such as OpenTelemetry.

Grafana offers many tools, which we’ve grouped into the following categories:

  • The core Grafana stack: LGTM and the Grafana Agent
  • Grafana enterprise plugins
  • Incident response tools
  • Other Grafana tools

Let’s explore these tools in the following sections.

The core Grafana stack

The core Grafana stack consists of Mimir, Loki, Tempo, and Grafana; the acronym LGTM is often used to refer to this tech stack.

Mimir

Mimir is a Time Series Database (TSDB) for the storage of metric data. It uses low-cost object storage such as S3, GCS, or Azure Blob Storage. First announced for general availability in March 2022, Mimir is the newest of the four products we’ll discuss here, although it’s worth highlighting that Mimir initially forked from another project, Cortex, which was started in 2016. Parts of Cortex also form the core of Loki and Tempo.

Mimir is a fully Prometheus-compatible solution that addresses the common scalability problems encountered with storing and searching huge quantities of metric data. In 2021 Mimir was load tested to 1 billion active time series. An active time series is a metric with a value and unique labels that has reported a sample in the last 20 minutes. We will explore Mimir and Prometheus in much greater detail in Chapter 5.
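The definition of an active time series can be expressed in a few lines of Python. This sketch only illustrates the definition given above and is not how Mimir itself tracks series: the helper names, the metric name, the label values, and the choice to treat exactly 20 minutes as still active are all assumptions.

```python
ACTIVE_WINDOW_SECONDS = 20 * 60


def series_key(name, labels):
    """A series is identified by its metric name plus its unique set of labels."""
    return (name, tuple(sorted(labels.items())))


def count_active_series(last_sample_times, now):
    """Count the series whose most recent sample is within the last 20 minutes."""
    return sum(1 for seen in last_sample_times.values()
               if now - seen <= ACTIVE_WINDOW_SECONDS)


now = 1_700_000_000  # arbitrary Unix timestamp for the example
last_seen = {
    series_key("lock_water_level_metres", {"chamber": "lower"}): now - 30,
    series_key("lock_water_level_metres", {"chamber": "middle"}): now - 19 * 60,
    series_key("lock_water_level_metres", {"chamber": "upper"}): now - 45 * 60,
}
active = count_active_series(last_seen, now)
print(active)
```

Each distinct combination of label values is its own series, which is why adding labels with many possible values increases the active series count so quickly.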

Loki

Loki is a set of components that offer a full-featured logging stack. Loki uses lower-cost object storage such as S3 or GCS, and only indexes label metadata. Loki entered general availability in November 2019.

Log aggregation tools typically use two data structures to store log data: an index that contains references to the location of the raw data paired with searchable metadata, and the raw data itself, stored in a compressed form. Loki differs from a lot of other log aggregation tools by keeping the index data relatively small and scaling the search functionality by using horizontal scaling of the querying component. The process of selecting the best index fields is one we will cover in Chapter 4.
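The two data structures can be illustrated with a toy in-memory store in Python. This is a conceptual sketch only and does not reflect Loki's actual implementation or API: the `TinyLogStore` class, its methods, the `job` label, and the second host name are hypothetical. The index holds only label sets and chunk references, while log lines are compressed and scanned at query time.

```python
import zlib
from collections import defaultdict


class TinyLogStore:
    """Toy store: a small label index plus compressed chunks of raw log lines."""

    def __init__(self):
        self.index = defaultdict(list)  # sorted label pairs -> list of chunk IDs
        self.chunks = {}                # chunk ID -> compressed raw log lines

    def push(self, labels, lines):
        chunk_id = len(self.chunks)
        self.chunks[chunk_id] = zlib.compress("\n".join(lines).encode())
        # Only the labels are indexed; the log text itself is not.
        self.index[tuple(sorted(labels.items()))].append(chunk_id)

    def query(self, selector, contains):
        results = []
        for key, chunk_ids in self.index.items():
            labels = dict(key)
            # Step 1: use the index to pick chunks whose labels match the selector.
            if all(labels.get(k) == v for k, v in selector.items()):
                # Step 2: decompress and scan the matching chunks line by line.
                for chunk_id in chunk_ids:
                    text = zlib.decompress(self.chunks[chunk_id]).decode()
                    results.extend(l for l in text.splitlines() if contains in l)
        return results


store = TinyLogStore()
store.push({"host": "pc-ac-g1", "job": "gate-events"},
           ["starting motors", "motors engaged successfully"])
store.push({"host": "pc-ac-g2", "job": "gate-events"}, ["starting motors"])
matches = store.query({"host": "pc-ac-g1"}, "motors")
print(matches)
```

Step 2 is the part that can be spread across many workers, which is the intuition behind scaling the querying component horizontally rather than growing the index.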

Tempo

Tempo is a storage backend for high-scale distributed trace telemetry, with the aim of sampling 100% of the read path. Like Loki and Mimir, it leverages lower-cost object storage such as S3, GCS, or Azure Blob Storage. Tempo went into general availability in June 2021.

When Tempo 1.0 was released, it had been tested at a sustained ingestion rate of more than 2 million spans per second (about 350 MB per second). Tempo also offers the ability to generate metrics from spans as they are ingested; these metrics can be written to any backend that supports Prometheus remote write. Tempo is explored in detail in Chapter 6.
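The idea of deriving metrics from spans can be sketched in Python as a simple aggregation. This is not Tempo's implementation or its metric naming: the span fields, the service and span names, and the function name `span_metrics` are assumptions made for the illustration.

```python
from collections import defaultdict


def span_metrics(spans):
    """Aggregate spans into a call count and total duration per (service, name)."""
    calls = defaultdict(int)
    seconds = defaultdict(float)
    for span in spans:
        key = (span["service"], span["name"])
        calls[key] += 1
        seconds[key] += span["end"] - span["start"]
    return calls, seconds


# Invented spans with start and end times in seconds.
spans = [
    {"service": "locks", "name": "traverse-lock", "start": 0.0, "end": 1.5},
    {"service": "locks", "name": "traverse-lock", "start": 2.0, "end": 4.0},
    {"service": "locks", "name": "open-gate", "start": 0.25, "end": 0.75},
]
calls, seconds = span_metrics(spans)
print(dict(calls))
print(dict(seconds))
```

Counters like these are cheap to store as metrics even when the individual spans are too numerous to inspect one by one.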

Grafana

Grafana has been a staple for fantastic visualization of data since 2014. It has targeted the ability to connect to a huge variety of data sources from TSDBs to relational databases and even other observability tools. Grafana has over 150 data source plugins available. Grafana has a huge community using it for many different purposes. This community supports over 6,000 dashboards, which means there is a starting place for most available technologies with minimal time to value.

Grafana Agent

Collecting telemetry from many places is one of the fundamental aspects of observability. Grafana Agent is a collection of tools for collecting logs, metrics, and traces. There are many other collection tools that Grafana integrates well with. Different collection tools offer different advantages and disadvantages, which is not a topic we will explore in this book. We will highlight other tools in the space later in this chapter and in Chapter 2 to give you a starting point for learning more about this topic. We will also briefly discuss architecting a collection infrastructure in Chapter 11.

The Grafana stack is a fantastic group of open source software for observability. The commitment of Grafana Labs to open source is supported by great enterprise plugins. Let’s explore them now.

Grafana Enterprise plugins

As part of their Cloud Pro, Cloud Advanced, and Enterprise license offerings, Grafana offers Enterprise plugins. These are part of any paid subscription to Grafana.

The Enterprise data source plugins allow organizations to read data from many other storage tools they may use, from software development tools such as GitLab and Azure DevOps to business intelligence tools such as Snowflake, Databricks, and Looker. Grafana also offers tools to read data from many other observability tools, which enables organizations to build comprehensive operational coverage while offering individual teams a choice of the tools they use.

Alongside the data source plugins, Grafana offers premium tools for logs, metrics, and traces. These include access policies and tokens for log data to secure sensitive information, in-depth health monitoring for the ingest and storage of cloud stacks, and management of tenants.

Grafana incident response and management

Grafana offers three products in the incident response and management (IRM) space:

  • At the foundation of IRM are alerting rules, which can notify via messaging apps, email, or Grafana OnCall
  • Grafana OnCall offers an on-call schedule management system that centralizes alert grouping and escalation routing
  • Finally, Grafana Incident offers a chatbot functionality that can set up necessary incident spaces, collect timelines for a post-incident review process, and even manage the incident directly from a messaging service

These tools are covered in more detail in Chapter 9. Now let’s take a look at some other important Grafana tools.

Other Grafana tools

Grafana Labs continues to be a leader in observability and has acquired several companies in this space to release new products that complement the tools we’ve already discussed. Let’s discuss some of these tools now.

Faro

Grafana Faro is a JavaScript agent that can be added to frontend web applications. The project allows for real user monitoring (RUM) by collecting telemetry from a browser. By adding RUM into an environment where backend applications and infrastructure are instrumented, observers gain the ability to traverse data from the full application stack. Faro supports the collection of the five core web vitals out of the box, as well as several other signals of interest. Faro entered general availability in November 2022. We cover Faro in more detail in Chapter 12.

k6

k6 is a load testing tool that is available both as a packaged tool to run in your own infrastructure and as a cloud Software as a Service (SaaS) offering. Load testing, especially as part of a CI/CD pipeline, lets teams see how their application will perform under load and evaluate optimizations and refactoring. Paired with other detailed analysis tools such as Pyroscope, the level of visibility and accessibility to non-technical members of the team can be astounding. The project started back in 2016 and was acquired by Grafana Labs in June 2021. The goal of k6 is to make performance testing easy and repeatable. We’ll explore k6 in Chapter 13.
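To make the idea of a repeatable load test concrete, here is a minimal sketch in plain Python. It is not k6 and does not use the k6 scripting API; it only illustrates the general pattern of running a fixed number of concurrent "virtual users" against a target and summarizing latency. The names `run_load`, `timed_call`, `percentile`, and `fake_checkout` are hypothetical, and `fake_checkout` is an assumed stand-in for a real request to the system under test.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def timed_call(target):
    """Run target() once and return how long it took, in seconds."""
    start = time.perf_counter()
    target()
    return time.perf_counter() - start


def run_load(target, virtual_users, iterations):
    """Issue virtual_users * iterations calls with at most virtual_users in flight."""
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        futures = [
            pool.submit(timed_call, target)
            for _ in range(virtual_users * iterations)
        ]
        return [future.result() for future in futures]


def fake_checkout():
    """Hypothetical stand-in for a request to the system under test."""
    time.sleep(0.001)


if __name__ == "__main__":
    latencies = run_load(fake_checkout, virtual_users=5, iterations=4)
    p95_ms = percentile(latencies, 95) * 1000
    # A CI pipeline could fail the build when an agreed threshold is exceeded.
    print(f"requests={len(latencies)} p95={p95_ms:.1f} ms")
```

Summarizing with a high percentile rather than the mean is a common choice for load tests, because a small number of slow requests can be hidden by an average.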

Pyroscope

Pyroscope is a recent acquisition of Grafana Labs, joining in March 2023. Pyroscope is a tool that enables teams to engage in the continuous profiling of system resource use by applications (CPU, memory, etc.). Pyroscope advertises that, with a minimal performance overhead of ~2-5%, it can collect samples as frequently as every 10 seconds. Phlare is a Grafana Labs project started in 2022, and the two projects have now merged. We discuss Pyroscope in more detail in Chapter 13.
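As a rough illustration of what a sampling profiler does with the data it collects, the sketch below aggregates call-stack samples into counts. This is not Pyroscope's implementation or data format; the sample data and the function names `fold`, `aggregate`, and `self_time_share` are hypothetical, and the semicolon-joined stack key is simply one common convention for feeding flame graph tools.

```python
from collections import Counter


def fold(stack):
    """Join a call stack (outermost frame first) into one 'a;b;c' key."""
    return ";".join(stack)


def aggregate(samples):
    """Count how many times each distinct stack was observed."""
    return Counter(fold(stack) for stack in samples)


def self_time_share(samples):
    """Fraction of samples in which each function was the innermost frame."""
    leaves = Counter(stack[-1] for stack in samples if stack)
    total = sum(leaves.values())
    return {name: count / total for name, count in leaves.items()}


# Hypothetical samples, as a profiler might capture them at a fixed interval.
SAMPLES = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "query_db"],
    ["main", "idle"],
]

if __name__ == "__main__":
    for stack, count in aggregate(SAMPLES).most_common():
        print(f"{stack} {count}")
    print(self_time_share(SAMPLES))
```

Because each sample is only a stack snapshot, the share of samples in which a function appears approximates the share of time spent there, which is why sampling can keep overhead low.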

Now that you know the different tools available from Grafana Labs, let’s look at some alternatives that are available.

Alternatives to the Grafana stack

The monitoring and observability space is packed with open and closed source solutions, going back to tools such as ps and top in the 70s and 80s. We will not attempt to list every tool here; instead, we aim to offer a source of inspiration for people who are curious and want to explore, or who need a quick reference of the available tools (as the authors have on a few occasions).

Data collection

These are agent tools that can be used to collect telemetry from the source:

Tool Name | Telemetry Types
OpenTelemetry Collector | Metrics, logs, traces
FluentBit | Metrics, logs, traces
Vector | Metrics, logs, traces
Vendor-specific agents (see the Data storage, processing, and visualization section for an expanded list) | Metrics, logs, traces
Beats family | Metrics, logs
Prometheus | Metrics
Telegraf | Metrics
StatsD | Metrics
Collectd | Metrics
Carbon | Metrics
Syslog-ng | Logs
Rsyslog | Logs
Fluentd | Logs
Flume | Logs
Zipkin Collector | Traces

Table 1.2 – Data collection tools

Data collection is only one piece of the extract, transform, and load (ETL) process for observability data. The next section introduces tools to transform and load data.
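The three stages of extract, transform, and load can be sketched for a single log stream as follows. This is a simplified illustration under stated assumptions, not how any particular agent works: the log line layout, the label names, and the functions `extract`, `transform`, and `load` are all hypothetical, and the "store" is just an in-memory list.

```python
import re

# Assumed line layout: "<ISO timestamp> <LEVEL> <service>: <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
)


def extract(raw_text):
    """Extract: split the raw payload into non-empty lines."""
    return [line for line in raw_text.splitlines() if line.strip()]


def transform(line):
    """Transform: parse one line into labels plus message; None if unparseable."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "timestamp": fields["timestamp"],
        "labels": {"level": fields["level"].lower(), "service": fields["service"]},
        "message": fields["message"],
    }


def load(records, store):
    """Load: append parsed records to the store and return how many were kept."""
    kept = [record for record in records if record is not None]
    store.extend(kept)
    return len(kept)


if __name__ == "__main__":
    raw = (
        "2024-01-12T10:00:00Z ERROR checkout: payment gateway timeout\n"
        "not a structured line\n"
    )
    store = []
    loaded = load((transform(line) for line in extract(raw)), store)
    print(loaded, store)
```

Keeping the stages as separate functions mirrors the way collection, processing, and storage are often handled by different tools in a telemetry pipeline.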

Data storage, processing, and visualization

We’ve grouped data processing, storage, and visualization together, as there is often a lot of crossover among them. Certain tools in this closely related space also provide security monitoring. However, as that topic is outside the scope of this book, we have chosen to exclude tools that are solely in the security space.

Tool Name | Tool Name | Tool Name
AppDynamics | InfluxDB | Sematext
Aspecto | Instana | Sensu
AWS CloudWatch & CloudTrail | Jaeger | Sentry
Azure Application insights | Kibana | Serverless360
Centreon | Lightstep | SigNoz
ClickHouse | Loggly | SkyWalking
Coralogix | LogicMonitor | Solarwinds
Cortex | Logtail | Sonic
Cyclotron | Logz.io | Splunk
Datadog | Mezmo | Sumo Logic
Dynatrace | Nagios | TelemetryHub
Elastic | NetData | Teletrace
GCP Cloud Operations Suite | New Relic | Thanos
Grafana Labs | OpenSearch | Uptrace
Graphite | OpenTSDB | VictoriaMetrics
Graylog | Prometheus | Zabbix
Honeycomb | Scalyr | Zipkin

Table 1.3 – Data storage, processing, and visualization tools

With a good understanding of the tools available in this space, let’s now look at the ways we can deploy the tools offered by Grafana.

Deploying the Grafana stack

Grafana Labs fully embraces its history as an open source software provider. The LGTM stack, alongside most other components, is open source. There are a few value-added components that are offered as part of an enterprise subscription.

As a SaaS offering, Grafana Labs provides access to storage for Loki, Mimir, and Tempo, alongside Grafana’s 100+ integrations for external data sources. As a SaaS customer, you also gain ready access to a huge range of other tools you may use and can present their data in a consolidated manner, in a single pane of glass. The SaaS offering allows organizations to leverage a full-featured observability platform without the operational overhead of running it, while obtaining service level agreements for its operation.

As an alternative to having Grafana Labs manage the platform for you, you can run Grafana on your organization’s own infrastructure. Grafana offers its software packaged in several formats for Linux and Windows deployments, as well as offering containerized versions. Grafana also offers Helm and Tanka configuration wrappers for each of its tools. This book will mainly concentrate on the SaaS offering because it is easy to get started with the free tier. We will explore some areas where a local installation can assist in Chapters 11 and 14, which cover architecting and supporting DevOps processes respectively.

Summary

In this chapter, you have been introduced to monitoring and observability, how they are similar, and how they differ. The Agua Clara locks on the Panama Canal acted as a simplified example of the concepts of observability in practice. The key takeaway should be to understand that even when a system produces alerts for significant problems, the same data can be used to observe and investigate other potential problems.

We also talked about the customers who might use observability systems. These customers will be referenced throughout this book when we explore a concept and how to target its implementation.

Finally, we introduced the full Grafana Labs stack, and you should now have a good understanding of the different purposes that each product serves.

In the next chapter, we will introduce the basics of adding instrumentation to applications or infrastructure components for readers whose roles are similar to those of Diego and Ophelia.


Key benefits

  • Use personas to better understand the needs and challenges of observability tools users
  • Get hands-on practice with Grafana and the LGTM stack through real-world examples
  • Implement and integrate LGTM with AWS, Azure, GCP, Kubernetes and tools such as OpenTelemetry, Ansible, Terraform, and Helm
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

To overcome application monitoring and observability challenges, Grafana Labs offers a modern, highly scalable, cost-effective Loki, Grafana, Tempo, and Mimir (LGTM) stack along with Prometheus for the collection, visualization, and storage of telemetry data. Beginning with an overview of observability concepts, this book teaches you how to instrument code and monitor systems in practice using standard protocols and Grafana libraries. As you progress, you’ll create a free Grafana cloud instance and deploy a demo application to a Kubernetes cluster to delve into the implementation of the LGTM stack. You’ll learn how to connect Grafana Cloud to AWS, GCP, and Azure to collect infrastructure data, build interactive dashboards, make use of service level indicators and objectives to produce great alerts, and leverage the AI & ML capabilities to keep your systems healthy. You’ll also explore real user monitoring with Faro and performance monitoring with Pyroscope and k6. Advanced concepts like architecting a Grafana installation, using automation and infrastructure as code tools for DevOps processes, troubleshooting strategies, and best practices to avoid common pitfalls will also be covered. After reading this book, you’ll be able to use the Grafana stack to deliver amazing operational results for the systems your organization uses.

Who is this book for?

If you’re an application developer, a DevOps engineer, an SRE, a platform engineer, or a cloud engineer concerned with Day 2+ systems operations, then this book is for you. Product owners and technical leaders wanting to gain visibility of their products in a standardized, easy-to-implement way will also benefit from this book. A basic understanding of computer systems, cloud computing, cloud platforms, DevOps processes, Docker or Podman, Kubernetes, cloud native, and similar concepts will be useful.

What you will learn

  • Understand fundamentals of observability, logs, metrics, and distributed traces
  • Find out how to instrument an application using Grafana and OpenTelemetry
  • Collect data and monitor cloud, Linux, and Kubernetes platforms
  • Build queries and visualizations using LogQL, PromQL, and TraceQL
  • Manage incidents and alerts using AI-powered incident management
  • Deploy and monitor CI/CD pipelines to automatically validate the desired results
  • Take control of observability costs with powerful in-built features
  • Architect and manage an observability platform using Grafana
Estimated delivery fee Deliver to Indonesia

Standard delivery 10 - 13 business days

$12.95

Premium delivery 5 - 8 business days

$45.95
(Includes tracking information)

Product Details

Publication date: Jan 12, 2024
Length: 356 pages
Edition: 1st
Language: English
ISBN-13: 9781803248004



Frequently bought together


The Ultimate Docker Container Book
$49.99
Modern DevOps Practices
$49.99
Observability with Grafana
$49.99
Total $149.97

Table of Contents

21 Chapters
Part 1: Get Started with Grafana and Observability
Chapter 1: Introducing Observability and the Grafana Stack
Chapter 2: Instrumenting Applications and Infrastructure
Chapter 3: Setting Up a Learning Environment with Demo Applications
Part 2: Implement Telemetry in Grafana
Chapter 4: Looking at Logs with Grafana Loki
Chapter 5: Monitoring with Metrics Using Grafana Mimir and Prometheus
Chapter 6: Tracing Technicalities with Grafana Tempo
Chapter 7: Interrogating Infrastructure with Kubernetes, AWS, GCP, and Azure
Part 3: Grafana in Practice
Chapter 8: Displaying Data with Dashboards
Chapter 9: Managing Incidents Using Alerts
Chapter 10: Automation with Infrastructure as Code
Chapter 11: Architecting an Observability Platform
Part 4: Advanced Applications and Best Practices of Grafana
Chapter 12: Real User Monitoring with Grafana
Chapter 13: Application Performance with Grafana Pyroscope and k6
Chapter 14: Supporting DevOps Processes with Observability
Chapter 15: Troubleshooting, Implementing Best Practices, and More with Grafana
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
4 out of 5 stars (4 Ratings)
5 star 75%
4 star 0%
3 star 0%
2 star 0%
1 star 25%
Average Joe Feb 21, 2024
5 out of 5 stars
I enjoyed the level of detail that was provided in this book. It was well organized for both beginners and experts of application developers. If you want great awareness of your application is operation, then this book is for you.
Amazon Verified review Amazon
Chappers Jan 17, 2024
5 out of 5 stars
A well written tech book, lets you easily follow along with helpful steps to help you create a Grafana cloud instance and deploy a demo application to a Kubernetes cluster. Nothing is better than something you can follow along rather than an extensive dump of technical information! Thank you for putting this book together!
Amazon Verified review Amazon
Jan Jan 16, 2024
5 out of 5 stars
I recently had the opportunity to read this book that provided a comprehensive and up-to-date overview of observability with Grafana. I was particularly impressed with that the book incorporated also recent new features/services from Grafana labs. One aspect that stood out to me was the book's coverage of OpenTelemetry. This is crucial as it ensures that users are not locked into a specific vendor - it is a Grafana's commitment to providing a flexible and open solution for observability. I found myself agreeing with all the concepts mentioned in the book, as they aligned with industry best practices and provided practical insights into implementing effective observability strategies. Overall, I highly recommend this book to anyone interested in gaining a comprehensive understanding of observability with LGTM stack. Its up-to-date content, inclusion of recent features, and emphasis on OpenTelemetry support make it a valuable resource for both newcomers and seasoned professionals in the field.
Amazon Verified review Amazon
Akko Apr 20, 2024
1 out of 5 stars
It is just to force you to use grafana cloud. Everything is there in grafana just use that's all it there in this book. It never talks anything apart from this. Really repending on buying this book. You will never learn anything apart from on where to find result on cloud and how to run jars prepared by author to send data to cloud. Nothing else absolute waste.
Amazon Verified review Amazon

FAQs

What is the delivery time and cost of print book?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is a customs duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to the countries listed under EU27 will not bear customs charges; these are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

For shipments to countries outside the EU27, a customs duty or localized taxes may apply. These are charged by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e. where Packt Publishing agrees to replace your printed book because it arrived damaged or with a material defect). Otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items (damaged, defective, or incorrect).
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receipt of the book with appropriate evidence of the damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal