Mastering Distributed Tracing

Part I. Introduction

Chapter 1. Why Distributed Tracing?

Modern, internet-scale, cloud-native applications are very complex distributed systems. Building them is hard and debugging them is even harder. The growing popularity of microservices and functions-as-a-service (also known as FaaS or serverless) only exacerbates the problem; these architectural styles bring many benefits to the organizations adopting them, while complicating some of the aspects of operating the systems even further.

In this chapter, I will talk about the challenges of monitoring and troubleshooting distributed systems, including those built with microservices, and discuss how and why distributed tracing is in a unique position among the observability tools to address this problem. I will also describe my personal history with distributed tracing and why I decided to write this book.

Microservices and cloud-native applications

In the last decade, we saw a significant shift in how modern, internet-scale applications are being built. Cloud computing (infrastructure as a service) and containerization technologies (popularized by Docker) enabled a new breed of distributed system designs commonly referred to as microservices (and their next incarnation, FaaS). Successful companies like Twitter and Netflix have been able to leverage them to build highly scalable, efficient, and reliable systems, and to deliver more features faster to their customers.

While there is no official definition of microservices, a certain consensus has evolved over time in the industry. Martin Fowler, the author of many books on software design, argues that microservices architectures exhibit the following common characteristics [1]:

  • Componentization via (micro)services: The componentization of functionality in a complex application is achieved via services, or microservices, that are independent processes communicating over a network. The microservices are designed to provide fine-grained interfaces and to be small in size, autonomously developed, and independently deployable.
  • Smart endpoints and dumb pipes: The communications between services utilize technology-agnostic protocols such as HTTP and REST, as opposed to smart mechanisms like the Enterprise Service Bus (ESB).
  • Organized around business capabilities: Products not projects: the services are organized around business functions ("user profile service" or "fulfillment service"), rather than technologies. The development process treats the services as continuously evolving products rather than projects that are considered to be completed once delivered.
  • Decentralized governance: Allows different microservices to be implemented using different technology stacks.
  • Decentralized data management: Manifests in the decisions for both the conceptual data models and the data storage technologies being made independently between services.
  • Infrastructure automation: The services are built, released, and deployed with automated processes, utilizing automated testing, continuous integration, and continuous deployment.
  • Design for failure: The services are always expected to tolerate failures of their dependencies and either retry the requests or gracefully degrade their own functionality.
  • Evolutionary design: Individual components of a microservices architecture are expected to evolve independently, without forcing upgrades on the components that depend on them.

Because of the large number of microservices involved in building modern applications, rapid provisioning, rapid deployment via decentralized continuous delivery, strict DevOps practices, and holistic service monitoring are necessary to effectively develop, maintain, and operate such applications. The infrastructure requirements imposed by the microservices architectures spawned a whole new area of development of infrastructure platforms and tools for managing these complex cloud-native applications. In 2015, the Cloud Native Computing Foundation (CNCF) was created as a vendor-neutral home for many emerging open source projects in this area, such as Kubernetes, Prometheus, Linkerd, and so on, with a mission to "make cloud-native computing ubiquitous."

"Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.

These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil."

-- Cloud Native Computing Foundation Charter [2]

At the time of writing, the list of graduated and incubating projects at CNCF [3] contained 20 projects (Figure 1.1). They all have a single common theme: providing a platform for efficient deployment and operation of cloud-native applications. The observability tools occupy an arguably disproportionate (20 percent) number of slots:

  • Prometheus: A monitoring and alerting platform
  • Fluentd: A logging data collection layer
  • OpenTracing: Vendor-neutral APIs and instrumentation for distributed tracing
  • Jaeger: A distributed tracing platform

CNCF sandbox projects, the third category not shown in Figure 1.1, include two more monitoring-related projects: OpenMetrics and Cortex. Why is observability in such high demand for cloud-native applications?

Figure 1.1: Graduated and incubating projects at CNCF as of January 2019. Project names and logos are registered trademarks of the Linux Foundation.

What is observability?

The term "observability" in control theory states that the system is observable if the internal states of the system and, accordingly, its behavior, can be determined by only looking at its inputs and outputs. At the 2018 Observability Practitioners Summit [4], Bryan Cantrill, the CTO of Joyent and one of the creators of the tool dtrace, argued that this definition is not practical to apply to software systems because they are so complex that we can never know their complete internal state, and therefore the control theory's binary measure of observability is always zero (I highly recommend watching his talk on YouTube: https://youtu.be/U4E0QxzswQc). Instead, a more useful definition of observability for a software system is its "capability to allow a human to ask and answer questions". The more questions we can ask and answer about the system, the more observable it is.

Figure 1.2: The Twitter debate

There are also many debates and Twitter zingers about the difference between monitoring and observability. Traditionally, the term monitoring was used to describe metrics collection and alerting. Sometimes it is used more generally to include other tools, such as "using distributed tracing to monitor distributed transactions." Oxford Dictionaries define the verb "monitor" as "to observe and check the progress or quality of (something) over a period of time; keep under systematic review." However, it is better thought of as the process of observing certain a priori defined performance indicators of our software system, such as those measuring the impact on the end user experience, like latency or error counts, and using their values to alert us when these signals indicate abnormal behavior of the system. Metrics, logs, and traces can all be used as a means to extract those signals from the application. We can then reserve the term "observability" for situations when we have a human operator proactively asking questions that were not predefined. As Bryan Cantrill put it in his talk, this process is debugging, and we need to "use our brains when debugging." Monitoring does not require a human operator; it can and should be fully automated.

"If you want to talk about (metrics, logs, and traces) as pillars of observability–great.

The human is the foundation of observability!"

-- Bryan Cantrill

In the end, the so-called "three pillars of observability" (metrics, logs, and traces) are just tools, or more precisely, different ways of extracting sensor data from the applications. Even with metrics, the modern time series solutions like Prometheus, InfluxDB, or Uber's M3 are capable of capturing the time series with many labels, such as which host emitted a particular value of a counter. Not all labels may be useful for monitoring, since a single misbehaving service instance in a cluster of thousands does not warrant an alert that wakes up an engineer. But when we are investigating an outage and trying to narrow down the scope of the problem, the labels can be very useful as observability signals.
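
To make the idea of labeled metrics more concrete, here is a minimal sketch using the Prometheus Go client library; the metric name, label names, and port are invented for illustration and are not taken from any particular system.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestCounter counts requests, partitioned by hypothetical labels such as
// the emitting host and the HTTP status code.
var requestCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests.",
	},
	[]string{"host", "status"},
)

func main() {
	prometheus.MustRegister(requestCounter)

	// Alerts would typically look at the aggregate across all hosts, while the
	// per-host breakdown becomes useful when narrowing down an outage.
	requestCounter.WithLabelValues("host-42", "500").Inc()

	// Expose the metrics endpoint for scraping by a Prometheus server.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```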

The observability challenge of microservices

By adopting microservices architectures, organizations are expecting to reap many benefits, from better scalability of components to higher developer productivity. There are many books, articles, and blog posts written on this topic, so I will not go into that. Despite the benefits and eager adoption by companies large and small, microservices come with their own challenges and complexity. Companies like Twitter and Netflix were successful in adopting microservices because they found efficient ways of managing that complexity. Vijay Gill, Senior VP of Engineering at Databricks, goes as far as saying that the only good reason to adopt microservices is to be able to scale your engineering organization and to "ship the org chart" [5].

Vijay Gill's opinion may not be a popular one yet. A 2018 "Global Microservices Trends" study [6] by Dimensional Research® found that over 91% of interviewed professionals are using or have plans to use microservices in their systems. At the same time, 56% say each additional microservice "increases operational challenges," and 73% find "troubleshooting is harder" in a microservices environment. There is even a famous tweet about adopting microservices:

Figure 1.3: The tweet in question

Consider Figure 1.4, which gives a visual representation of a subset of microservices in Uber's microservices architecture, rendered by Uber's distributed tracing platform Jaeger. It is often called a service dependencies graph or a topology map. The circles (nodes in the graph) represent different microservices. The edges are drawn between nodes that communicate with each other. The diameter of the nodes is proportional to the number of other microservices connecting to them, and the width of an edge is proportional to the volume of traffic going through that edge.

The picture is already so complex that we don't even have space to include the names of the services (in the real Jaeger UI you can see them by moving the mouse over nodes). Every time a user takes an action on the mobile app, a request is executed by the architecture that may require dozens of different services to participate in order to produce a response. Let's call the path of this request a distributed transaction.

Figure 1.4: A visual representation of a subset of Uber's microservices architecture and a hypothetical transaction

So, what are the challenges of this design? There are quite a few:

  • In order to run these microservices in production, we need an advanced orchestration platform that can schedule resources, deploy containers, auto-scale, and so on. Operating an architecture of this scale manually is simply not feasible, which is why projects like Kubernetes became so popular.
  • In order to communicate, microservices need to know how to find each other on the network, how to route around problematic areas, how to perform load balancing, how to apply rate limiting, and so on. These functions are delegated to advanced RPC frameworks or external components like network proxies and service meshes.
  • Splitting a monolith into many microservices may actually decrease reliability. Suppose we have 20 components in the application and all of them are required to produce a response to a single request. When we run them in a monolith, our failure modes are restricted to bugs and potentially a crash of the whole server running the monolith. But if we run the same components as microservices, on different hosts and separated by a network, we introduce many more potential failure points, from network hiccups to resource constraints due to noisy neighbors. Even if each microservice succeeds in 99.9% of cases, the whole application that requires all of them to work for a given request can only succeed 0.999²⁰ ≈ 98.0% of the time (see the short calculation after this list). Distributed, microservices-based applications must become more complicated, for example, implementing retries or opportunistic parallel reads, in order to maintain the same level of availability.
  • The latency may also increase. Assume each microservice has 1 ms average latency, but the 99th percentile is 1s. A transaction touching just one of these services has a 1% chance of taking ≥ 1s. A transaction touching 100 of these services has a 1 − (1 − 0.01)¹⁰⁰ ≈ 63% chance of taking ≥ 1s.
  • Finally, the observability of the system is dramatically reduced if we try to use traditional monitoring tools.
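
The availability and latency figures above follow from simple probability. The short Go program below reproduces the two calculations; the service counts and percentile values are just the hypothetical numbers from the bullet points, not measurements of any real system.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Availability: 20 components, each succeeding 99.9% of the time,
	// and all of them required for a single request to succeed.
	overall := math.Pow(0.999, 20)
	fmt.Printf("availability across 20 services: %.1f%%\n", overall*100) // ~98.0%

	// Latency: each service has a 1% chance (its 99th percentile) of taking >= 1s.
	// A request touching 100 services is slow if at least one of them is slow.
	pAnySlow := 1 - math.Pow(1-0.01, 100)
	fmt.Printf("chance of at least one slow call: %.0f%%\n", pAnySlow*100) // ~63%
}
```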

When we see that some requests to our system are failing or slow, we want our observability tools to tell us the story about what happens to that request. We want to be able to ask questions like these:

  • Which services did a request go through?
  • What did every microservice do when processing the request?
  • If the request was slow, where were the bottlenecks?
  • If the request failed, where did the error happen?
  • How different was the execution of the request from the normal behavior of the system?
    • Were the differences structural, that is, some new services were called, or vice versa, some usual services were not called?
    • Were the differences related to performance, that is, some service calls took a longer or shorter time than usual?
  • What was the critical path of the request?
  • And perhaps most importantly, if selfishly, who should be paged?

Unfortunately, traditional monitoring tools are ill-equipped to answer these questions for microservices architectures.

Traditional monitoring tools

Traditional monitoring tools were designed for monolith systems, observing the health and behavior of a single application instance. They may be able to tell us a story about that single instance, but they know almost nothing about the distributed transaction that passed through it. These tools lack the context of the request.

Metrics

It goes like this: "Once upon a time…something bad happened. The end." How do you like this story? This is what the chart in Figure 1.5 tells us. It's not completely useless; we do see a spike and we could define an alert to fire when this happens. But can we explain or troubleshoot the problem?

Figure 1.5: A graph of two time series representing (hypothetically) the volume of traffic to a service

Metrics, or stats, are numerical measures recorded by the application, such as counters, gauges, or timers. Metrics are very cheap to collect, since numeric values can be easily aggregated to reduce the overhead of transmitting that data to the monitoring system. They are also fairly accurate, which is why they are very useful for the actual monitoring (as the dictionary defines it) and alerting.

Yet the same capacity for aggregation is what makes metrics ill-suited for explaining the pathological behavior of the application. By aggregating data, we are throwing away all the context we had about the individual transactions.

In Chapter 11, Integration with Metrics and Logs, we will talk about how integration with tracing and context propagation can make metrics more useful by providing them with the lost context. Out of the box, however, metrics are a poor tool to troubleshoot problems within microservices-based applications.

Logs

Logging is an even more basic observability tool than metrics. Every programmer learns their first programming language by writing a program that prints (that is, logs) "Hello, World!" Similar to metrics, logs struggle with microservices because each log stream only tells us about a single instance of a service. However, the evolving programming paradigm creates other problems for logs as a debugging tool. Ben Sigelman, who built Google's distributed tracing system Dapper [7], explained it in his KubeCon 2016 keynote talk [8] as four types of concurrency (Figure 1.6):

Figure 1.6: Evolution of concurrency

Years ago, applications like early versions of Apache HTTP Server handled concurrency by forking child processes and having each process handle a single request at a time. Logs collected from that single process could do a good job of describing what happened inside the application.

Then came multi-threaded applications and basic concurrency. A single request would typically be executed by a single thread sequentially, so as long as we included the thread name in the logs and filtered by that name, we could still get a reasonably accurate picture of the request execution.

Then came asynchronous concurrency, with asynchronous and actor-based programming, executor pools, futures, promises, and event-loop-based frameworks. The execution of a single request may start on one thread, then continue on another, and finish on a third. In the case of event loop systems like Node.js, all requests are processed on a single thread, but when the execution needs to perform I/O, it is put in a wait state, and when the I/O is done, the execution resumes after waiting for its turn in the queue.

Both of these asynchronous concurrency models result in each thread switching between multiple different requests that are all in flight. Observing the behavior of such a system from the logs is very difficult, unless we annotate all logs with some kind of unique id representing the request rather than the thread, a technique that actually gets us close to how distributed tracing works.
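
As a rough sketch of that technique, the following Go snippet (with invented helper names) propagates a unique request ID through a context and stamps it on every log line, so that log entries can be grouped by request rather than by thread.

```go
package main

import (
	"context"
	"log"

	"github.com/google/uuid"
)

type requestIDKey struct{}

// withRequestID stores a unique per-request ID in the context, so the ID can
// follow the request across goroutines and asynchronous callbacks.
func withRequestID(ctx context.Context) context.Context {
	return context.WithValue(ctx, requestIDKey{}, uuid.New().String())
}

// logf prefixes every log line with the request ID instead of a thread name,
// which is meaningless under asynchronous concurrency.
func logf(ctx context.Context, format string, args ...interface{}) {
	id, _ := ctx.Value(requestIDKey{}).(string)
	log.Printf("request_id=%s "+format, append([]interface{}{id}, args...)...)
}

func handle(ctx context.Context) {
	logf(ctx, "started handling request")
	// ... hand off to other goroutines or async callbacks, passing ctx along ...
	logf(ctx, "finished handling request")
}

func main() {
	handle(withRequestID(context.Background()))
}
```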

Finally, microservices introduced what we can call "distributed concurrency." Not only can the execution of a single request jump between threads, but it can also jump between processes, when one microservice makes a network call to another. Trying to troubleshoot request execution from such logs is like debugging without a stack trace: we get small pieces, but no big picture.

In order to reconstruct the flight of the request from the many log streams, we need powerful logs aggregation technology and a distributed context propagation capability to tag all those logs in different processes with a unique request id that we can use to stitch those requests together. We might as well be using the real distributed tracing infrastructure at this point! Yet even after tagging the logs with a unique request id, we still cannot assemble them into an accurate sequence, because the timestamps from different servers are generally not comparable due to clock skews. In Chapter 11, Integration with Metrics and Logs, we will see how tracing infrastructure can be used to provide the missing context to the logs.

Distributed tracing

As soon as we start building a distributed system, traditional monitoring tools begin struggling with providing observability for the whole system, because they were designed to observe a single component, such as a program, a server, or a network switch. The story of a single component may no doubt be very interesting, but it tells us very little about the story of a request that touches many of those components. We need to know what happens to that request in all of them, end-to-end, if we want to understand why a system is behaving pathologically. In other words, we first want a macro view.

At the same time, once we get that macro view and zoom in to a particular component that seems to be at fault for the failure or performance problems with our request, we want a micro view of what exactly happened to that request in that component. Most other tools cannot tell us that either, because they only observe what "generally" happens in the component as a whole, for example, how many requests per second it handles (metrics), what events occurred on a given thread (logs), or which threads are on and off CPU at a given point in time (profilers). They don't have the granularity or context to observe a specific request.

Distributed tracing takes a request-centric view. It captures the detailed execution of causally-related activities performed by the components of a distributed system as it processes a given request. In Chapter 3, Distributed Tracing Fundamentals, I will go into more detail on how exactly it works, but in a nutshell (a minimal instrumentation sketch follows the list below):

  • Tracing infrastructure attaches contextual metadata to each request and ensures that metadata is passed around during the request execution, even when one component communicates with another over a network.
  • At various trace points in the code, the instrumentation records events annotated with relevant information, such as the URL of an HTTP request or an SQL statement of a database query.
  • Recorded events are tagged with the contextual metadata and explicit causality references to prior events.
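
To make these trace points concrete, here is a minimal, hedged sketch using the OpenTracing Go API; the operation names and tag values are invented, and tracer setup is omitted (with no tracer registered, the global tracer is a no-op, so the snippet runs as-is).

```go
package main

import (
	"context"
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// handleRequest is a trace point: it starts a span, annotates it with request
// details, and passes it along via the context so that downstream calls can
// record causally-related child spans.
func handleRequest(ctx context.Context, req *http.Request) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "handle-request")
	defer span.Finish()

	// Annotate the recorded event with relevant information, such as the URL.
	ext.HTTPMethod.Set(span, req.Method)
	ext.HTTPUrl.Set(span, req.URL.String())

	queryDatabase(ctx, "SELECT * FROM drivers")
}

// queryDatabase creates a child span; the causal reference to the parent span
// comes from the contextual metadata carried in ctx.
func queryDatabase(ctx context.Context, query string) {
	span, _ := opentracing.StartSpanFromContext(ctx, "db-query")
	defer span.Finish()
	span.SetTag("db.statement", query)
	// ... execute the query ...
}

func main() {
	// No tracer is registered here, so the global tracer is a no-op;
	// a real application would install a Jaeger (or other) tracer first.
	req, _ := http.NewRequest("GET", "http://example.com/dispatch", nil)
	handleRequest(context.Background(), req)
}
```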

That deceptively simple technique allows the tracing infrastructure to reconstruct the whole path of the request, through the components of a distributed system, as a graph of events and causal edges between them, which we call a "trace." A trace allows us to reason about how the system was processing the request. Individual graphs can be aggregated and clustered to infer patterns of behaviors in the system. Traces can be displayed using various forms of visualizations, including Gantt charts (Figure 1.7) and graph representations (Figure 1.8), to give our visual cortex cues for finding the root cause of performance problems:

Figure 1.7: Jaeger UI view of a single request to the HotROD application, further discussed in Chapter 2. In the bottom half, one of the spans (named GetDriver, from the redis service, with a warning icon) is expanded to show additional information, such as tags and span logs.

Figure 1.8: Jaeger UI view of two traces A and B being compared structurally in the graph form (best viewed in color). Light/dark green colors indicate services that were encountered more/only in trace B, and light/dark red colors indicate services encountered more/only in trace A.

By taking a request-centric view, tracing helps to illuminate different behaviors of the system. Of course, as Bryan Cantrill said in his KubeCon talk, just because we have tracing, it doesn't mean that we have eliminated performance pathologies in our applications. We still need to know how to use it to ask the sophisticated questions that this powerful tool now allows us to ask. Fortunately, distributed tracing is able to answer all the questions we posed in The observability challenge of microservices section.

My experience with tracing

My first experience with distributed tracing was somewhere around 2010, even though we did not use that term at the time. I was working on a trade capture and trade processing system at Morgan Stanley. It was built as a service-oriented architecture (SOA), and the whole system contained more than a dozen different components deployed as independent Java applications. The system was used for over-the-counter interest rate derivatives products (like swaps and options), which had high complexity but not a huge trading volume, so most of the system components were deployed as a single instance, with the exception of the stateless pricers that were deployed as a cluster.

One of the observability challenges with the system was that each trade had to go through a complicated sequence of additional changes, matching, and confirmation flows, implemented by the different components of the system.

To give us visibility into the various state transitions of the individual trades, we used an APM vendor (now defunct) that was essentially implementing a distributed tracing platform. Unfortunately, our experience with that technology was not particularly stellar. The main challenge was the difficulty of instrumenting our applications for tracing, which involved creating aspect-oriented programming (AOP)-style instructions in XML files and trying to match on the signatures of internal APIs. The approach was very fragile, as changes to the internal APIs would cause the instrumentation to become ineffective, without good facilities to enforce it via unit testing. Getting instrumentation into existing applications is one of the main difficulties in adopting distributed tracing, as we will discuss in this book.

When I joined Uber in mid-2015, the engineering team in New York had only a handful of engineers, and many of them were working on the metrics system, which later became known as M3. At the time, Uber was just starting its journey towards breaking the existing monolith and replacing it with microservices. The Python monolith, appropriately called "API", was already instrumented with another home-grown tracing-like system called Merckx.

The major shortcoming of Merckx was that its design dated back to the days of the monolithic application. It lacked any concept of distributed context propagation. It recorded SQL queries, Redis calls, and even calls to other services, but there was no way to go more than one level deep. It also stored the in-process context in global, thread-local storage, and when many new Python microservices at Uber began adopting the event-loop-based framework Tornado, the propagation mechanism in Merckx was unable to represent the state of the many concurrent requests running on the same thread. By the time I joined Uber, Merckx was in maintenance mode, with hardly anyone working on it, even though it had active users. Given the new observability theme of the New York engineering team, I, along with another engineer, Onwukike Ibe, took up the mantle of building a fully-fledged distributed tracing platform.

I had no experience with building such systems in the past, but after reading the Dapper paper from Google, it seemed straightforward enough. Plus, there was already an open source clone of Dapper, the Zipkin project, originally built by Twitter. Unfortunately, Zipkin did not work for us out of the box.

In 2014, Uber started building its own RPC framework called TChannel. It did not really become popular in the open source world, but when I was just getting started with tracing, many services at Uber were already using that framework for inter-process communications. The framework came with tracing instrumentation built-in, even natively supported in the binary protocol format. So, we already had traces being generated in production, only nothing was gathering and storing them.

I wrote a simple collector in Go that was receiving traces generated by TChannel in a custom Thrift format and storing them in the Cassandra database in the same format that the Zipkin project used. This allowed us to deploy the collectors alongside the Zipkin UI, and that's how Jaeger was born. You can read more about this in a post on the Uber Engineering blog [9].

Having a working tracing backend, however, was only half of the battle. Although TChannel was actively used by some of the newer services, many more existing services were using plain JSON over HTTP, utilizing many different HTTP frameworks in different programming languages. In some of the languages, for example, Java, TChannel wasn't even available or mature enough. So, we needed to solve the same problem that made our tracing experiment at Morgan Stanley fizzle out: how to get tracing instrumentation into hundreds of existing services, implemented with different technology stacks.

As luck would have it, I was attending one of the Zipkin Practitioners workshops organized by Adrian Cole from Pivotal, the lead maintainer of the Zipkin project, and that same exact problem was on everyone's mind. Ben Sigelman, who founded his own observability company Lightstep earlier that year, was at the workshop too, and he proposed to create a project for a standardized tracing API that could be implemented by different tracing vendors independently, and could be used to create completely vendor-neutral, open source, reusable tracing instrumentation for many existing frameworks and drivers. We brainstormed the initial design of the API, which later became the OpenTracing project [10] (more on that in Chapter 6, Tracing Standards and Ecosystem). All examples in this book use the OpenTracing APIs for instrumentation.

The evolution of the OpenTracing APIs, which is still ongoing, is a topic for another story. Yet even the initial versions of OpenTracing gave us the peace of mind that if we started adopting it on a large scale at Uber, we were not going to lock ourselves into a single implementation. Having different vendors and open source projects participating in the development of OpenTracing was very encouraging. We implemented Jaeger-specific, fully OpenTracing-compatible tracing libraries in several languages (Java, Go, Python, and Node.js), and started rolling them out to Uber microservices. Last time I checked, we had close to 2,400 microservices instrumented with Jaeger.

I have been working in the area of distributed tracing ever since. The Jaeger project has grown and matured. Eventually, we replaced the Zipkin UI with Jaeger's own, more modern UI built with React, and in April 2017, we open sourced all of Jaeger, from the client libraries to the backend components.

By supporting OpenTracing, we were able to rely on the ever-growing ecosystem of open source instrumentation hosted at the opentracing-contrib organization on GitHub [11], instead of writing our own the way some other projects have done. This freed the Jaeger developers to focus on building a best-of-class tracing backend with data analysis and visualization features. Many other tracing solutions have borrowed features first introduced in Jaeger, just like Jaeger borrowed its initial feature set from Zipkin.

In the fall of 2017, Jaeger was accepted as an incubating project to CNCF, following in the footsteps of the OpenTracing project. Both projects are very active, with hundreds of contributors, and are used by many organizations around the world. The Chinese giant Alibaba even offers hosted Jaeger as part of its Alibaba Cloud services [12]. I probably spend 30-50% of my time at work collaborating with contributors to both projects, including code reviews for pull requests and new feature designs.

Why this book?

When I began studying distributed tracing after joining Uber, there was not a lot of information out there. The Dapper paper gave the foundational overview and the technical report by Raja Sambasivan and others [13] provided a very useful historical background. But there was little in the way of a recipe book that would answer more practical questions, such as:

  • Where do I start with tracing in a large organization?
  • How do I drive adoption of tracing instrumentation across existing systems?
  • How does the instrumentation even work? What are the basics? What are the recommended patterns?
  • How do I get the most benefit and return on investment from tracing?
  • What do I do with all that tracing data?
  • How do I operate a tracing backend in real production and not in a toy application?

In early 2018, I realized that I had pretty good answers to these questions, while most people who were just starting to look into tracing still didn't, and no comprehensive guide had been published anywhere. Even the basic instrumentation steps are often confusing to people if they do not understand the underlying concepts, as evidenced by the many questions posted in the Jaeger and OpenTracing chat rooms.

When I gave the OpenTracing tutorial at the Velocity NYC conference in 2017, I created a GitHub repository that contained step-by-step walkthroughs for instrumentation, from a basic "Hello, World!" program to a small microservices-based application. The tutorials were repeated in several programming languages (I originally created them for Java, Go, and Python, and later other people created more, for Node.js and C#). I have seen time and again how these simple tutorials help people learn the ropes:

Figure 1.9: Feedback about a tutorial

So, I was thinking, maybe I should write a book that would cover not just the instrumentation tutorials, but give a comprehensive overview of the field, from its history and fundamentals to practical advice about where to start and how to get the most benefits from tracing. To my surprise, Andrew Waldron from Packt Publishing reached out to me offering to do exactly that. The rest is history, or rather, this book.

One aspect that made me reluctant to start writing was the fact that the boom of microservices and serverless has created a big gap in observability solutions capable of addressing the challenges posed by these architectural styles, so tracing is receiving a lot of renewed interest, even though the basic idea of distributed tracing systems is not new. Accordingly, there is a lot of change happening in this area, and there was a risk that anything I wrote would quickly become obsolete. It is possible that in the future, OpenTracing might be replaced by some more advanced API. However, the thought that made me push through was that this book is not about OpenTracing or Jaeger. I use them as examples because they are the projects that are most familiar to me. The ideas and concepts introduced throughout the book are not tied to these projects. If you decide to instrument your applications with Zipkin's Brave library, or with OpenCensus, or even with some vendor's proprietary API, the fundamentals of instrumentation and distributed tracing mechanics are going to be the same, and the advice I give in the later chapters about practical applications and the adoption of tracing will still apply equally.

Summary

In this chapter, we took a high-level look at observability problems created by the new popular architectural styles, microservices and FaaS, and discussed why traditional monitoring tools are failing to fill this gap, whereas distributed tracing provides a unique way of getting both a macro and micro view of the system behavior when it executes individual requests.

I have also talked about my own experience and history with tracing, and why I wrote this book as a comprehensive guide for the many engineers coming to the field of tracing.

In the next chapter, we are going to take a hands-on deep dive into tracing, by running a tracing backend and a microservices-based demo application. It will complement the claims made in this introduction with concrete examples of the capabilities of end-to-end tracing.

References

  1. Martin Fowler, James Lewis. Microservices: a definition of this new architectural term: https://www.martinfowler.com/articles/microservices.html.
  2. Cloud Native Computing Foundation (CNCF) Charter: https://github.com/cncf/foundation/blob/master/charter.md.
  3. CNCF projects: https://www.cncf.io/projects/.
  4. Bryan Cantrill. Visualizing Distributed Systems with Statemaps. Observability Practitioners Summit at KubeCon/CloudNativeCon NA 2018, December 10: https://sched.co/HfG2.
  5. Vijay Gill. The Only Good Reason to Adopt Microservices: https://lightstep.com/blog/the-only-good-reason-to-adopt-microservices/.
  6. Global Microservices Trends Report: https://go.lightstep.com/global-microservices-trends-report-2018.
  7. Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed system tracing infrastructure. Technical Report dapper-2010-1, Google, April 2010.
  8. Ben Sigelman. Keynote: OpenTracing and Containers: Depth, Breadth, and the Future of Tracing. KubeCon/CloudNativeCon North America, 2016, Seattle: https://sched.co/8fRU.
  9. Yuri Shkuro. Evolving Distributed Tracing at Uber Engineering. Uber Eng Blog, February 2, 2017: https://eng.uber.com/distributed-tracing/.
  10. The OpenTracing Project: http://opentracing.io/.
  11. The OpenTracing Contributions: https://github.com/opentracing-contrib/.
  12. Alibaba Cloud documentation. OpenTracing implementation of Jaeger: https://www.alibabacloud.com/help/doc-detail/68035.htm.
  13. Raja R. Sambasivan, Rodrigo Fonseca, Ilari Shafer, Gregory R. Ganger. So, You Want to Trace Your Distributed System? Key Design Insights from Years of Practical Experience. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102. April 2014.

Chapter 2. Take Tracing for a HotROD Ride

A picture is worth a thousand words. So far, we have only talked about distributed tracing in abstract terms. In this chapter, we are going to look at concrete examples of the diagnostic and troubleshooting tools provided by a tracing system. We are going to use Jaeger (pronounced "yay-ger"), an open source distributed tracing system, originally created by Uber Technologies [1] and now hosted with the Cloud Native Computing Foundation [2].

The chapter will:

  • Introduce HotROD, an example application provided by the Jaeger project, which is built with microservices and instrumented with the OpenTracing API (we will discuss OpenTracing in detail in Chapter 4, Instrumentation Basics with OpenTracing)
  • Use Jaeger's user interface to understand the architecture and the data flow of the HotROD application
  • Compare standard logging output of the application with contextual logging capabilities of distributed tracing
  • Investigate and attempt to fix the root causes of latency in the application
  • Demonstrate the distributed context propagation features of OpenTracing

Prerequisites

All relevant screenshots and code snippets are included in this chapter, but you are strongly encouraged to try running the example and explore the features of the web UIs, in order to better understand the capabilities of distributed tracing solutions like Jaeger.

Both the Jaeger backend and the demo application can be run as downloadable binaries for macOS, Linux, and Windows, as Docker containers, or directly from the source code. Since Jaeger is an actively developed project, by the time you read this book, some of the code organization or distributions may have changed. To ensure you are following the same steps as described in this chapter, we are going to use Jaeger version 1.6.0, released in July 2018.

Running from prepackaged binaries

Downloading prepackaged binaries is likely the easiest way to get started, since it requires no additional setup or installation. All Jaeger backend components, as well as the HotROD demo application, are available as executable binaries for macOS, Linux, and Windows from GitHub: https://github.com/jaegertracing/jaeger/releases/tag/v1.6.0/. For example, to run the binaries on macOS, download the archive jaeger-1.6.0-darwin-amd64.tar.gz and extract its contents into a directory:

$ tar xvfz jaeger-1.6.0-darwin-amd64.tar.gz
x jaeger-1.6.0-darwin-amd64/
x jaeger-1.6.0-darwin-amd64/example-hotrod
x jaeger-1.6.0-darwin-amd64/jaeger-query
x jaeger-1.6.0-darwin-amd64/jaeger-standalone
x jaeger-1.6.0-darwin-amd64/jaeger-agent
x jaeger-1.6.0-darwin-amd64/jaeger-collector

This archive includes the production-grade binaries for the Jaeger backend, namely jaeger-query, jaeger-agent, and jaeger-collector, which we will not use in this chapter. We only need the all-in-one packaging of the Jaeger backend jaeger-standalone, which combines all the backend components into one executable, with no additional dependencies.

The Jaeger backend listens on half a dozen different ports, so if you run into port conflicts, you may need to find out which other software listens on the same ports and shut it down temporarily. The risk of port conflicts is greatly reduced if you run Jaeger all-in-one as a Docker container.

The executable example-hotrod is the HotROD demo application we will be using in this chapter to demonstrate the features of distributed tracing.

Running from Docker images

Like most software designed for the cloud-native era, Jaeger components are distributed in the form of Docker container images, so we recommend installing a working Docker environment. Please refer to the Docker documentation (https://docs.docker.com/install/) for installation instructions. To quickly verify your Docker setup, run the following command:

$ docker run --rm jaegertracing/all-in-one:1.6 version

You may first see some Docker output as it downloads the container images, followed by the program's output:

{"gitCommit":"77a057313273700b8a1c768173a4c663ca351907","GitVersion":"v1.6.0","BuildDate":"2018-07-10T16:23:52Z"}

Running from the source code

The HotROD demo application contains some standard tracing instrumentation, as well as a number of custom instrumentation techniques that give more insight into the application's behavior and performance. We will discuss these techniques in this chapter and show some code examples. If you want to dive deeper into the code and maybe tweak it, we recommend downloading the full source code.

Go language development environment

The HotROD application, just like Jaeger itself, is implemented in the Go language, thus a working Go v1.10 or later development environment is required to run it from source. Please refer to the documentation (https://golang.org/doc/install) for installation instructions.

Jaeger source code

The HotROD application is located in the examples directory of the main Jaeger backend repository, https://github.com/jaegertracing/jaeger/. If you have Git installed, you can check out the source code as follows:

$ mkdir -p $GOPATH/src/github.com/jaegertracing
$ cd $GOPATH/src/github.com/jaegertracing
$ git clone https://github.com/jaegertracing/jaeger.git jaeger
$ cd jaeger
$ git checkout v1.6.0

Alternatively, you can download the source code bundle from the Jaeger release page (https://github.com/jaegertracing/jaeger/releases/tag/v1.6.0/), and make sure that the code is extracted into the $GOPATH/src/github.com/jaegertracing/jaeger/ directory.

Jaeger 1.6 uses the github.com/Masterminds/glide utility as its dependency manager, so you will need to install it:

$ go get -u github.com/Masterminds/glide

After installing glide, run it to download the libraries that Jaeger depends on:

$ cd $GOPATH/src/github.com/jaegertracing/jaeger/
$ glide install

Now you should be able to build and run the HotROD binary:

$ go run ./examples/hotrod/main.go help
HotR.O.D. - A tracing demo application.

Usage:
  jaeger-demo [command]

Available Commands:
  all         Starts all services
  customer    Starts Customer service
  driver      Starts Driver service
  frontend    Starts Frontend service
  help        about any command
  route       Starts Route service
[... skipped ...]

It is also possible to run the Jaeger backend from the source code. However, it requires an additional setup of Node.js in order to compile the static assets for the UI, which may not even work on an OS like Windows, so I do not recommend it for this chapter's examples.

Start Jaeger

Before we run the demo application, let's make sure we can run the Jaeger backend to collect the traces, as otherwise we might get a lot of error logs. A production installation of the Jaeger backend would consist of many different components, including some highly scalable databases like Cassandra or Elasticsearch. For our experiments, we do not need that complexity or even the persistence layer. Fortunately, the Jaeger distribution includes a special component called all-in-one just for this purpose. It runs a single process that embeds all other components of a normal Jaeger installation, including the web user interface. Instead of a persistent storage, it keeps all traces in memory.

If you are using Docker, you can run Jaeger all-in-one with the following command:

$ docker run -d --name jaeger \
    -p 6831:6831/udp \
    -p 16686:16686 \
    -p 14268:14268 \
    jaegertracing/all-in-one:1.6

The -d flag makes the process run in the background, detached from the terminal. The --name flag sets a name by which this process can be located by other Docker containers. We also use the -p flag to expose three ports on the host network that the Jaeger backend is listening to.

The first port, 6831/udp, is used to receive tracing data from applications instrumented with Jaeger tracers, and the second port, 16686, is where we can find the web UI. We also map the third port, 14268, in case we have issues with UDP packet limits and need to use the HTTP transport for sending traces (discussed later in this chapter).

The process listens to other ports as well, for example, to accept traces in other formats, but they are not relevant for our exercise. Once the container starts, open http://127.0.0.1:16686/ in the browser to access the UI.
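
For illustration, here is a hedged sketch of how an application might configure a Jaeger tracer to report spans to that agent port, using the jaeger-client-go config package as it looked around the 1.x client releases; the service name is made up, and field names may differ in later client versions.

```go
package main

import (
	"io"
	"log"

	opentracing "github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// initTracer builds a Jaeger tracer that samples every trace and reports
// spans to the agent UDP port exposed by the all-in-one container.
func initTracer(serviceName string) (opentracing.Tracer, io.Closer) {
	cfg := jaegercfg.Configuration{
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "const", // sample 100% of traces; fine for a demo, not for production
			Param: 1,
		},
		Reporter: &jaegercfg.ReporterConfig{
			LocalAgentHostPort: "localhost:6831",
			LogSpans:           true,
		},
	}
	tracer, closer, err := cfg.New(serviceName)
	if err != nil {
		log.Fatalf("cannot initialize Jaeger tracer: %v", err)
	}
	return tracer, closer
}

func main() {
	tracer, closer := initTracer("hello-service")
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	span := tracer.StartSpan("say-hello")
	span.Finish()
}
```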

If you chose to download the binaries instead of Docker images, you can run the executable named jaeger-standalone, without any arguments, which will listen on the same ports. jaeger-standalone is the binary used to build the jaegertracing/all-in-one Docker image (in the later versions of Jaeger, it has been renamed jaeger-all-in-one).

$ cd jaeger-1.6.0-darwin-amd64/
$ ./jaeger-standalone
[... skipped ...]
{"msg":"Starting agent"}
{"msg":"Starting jaeger-collector TChannel server","port":14267}
{"msg":"Starting jaeger-collector HTTP server","http-port":14268}
[... skipped ...]
{"msg":"Starting jaeger-query HTTP server","port":16686}
{"msg":"Health Check state change","status":"ready"}
[... skipped ...]

We removed some fields of the log statements (level, timestamp, and caller) to improve readability.

Figure 2.1: Jaeger UI front page: the search screen

Since the all-in-one binary runs the Jaeger backend with an in-memory database, which is initially empty, there is not much to see in the UI right away. However, the Jaeger backend has self-tracing enabled, so if we reload the home page a few times, we will see the Services dropdown in the top-left corner display jaeger-query, which is the name of the microservice running the UI component. We can now hit the Search button to find some traces, but let's first run the demo application to get more interesting traces.

Meet the HotROD

HotROD is a mock-up "ride-sharing" application (ROD stands for Rides on Demand) that is maintained by the Jaeger project. We will discuss its architecture later, but first let's try to run it. If you are using Docker, you can run it with this command:

$ docker run --rm -it \
    --link jaeger \
    -p8080-8083:8080-8083 \
    jaegertracing/example-hotrod:1.6 \
    all \
    --jaeger-agent.host-port=jaeger:6831

Let's quickly review what is going on with this command:

  • The --rm flag instructs Docker to automatically remove the container once the program exits.
  • The -it flags attach the container's standard in and out streams to the terminal.
  • The --link flag tells Docker to make the hostname jaeger available inside the container's networking namespace and resolve it to the Jaeger backend we started earlier.
  • The string all, after the image name, is the command to the HotROD application, instructing it to run all microservices from the same process. It is possible to run each microservice as a separate process, even on different machines, to simulate a truly distributed application, but we leave this as an exercise for you.
  • The final flag tells the HotROD application to configure the tracer to send data to UDP port 6831 on hostname jaeger.

To run HotROD from downloaded binaries, run the following command:

$ example-hotrod all

If we are running both the Jaeger all-in-one and the HotROD application from the binaries, they bind their ports directly to the host network and are able to find each other without any additional configuration, due to the default values of the flags.

Sometimes users experience issues with getting traces from the HotROD application due to the default UDP settings in the OS. Jaeger client libraries batch up to 65,000 bytes per UDP packet, which is still a safe number to send via the loopback interface (that is, localhost) without packet fragmentation. However, macOS, for example, has a much lower default for the maximum datagram size. Rather than adjusting the OS settings, another alternative is to use the HTTP protocol between Jaeger clients and the Jaeger backend. This can be done by passing the following flag to the HotROD application:

--jaeger-agent.host-port=http://localhost:14268/api/traces

Or, if using the Docker networking namespace:

--jaeger-agent.host-port=http://jaeger:14268/api/traces
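
Alternatively, the OS limit itself can be raised. On macOS, for example, the relevant kernel setting is the maximum datagram size; the value shown here is purely an illustration and not required for this exercise:

$ sudo sysctl net.inet.udp.maxdgram=65536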

Once the HotROD process starts, the logs written to the standard output will show the microservices starting several servers on different ports (for better readability, we removed the timestamps and references to the source files):

INFO Starting all services
INFO Starting {"service": "route", "address": "http://127.0.0.1:8083"}
INFO Starting {"service": "frontend", "address": "http://127.0.0.1:8080"}
INFO Starting {"service": "customer", "address": "http://127.0.0.1:8081"}
INFO TChannel listening {"service": "driver", "hostPort": "127.0.0.1:8082"}

Let's navigate to the application's web frontend at http://127.0.0.1:8080/:

Figure 2.2: The HotROD single-page web application

We have four customers, and by clicking one of the four buttons, we summon a car to arrive at the customer's location, perhaps to pick up a product and deliver it elsewhere. Once a request for a car is sent to the backend, it responds with the car's license plate number, T757183C, and the expected time of arrival of two minutes:

Figure 2.3: We ordered a car that will arrive in two minutes

There are a few bits of debugging information we see on the screen:

  1. In the top-left corner, there is a web client id: 6480. This is a random session ID assigned by the JavaScript UI. If we reload the page, we get a different session ID.
  2. In the brackets after the car information, we see a request ID, req: 6480-1. This is a unique ID assigned by the JavaScript UI to each request it makes to the backend, composed of the session ID and a sequence number.
  3. The last bit of debugging data, latency: 772ms, is measured by the JavaScript UI and shows how long the backend took to respond.

This additional information has no impact on the behavior of the application but will be useful when we investigate performance problems.

The architecture

Now that we have seen what the HotROD application does, we may want to know how it is architected. After all, maybe all those servers we saw in the logs are just for show, and the whole application is simply a JavaScript frontend. Rather than asking someone for a design document, wouldn't it be great if our monitoring tools could build the architecture diagram automatically, by observing the interactions between the services? That's exactly what distributed tracing systems like Jaeger can do. That request for a car we executed earlier has provided Jaeger with enough data to connect the dots.

Let's go to the Dependencies page in the Jaeger UI. At first, we will see a tiny diagram titled Force Directed Graph, but we can ignore it, as that particular view is really designed for showing architectures that contain hundreds or even thousands of microservices. Instead, click on the DAG tab (Directed Acyclic Graph), which shows an easier-to-read graph. The graph layout is non-deterministic, so your view may have the second-level nodes in a different order than in the following screenshot:

Figure 2.4: Empirically constructed diagram of a services call graph

As it turns out, the single HotROD binary is actually running four microservices and, apparently, two storage backends: Redis and MySQL. The storage nodes are not actually real: they are simulated by the application as internal components, but the top four microservices are indeed real. We saw each of them logging the network address of the server it runs. The frontend microservice serves the JavaScript UI and makes RPC calls to the other three microservices.

The graph also shows the number of calls that were made to handle the single request for a car, for example, the route service was called 10 times, and there were 14 calls to Redis.

The data flow

We have learned that the application consists of several microservices. What exactly does the request flow look like? It is time to look at the actual trace. Let's go to the Search page in the Jaeger UI. Under the Find Traces caption, the Services dropdown contains the names of the services we saw in the dependency diagram. Since we know that frontend is the root service, let's choose it and click the Find Traces button.

Figure 2.5: Results of searching for all traces in the last hour from the service frontend

The system found two traces and displayed some metadata about them, such as the names of different services that participated in the traces, and the number of spans each service emitted to Jaeger. We will ignore the second trace that represents the request to load the JavaScript UI and focus on the first trace, named frontend: HTTP GET /dispatch. This name is a concatenation of the service name frontend and the operation name of the top-level span, in this case HTTP GET /dispatch.

On the right side, we see that the total duration of the trace was 757.73ms. This is shorter than the 772ms we saw in the HotROD UI, which is not surprising because the latter was measured from the HTTP client side by JavaScript, while the former was reported by the Go backend. The 14.27ms difference between these numbers can be attributed to the network latency. Let's click on the trace title bar.

Figure 2.6: Trace timeline view. At the top is the name of the trace, which is combined from the service name and the operation name of the root span. On the left is the hierarchy of calls between microservices, as well as within microservices (internal operations can also be represented as spans). The calls from the frontend service to the route service are collapsed to save space. Some of the calls from the driver service to Redis have red circles with white exclamation points in them, indicating an error in the operation. On the right side is the Gantt chart showing spans on the horizontal timeline. The Gantt chart is interactive and clicking on a span can reveal additional information.

The timeline view shows a typical view of a trace as a time sequence of nested spans, where a span represents a unit of work within a single service. The top-level span, also called the root span, represents the handling of the main HTTP request from the JavaScript UI by the frontend service (server span), which in turn called the customer service, which in turn called a MySQL database. The width of the spans is proportional to the time a given operation took. This may represent a service doing some work or waiting on a downstream call.

From this view, we can see how the application handles a request:

  1. The frontend service receives the external HTTP GET request at its /dispatch endpoint.
  2. The frontend service makes an HTTP GET request to the /customer endpoint of the customer service.
  3. The customer service executes a SELECT SQL statement in MySQL. The results are returned to the frontend service.
  4. Then the frontend service makes an RPC request, Driver::findNearest, to the driver service. Without drilling more into the trace details, we cannot tell which RPC framework is used to make this request, but we can guess it is not HTTP (it is actually made over TChannel [1]).
  5. The driver service makes a series of calls to Redis. Some of those calls show a red circle with an exclamation point, indicating failures.
  6. After that, the frontend service executes a series of HTTP GET requests to the /route endpoint of the route service.
  7. Finally, the frontend service returns the result to the external caller (for example, the UI).

We can tell all of this pretty much just by looking at the high-level Gantt chart presented by the end-to-end tracing tool.

Contextualized logs

We now have a pretty good idea about what the HotROD application does, if not exactly how it does it. For example, why does the frontend service call the /customer endpoint of the customer service? Of course, we can look at the source code, but we are trying to approach this from the point of view of application monitoring. One direction we could take is to look at the logs the application writes to its standard output (Figure 2.7).

Figure 2.7: Typical logging output

It is quite difficult to follow the application logic from these logs, even though we are looking at the output produced while the application handled just a single request.

We are also lucky that the logs from four different microservices are combined in a more-or-less consistent stream. Imagine many concurrent requests going through the system and through microservices running in different processes! The logs would become nearly useless in that case. So, let's take a different approach. Let's view the logs collected by the tracing system. For example, click on the root span to expand it and then click on the Logs (18) section to expand and see the logs (18 refers to the number of log statements captured in this span). These logs give us more insight into what the /dispatch endpoint was doing (Figure 2.8):

  1. It called the customer service to retrieve some information about the customer given customer_id=123.
  2. It then retrieved N available drivers closest to the customer location (115,277). From the Gantt chart, we know this was done by calling the driver service.
  3. Finally, it called the route service to find the shortest route between driver location (indicated as "pickup") and customer location (indicated as "drop-off").

Figure 2.8: Logs recorded by the tracing system in the root span. The hostname is masked in all screenshots for privacy.

Let's close the root span and open another one; specifically, one of the failed calls to Redis (Figure 2.9). The span has a tag error=true, which is why the UI highlighted it as failed. The log statement explains the nature of the error as "Redis timeout." The log also includes the driver_id that the driver service was attempting to retrieve from Redis. All these details may provide very useful information during debugging.

Figure 2.9: Expanded span details after clicking on a failed GetDriver span in the redis service, which is marked with a white exclamation point in a red circle. The log entry explains that this was a Redis timeout and indicates which driver ID was queried from the database.

The distinct feature of a tracing system is that it only shows the logs that happened during the execution of a given request. We call these logs contextualized because they are captured not only in the context of a specific request, but also in the context of a specific span within the trace for that request.

In the traditional log output, these log statements would have been mixed with a lot of other statements from parallel requests, but in the tracing system, they are neatly isolated to the service and span where they are relevant. Contextualized logs allow us to focus on the behavior of the application, without worrying about logs from other parts of the program or from other concurrent requests.

As we can see, using a combination of a Gantt chart, span tags, and span logs, the end-to-end tracing tool lets us easily understand the architecture and data flow of the application, and enables us to zoom in on the details of individual operations.

Span tags versus logs

Let's expand a couple more spans.

Figure 2.10: Two more spans expanded to show a variety of tags and logs. Each span also contains a section called "Process" that also looks like a collection of tags. The process tags describe the application that was producing the tracing record, rather than an individual span.

In the customer span, we can see a tag http.url that shows that the request at the /customer endpoint had a parameter customer=123, as well as two logs narrating the execution during that span. In the mysql span, we see an sql.query tag showing the exact SQL query that was executed: SELECT * FROM customer WHERE customer_id=123, and a log about acquiring some lock.

What is the difference between a span tag and a span log? They are both annotating the span with some contextual information. Tags typically apply to the whole span, while logs represent some events that happened during the span execution. A log always has a timestamp that falls within the span's start-end time interval. The tracing system does not explicitly track causality between logged events the way it keeps track of causality relationships between spans, because it can be inferred from the timestamps.

An astute reader will notice that the /customer span records the URL of the request twice, in the http.url tag and in the first log. The latter is actually redundant but was captured in the span because the code logged this information using the normal logging facility, which we will discuss later in this chapter.

The OpenTracing Specification [3] defines semantic data conventions that prescribe certain well-known tag names and log fields for common scenarios. Instrumentation is encouraged to use those names to ensure that the data reported to the tracing system is well defined and portable across different tracing backends.
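
To make the distinction concrete, here is a minimal sketch, not taken from HotROD, of how instrumentation might record a tag and a log on a span using the OpenTracing Go API; the span variable and field values are purely illustrative:

import (
    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    otlog "github.com/opentracing/opentracing-go/log"
)

func annotateSpan(span opentracing.Span) {
    // A tag describes the span as a whole, for example, the URL it served.
    ext.HTTPUrl.Set(span, "http://127.0.0.1:8081/customer?customer=123")
    span.SetTag("http.method", "GET")

    // A log records a timestamped event that happened during the span.
    span.LogFields(
        otlog.String("event", "acquired lock"),
        otlog.Int("waiters", 3),
    )
}

The tag names used here (http.url, http.method) follow the semantic conventions mentioned in the following paragraph, which is why helpers like ext.HTTPUrl exist in the OpenTracing libraries.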

Identifying sources of latency

So far, we have not discussed the performance characteristics of the HotROD application. If we refer to Figure 2.6, we can easily make the following conclusions:

  1. The call to the customer service is on the critical path because no other work can be done until we get back the customer data that includes the location to which we need to dispatch the car.
  2. The driver service retrieves N nearest drivers given the customer's location and then queries Redis for each driver's data in a sequence, which can be seen in the staircase pattern of Redis GetDriver spans. If these operations can be done in parallel, the overall latency can be reduced by almost 200ms.
  3. The calls to the route service are not sequential, but not fully parallel either. We can see that, at most, three requests can be in progress, and as soon as one of them ends, another request starts. This behavior is typical when we use a fixed-size executor pool.

Figure 2.11: Recognizing the sources of latency. The call to mysql appears to be on the critical path and takes almost 40% of the trace time, so clearly it is a good target for some optimization. The calls from the driver service to Redis look like a staircase, hinting at a strictly sequential execution that perhaps can be done in parallel to expedite the middle part of the trace.

Figure 2.12: Recognizing sources of latency (continued). This screenshot is taken after we used a zoom-in feature in the mini-map to only look at the last 200ms of the trace (by dragging the mouse horizontally across the area of interest). It is easy to see that the requests from the frontend service to the route service are done in parallel, but no more than three requests at a time. Red arrows point out how as soon as one request ends, another one starts. This pattern indicates some sort of contention, most likely a worker pool that only has three workers.

What happens if we make many requests to the backend simultaneously? Let's go to the HotROD UI and click on one of the buttons repeatedly (and quickly).

Figure 2.13: Executing many requests simultaneously shows increasing latency of the responses

As we can see, the more requests that are being processed concurrently, the longer it takes for the backend to respond. Let's take a look at the trace of the longest request. We could do it in two ways. We can simply search for all traces and pick the one with the highest latency, represented by the longest cyan-colored title bar (Figure 2.14):

Figure 2.14: Multiple traces returned in the search results, sorted by most recent first

Another way is to search by tags or logs on the span. The root span emits a final log, where it records the license plate number of the closest car as one of the log fields:

Figure 2.15: License plate number T796774C recorded in one of the log events as the field driver. Each log entry can be individually expanded to show fields in a table, as opposed to a single row.

The Jaeger backend indexes all spans by both tags and log fields, and we can find that trace by specifying driver=T796774C in the Tags search box:

Figure 2.16: Searching for a single trace by a field in a log entry: driver=T796774C

This trace took 1.43 seconds, about 90% longer than our first trace, which took only 757ms (measured from the server side). Let's open it and investigate what is different:

Figure 2.17: Higher-latency trace. The database query (mysql span) took 1s, significantly longer than the 300ms or so that it took when only a single request was processed by the application.

The most apparent difference is that the database query (the mysql span) takes a lot longer than before: 1s instead of 323ms. Let's expand that span and try to find out why:

Figure 2.18: Inspecting a very long mysql span

In the log entries of the span, we see that execution was blocked waiting for a lock for more than 700ms. This is clearly a bottleneck in the application, but before we dive into that, let's look at the first log record, evidently emitted before getting blocked on the lock: Waiting for lock behind 3 transactions. Blockers=[6480-4 6480-5 6480-6]. It tells us how many other requests were already queued for this lock, and even gives us the identity of those requests. It is not too hard to imagine a lock implementation that keeps track of how many goroutines are blocked, but where would it get the identity of the requests?

If we expand the previous span for the customer service, we can see that the only data passed to it via an HTTP request was the customer ID 392. In fact, if we inspect every span in the trace, we will not find any remote call where the request ID, like 6480-5, was passed as a parameter.

Figure 2.19: The parent span of the database call expanded. It represents an HTTP call from the frontend service to the customer service. The Tags section of the span is expanded to show tags in a tabular format. The http.url tag shows that customer=392 was the only parameter passed by the caller to the HTTP endpoint.

This magic appearance of blocking request IDs in the logs is due to a custom instrumentation in HotROD that makes use of a distributed context propagation mechanism, which is called baggage in the OpenTracing API.

As we will see in Chapter 3, Distributed Tracing Fundamentals, end-to-end tracing works because tracing instrumentation is designed to propagate certain metadata across thread, component, and process boundaries, throughout the whole distributed call graph. Trace and span IDs are examples of such metadata. Another example is baggage, which is a general-purpose key-value store embedded in every inter-process request. HotROD's JavaScript UI stores session ID and request ID in the baggage before making the request to the backend, and that baggage is transparently made available via OpenTracing instrumentation to every service involved in handling the request, without the need to pass that information explicitly as request parameters.

It is an extremely powerful technique that can be used to propagate various useful pieces of information (such as tenancy) in the context of a single request throughout the architecture, without needing to change every service to understand what they are propagating. We will discuss more examples of using metadata propagation for monitoring, profiling, and other use cases in Chapter 10, Distributed Context Propagation.

In our example, knowing the identities of the requests stuck in the queue ahead of our slow request allows us to find traces for those requests, and analyze them as well. In real production systems, this could lead to unexpected discoveries, such as a long-running request spoiling a lot of other requests that are normally very fast. Later in this chapter, we will see another example of using baggage.
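
As a rough illustration (this is not the actual HotROD or JavaScript UI code, and the key name "session" is just an example), writing and reading a baggage item with the OpenTracing Go API looks like this:

import (
    "context"

    "github.com/opentracing/opentracing-go"
)

// setSessionBaggage stores a session ID so that every downstream service
// handling this request can read it, without any API changes.
func setSessionBaggage(ctx context.Context, sessionID string) {
    if span := opentracing.SpanFromContext(ctx); span != nil {
        span.SetBaggageItem("session", sessionID)
    }
}

// sessionFromBaggage reads the ID back in any service participating
// in the same distributed request.
func sessionFromBaggage(ctx context.Context) string {
    if span := opentracing.SpanFromContext(ctx); span != nil {
        return span.BaggageItem("session")
    }
    return ""
}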

Now that we know that the mysql call gets stuck on a lock, we can easily fix it. As we mentioned earlier, the application does not actually use the MySQL database, just a simulation of it, and the lock is meant to represent a single database connection shared between multiple goroutines. We can find the code in the file examples/hotrod/services/customer/database.go:

if !config.MySQLMutexDisabled {
    // simulate misconfigured connection pool that only gives
    // one connection at a time
    d.lock.Lock(ctx)
    defer d.lock.Unlock()
}

// simulate db query delay
delay.Sleep(config.MySQLGetDelay, config.MySQLGetDelayStdDev)

If the locking behavior is not disabled via configuration, we acquire the lock before simulating the SQL query delay. The statement defer d.lock.Unlock() is used to release the lock before we exit the surrounding function.

Notice how we pass the ctx parameter to the lock object. context.Context is a standard way in Go to pass request-scoped data throughout an application. The OpenTracing span is stored in the Context, which allows the lock to inspect it and retrieve the JavaScript's request ID from the baggage. The code for this custom implementation of a mutex can be found in the source file examples/hotrod/pkg/tracing/mutex.go.
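
Conceptually, such a context-aware mutex might look like the following simplified sketch; this is not the HotROD implementation, and the baggage key "request" is assumed here for illustration:

import (
    "context"
    "fmt"
    "sync"

    "github.com/opentracing/opentracing-go"
)

// Mutex wraps sync.Mutex and remembers which requests are waiting for it,
// using the request ID carried in the span baggage.
type Mutex struct {
    realLock    sync.Mutex
    waitersLock sync.Mutex
    waiters     []string
}

func (m *Mutex) Lock(ctx context.Context) {
    requestID := "unknown"
    span := opentracing.SpanFromContext(ctx)
    if span != nil {
        if id := span.BaggageItem("request"); id != "" {
            requestID = id
        }
    }

    m.waitersLock.Lock()
    blockers := append([]string(nil), m.waiters...) // requests queued ahead of us
    m.waiters = append(m.waiters, requestID)
    m.waitersLock.Unlock()

    if span != nil && len(blockers) > 0 {
        span.LogKV("event", "waiting for lock", "blockers", fmt.Sprint(blockers))
    }

    m.realLock.Lock()

    // We hold the lock now; remove ourselves from the waiter list.
    m.waitersLock.Lock()
    for i, w := range m.waiters {
        if w == requestID {
            m.waiters = append(m.waiters[:i], m.waiters[i+1:]...)
            break
        }
    }
    m.waitersLock.Unlock()
}

func (m *Mutex) Unlock() {
    m.realLock.Unlock()
}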

We can see that the mutex behavior is protected by a configuration parameter:

if !config.MySQLMutexDisabled {
    // . . .
}

Fortunately, the HotROD application exposes command line flags to change these configuration parameters. We can find the flags by running the HotROD binary with the help command:

$ ./example-hotrod help
HotR.O.D. - A tracing demo application.
[... skipped ...]
Flags:
  -D, --fix-db-query-delay, duration     Average latency of MySQL DB
                                         query (default 300ms)
  -M, --fix-disable-db-conn-mutex        Disables the mutex guarding
                                         db connection
  -W, --fix-route-worker-pool-size, int  Default worker pool size
                                         (default 3)
[... skipped ...]

The flags that control parameters for latency-affecting logic all start with the --fix prefix. In this case, we want the flag --fix-disable-db-conn-mutex, or -M as a short form, to disable the blocking behavior. We also want to reduce the default 300ms latency of simulated database queries, controlled by flag -D, to make it easier to see the results of this optimization.

Let's restart the HotROD application using these flags, to pretend that we fixed the code to use a connection pool with enough capacity that our concurrent requests do not have to compete for connections (the logs are again trimmed for better readability):

$ ./example-hotrod -M -D 100ms all
INFO Using expvar as metrics backend
INFO fix: overriding MySQL query delay {"old": "300ms", "new": "100ms"}
INFO fix: disabling db connection mutex
INFO Starting all services

We can see in the logs that the changes are taking effect. To see how it works out, reload the HotROD web page and repeat the experiment of issuing many simultaneous requests by clicking one of the buttons many times in quick succession.

Figure 2.20: Partially improved request latency after "fixing" the database query bottleneck. No requests run longer than a second, but some are still pretty slow; for example, request #5 is still 50% slower than request #1.

The latency still increases as we add more requests to the system, but it no longer grows as dramatically as with the single database bottleneck from before. Let's look at one of the longer traces again.

Figure 2.21: Trace of another pretty slow request, after removing the database query bottleneck

As expected, the mysql span stays at around 100ms, regardless of the load. The driver span is not expanded, but it takes the same time as before. The interesting change is in the route calls, which now take more than 50% of the total request time. Previously, we saw these requests executing in parallel three at a time, but now we often see only one at a time, and even a gap right after the frontend to driver call when no requests to the route service are running. Clearly, we have contention with other goroutines over some limited resource, and we can also see that the gaps happen between the spans of the frontend service, which means the bottleneck is not in the route service, but in how the frontend service calls it.

Let's look at function getRoutes() in the source file services/frontend/best_eta.go:

// getRoutes calls Route service for each (customer, driver) pair
func (eta *bestETA) getRoutes(
    ctx context.Context, 
    customer *customer.Customer, 
    drivers []driver.Driver,
) []routeResult {
    results := make([]routeResult, 0, len(drivers))
    wg := sync.WaitGroup{}
    routesLock := sync.Mutex{}
    for _, dd := range drivers {
        wg.Add(1)
        driver := dd // capture loop var
        // Use worker pool to (potentially) execute
        // requests in parallel
        eta.pool.Execute(func() {
            route, err := eta.route.FindRoute(
                ctx, driver.Location, customer.Location,
            )
            routesLock.Lock()
            results = append(results, routeResult{
                driver: driver.DriverID,
                route:  route,
                err:    err,
            })
            routesLock.Unlock()
            wg.Done()
        })
    }
    wg.Wait()
    return results
}

This function receives a customer record (with address) and a list of drivers (with their current locations), then calculates the expected time of arrival (ETA) for each driver. It calls the route service for each driver inside an anonymous function executed via a pool of goroutines, by passing the function to eta.pool.Execute().

Since all functions are executed asynchronously, we track their completion with the wait group, wg, which implements a countdown latch: for every new function, we increment its count with wg.Add(1), and then we block on wg.Wait() until each of the spawned functions calls wg.Done().
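
The pool referenced here is just a fixed-size executor. As a rough sketch (this is not the actual HotROD implementation), such a pool could be built on a channel of jobs served by a fixed number of goroutines:

import "sync"

// Pool runs submitted functions on a fixed number of worker goroutines.
type Pool struct {
    jobs chan func()
    wg   sync.WaitGroup
}

func NewPool(workers int) *Pool {
    p := &Pool{jobs: make(chan func())}
    for i := 0; i < workers; i++ {
        p.wg.Add(1)
        go func() {
            defer p.wg.Done()
            for job := range p.jobs {
                job()
            }
        }()
    }
    return p
}

// Execute blocks until one of the workers is free to pick up the job.
func (p *Pool) Execute(job func()) {
    p.jobs <- job
}

// Stop lets in-flight jobs finish and shuts the workers down.
func (p *Pool) Stop() {
    close(p.jobs)
    p.wg.Wait()
}

With only three workers, a batch of ten jobs queues up exactly as we saw in the trace: at most three route calls run at once, and a new one starts as soon as a worker frees up.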

As long as we have enough executors (goroutines) in the pool, we should be able to run all calculations in parallel. The size of the executor pool is defined in services/config/config.go:

RouteWorkerPoolSize = 3

The default value of three explains why we saw, at most, three parallel executions in the very first trace we inspected. Let's change it to 100 (goroutines in Go are cheap) using the -W command line flag, and restart HotROD:

$ ./example-hotrod -M -D 100ms -W 100 all
INFO	Using expvar as metrics backend
INFO	fix: overriding MySQL query delay {"old": "300ms", "new": "100ms"}
INFO	fix: disabling db connection mutex
INFO	fix: overriding route worker pool size {"old": 3, "new": 100}
INFO	Starting all services

One more time, reload the HotROD web UI and repeat the experiment. We have to click on the buttons really quickly now because the requests return in less than half a second.

Figure 2.22: Latency results after fixing the worker pool bottleneck. All requests return in less than half a second.

If we look at one of these new traces, we will see that, as expected, the calls from frontend to the route service are all done in parallel now, thus minimizing the overall request latency. We leave the final optimization of the driver service as an exercise for you.

Figure 2.23: Trace after fixing the worker pool bottleneck. The frontend is able to fire 10 requests to the route service almost simultaneously.

Resource usage attribution

In this last example, we will discuss a technique that is not, strictly speaking, a functionality commonly provided by distributed tracing systems, but rather a side effect of the distributed context propagation mechanism that underlies all tracing instrumentation and can be relied upon in applications instrumented for tracing. We saw an example of this earlier when we discussed an implicit propagation of the frontend request ID used to tag transactions that were blocking in a mutex queue. In this section, we will discuss the use of metadata propagation for resource usage attribution.

Resource usage attribution is an important function in large organizations, especially for capacity planning and efficiency improvements. We can define it as the process of measuring the usage of some resource, such as CPU cores or disk space, and attributing it to some higher-level business parameter, such as a product or a business line. For example, consider a company that has two lines of business that together require 1,000 CPU cores to operate, as might be measured by its cloud provider.

Let's assume the company projects that one line of business will grow 10% next year, while the other will grow 100%. Let's also assume, for simplicity, that the hardware needs are proportional to the size of each business. We are still not able to predict how much extra capacity the company will need because we do not know how those current 1,000 CPU cores are attributed to each business line.

If the first business line is actually responsible for consuming 90% of the hardware, then its hardware needs will increase from 900 to 990 cores, and the second business line's needs will increase from 100 to 200 CPU cores, for a total of an extra 190 cores across both business lines. On the other hand, if the current needs of the business lines are split 50/50, then the total capacity requirement for next year will be 500 * 1.1 + 500 * 2.0 = 1,550 cores.

The main difficulty in resource usage attribution stems from the fact that most technology companies use shared resources for running their business. Consider such products as Gmail and Google Docs. Somewhere, at the top level of the architecture, they may have dedicated pools of resources, for example, load balancers and web servers, but the lower we go down the architecture, the more shared resources we usually find.

At some point, the dedicated resource pools, like web servers, start accessing shared resources like Bigtable, Chubby, Google File System, and so on. It is often inefficient to partition those lower layers of architecture into distinct subsets in order to support multi-tenancy. If we require all requests to explicitly carry the tenant information as a parameter, for example, tenant="gmail" or tenant="docs", then we can accurately report resource usage by the business line. However, such a model is very rigid and hard to extend if we want to break down the attribution by a different dimension, as we need to change the APIs of every single infrastructure layer to pass that extra dimension. We will now discuss an alternative solution that relies on metadata propagation.

We have seen in the HotROD demo that the calculation of the shortest route performed by the route service is a relatively expensive operation (probably CPU intensive). It would be nice if we could calculate how much CPU time we spend per customer. However, the route service is an example of a shared infrastructure resource that is further down the architectural layers from the point where we know about the customer. It does not need to know about the customer in order to calculate the shortest route between two points.

Passing the customer ID to the route service just to measure the CPU usage would be poor API design. Instead, we can use the distributed metadata propagation built into the tracing instrumentation. In the context of a trace, we know for which customer the system is executing the request, and we can use metadata (baggage) to transparently pass that information throughout the architecture layers, without changing all the services to accept it explicitly.

If we want to perform CPU usage aggregation by some other dimension, say, the session ID from the JavaScript UI, we can do so without changing all the components as well.

To demonstrate this approach, the route service contains the code to attribute the CPU time of calculations to the customer and session IDs, which it reads from the baggage. In the services/route/server.go file, we can see this code:

func computeRoute(
    ctx context.Context, 
    pickup, dropoff string,
) *Route {
    start := time.Now()
    defer func() {
        updateCalcStats(ctx, time.Since(start))
    }()
    // actual calculation ...
}

As with the instrumented mutex we saw earlier, we don't pass any customer/session IDs because they can be retrieved from baggage via the context. The code actually uses some static configuration to know which baggage items to extract and how to report the metrics.

var routeCalcByCustomer = expvar.NewMap(
    "route.calc.by.customer.sec",
)
var routeCalcBySession = expvar.NewMap(
    "route.calc.by.session.sec",
)
var stats = []struct {
    expvar  *expvar.Map
    baggage string
}{
    {routeCalcByCustomer, "customer"},
    {routeCalcBySession, "session"},
}

This code uses the expvar package ("exposed variables") from Go's standard library. It provides a standard interface to global variables that can be used to accumulate statistics about the application, such as operation counters, and it exposes these variables in JSON format via an HTTP endpoint, /debug/vars.

The expvar variables can be standalone primitives, like float and string, or they can be grouped into named maps for more dynamic statistics. In the preceding code, we define two maps: one keyed by customer ID and another by session ID, and combine them in the stats structure (an array of anonymous structs) with the names of metadata attributes that contain the corresponding ID.

The updateCalcStats() function first converts the elapsed time of an operation into seconds as a float value, then iterates through the stats array and checks if the span metadata contains the desired key ("customer" or "session"). If the baggage item with that key is not empty, it increments the counter for that key by calling AddFloat() on the respective expvar.Map to aggregate the time spent on computing the route.

func updateCalcStats(ctx context.Context, delay time.Duration) {
    span := opentracing.SpanFromContext(ctx)
    if span == nil {
        return
    }
    delaySec := float64(delay/time.Millisecond) / 1000.0
    for _, s := range stats {
        key := span.BaggageItem(s.baggage)
        if key != "" {
            s.expvar.AddFloat(key, delaySec)
        }
    }
}

If we navigate to http://127.0.0.1:8083/debug/vars, which is one of the endpoints exposed by the HotROD application, we can see route.calc.by.* entries that give us the breakdown of time (in seconds) spent in calculation on behalf of each customer, and some UI sessions.

route.calc.by.customer.sec: {
    Amazing Coffee Roasters: 1.479,
    Japanese Deserts: 2.019,
    Rachel's Floral Designs: 5.938,
    Trom Chocolatier: 2.542
},
route.calc.by.session.sec: {
    0861: 9.448,
    6108: 2.530
},
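
The same endpoint can also be queried from the command line; for example, assuming jq is installed (the filter is optional and only used to pick out one map):

$ curl -s http://127.0.0.1:8083/debug/vars | jq '."route.calc.by.customer.sec"'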

This approach is very flexible. If necessary, this static definition of the stats array can be easily moved to a configuration file to make the reporting mechanism even more flexible. For example, if we wanted to aggregate data by another dimension, say the type of web browser (not that it would make a lot of sense), we would need to add one more entry to the configuration and make sure that the frontend service captures the browser type as a baggage item.

The key point is that we do not need to change anything in the rest of the services. In the HotROD demo, the frontend and route services are very close to each other, so if we had to change the API it would not be a major undertaking. However, in real-life situations, the service where we may want to calculate resource usage can be many layers down the stack, and changing the APIs of all the intermediate services, just to pass an extra resource usage aggregation dimension, is simply not feasible. By using distributed context propagation, we vastly minimize the number of changes needed. In Chapter 10, Distributed Context Propagation, we will discuss other uses of metadata propagation.

In a production environment, using the expvar module is not the best approach, since the data is stored individually in each service instance. However, our example has no hard dependency on the expvar mechanism. We could have easily used a real metrics API and had our resource usage statistics aggregated in a central metrics system like Prometheus.
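
As a rough sketch of that alternative, using the Prometheus Go client (the metric name and labels are made up for illustration; HotROD does not ship this code):

import (
    "context"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/prometheus/client_golang/prometheus"
)

var routeCalcSeconds = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "route_calc_seconds_total",
        Help: "Time spent computing routes, attributed via baggage.",
    },
    []string{"customer", "session"},
)

func init() {
    prometheus.MustRegister(routeCalcSeconds)
}

func updateCalcStatsPrometheus(ctx context.Context, delay time.Duration) {
    span := opentracing.SpanFromContext(ctx)
    if span == nil {
        return
    }
    // Read the attribution dimensions from baggage and increment the counter.
    routeCalcSeconds.
        WithLabelValues(span.BaggageItem("customer"), span.BaggageItem("session")).
        Add(delay.Seconds())
}

Scraped by a central Prometheus server, these counters would give the same per-customer and per-session breakdown, aggregated across all instances of the route service.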

Summary

This chapter introduced a demo application, HotROD, that is instrumented for distributed tracing, and by tracing that application with Jaeger, an open source distributed tracing system, demonstrated the following features common to most end-to-end tracing systems:

  • Distributed transaction monitoring: Jaeger records the execution of individual requests across the whole stack of microservices, and presents them as traces.
  • Performance and latency optimization: Traces provide very simple visual guides to performance issues in the application. The practitioners of distributed tracing often say that fixing the performance issue is the easy part, but finding it is the hard part.
  • Root cause analysis: The highly contextualized nature of the information presented in the traces allows for quickly narrowing down to the parts of the execution that are responsible for issues with the execution (for example, the timeouts when calling Redis, or the mutex queue blocking).
  • Service dependency analysis: Jaeger aggregates multiple traces and constructs a service graph that represents the architecture of the application.
  • Distributed context propagation: The underlying mechanism of metadata propagation for a given request enables not only tracing, but various other features, such as resource usage attribution.

In the next chapter, we will review more of the theoretical underpinnings of distributed tracing, the anatomy of a tracing solution, different implementation techniques that have been proposed in the industry and academia, and various trade-offs that implementors of tracing infrastructure need to keep in mind when making certain architectural decisions.

References

  1. Yuri Shkuro, Evolving Distributed Tracing at Uber Engineering, Uber Engineering blog, February 2017: https://eng.uber.com/distributed-tracing/.
  2. Natasha Woods, CNCF Hosts Jaeger, Cloud Native Computing Foundation blog, September 2017: https://www.cncf.io/blog/2017/09/13/cncf-hosts-jaeger/.
  3. The OpenTracing Authors. Semantic Conventions, The OpenTracing Specification: https://github.com/opentracing/specification/blob/master/semantic_conventions.md.

Chapter 3. Distributed Tracing Fundamentals

Distributed Tracing Fundamentals

Distributed tracing, also known as end-to-end or workflow-centric tracing, is a family of techniques that aim to capture the detailed execution of causally-related activities, performed by the components of a distributed system. Unlike the traditional code profilers or host-level tracing tools, such as dtrace [1], end-to-end tracing is primarily focused on profiling the individual executions cooperatively performed by many different processes, usually running on many different hosts, which is typical of modern, cloud-native, microservices-based applications.

In the previous chapter, we saw a tracing system in action from the end user's perspective. In this chapter, we will discuss the basic underlying ideas of distributed tracing, the various approaches to implementing end-to-end tracing that have been proposed in industry and academia, and the impact and trade-offs of the architectural decisions taken by different tracing systems on their capabilities and the types of problems they can address.

The idea

Consider the following, vastly simplified architectural diagram of a hypothetical e-commerce website. Each node in the diagram represents numerous instances of the respective microservices, handling many concurrent requests. To help with understanding the behavior of this distributed system and its performance or user-visible latency, end-to-end tracing records information about all the work performed by the system on behalf of a given client or request initiator. We will refer to this work as execution or request throughout this book.

The data is collected by means of instrumentation trace points. For example, when the client is making a request to the web server, the client's code can be instrumented with two trace points: one for sending the request and another for receiving the response. The collected data for a given execution is collectively referred to as a trace. One simple way to visualize a trace is via a Gantt chart, as shown on the right in Figure 3.1:

Figure 3.1: Left: a simplified architecture of a hypothetical e-commerce website and inter-process communications involved in executing a single request from a client. Right: a visualization of a single request execution as a Gantt chart.

Request correlation

The basic concept of distributed tracing appears to be very straightforward:

  • Instrumentation is inserted into chosen points of the program's code (trace points) and produces profiling data when executed
  • The profiling data is collected in a central location, correlated to the specific execution (request), arranged in the causality order, and combined into a trace that can be visualized or further analyzed

Of course, things are rarely as simple as they appear. There are multiple design decisions taken by the existing tracing systems, affecting how these systems perform, how difficult they are to integrate into existing distributed applications, and even what kinds of problems they can or cannot help to solve.

The ability to collect and correlate profiling data for a given execution or request initiator, and identify causally-related activities, is arguably the most distinctive feature of distributed tracing, setting it apart from all other profiling and observability tools. Different classes of solutions have been proposed in the industry and academia to address the correlation problem. Here, we will discuss the three most common approaches: black-box inference, domain-specific schemas, and metadata propagation.

Black-box inference

Techniques that do not require modifying the monitored system are known as black-box monitoring. Several tracing infrastructures have been proposed that use statistical analysis or machine learning (for example, the Mystery Machine [2]) to infer causality and request correlation by consuming only the records of the events occurring in the programs, most often by reading their logs. These techniques are attractive because they do not require modifications to the traced applications, but they have difficulties attributing causality in the general case of highly concurrent and asynchronous executions, such as those observed in event-driven systems. Their reliance on "big data" processing also makes them more expensive and higher latency compared to the other methods.

Schema-based

Magpie [3] proposed a technique that relied on manually-written, application-specific event schemas that allowed it to extract causality relationships from the event logs of production systems. Similar to the black-box approach, this technique does not require the applications to be instrumented explicitly; however, it is less general, as each application requires its own schemas.

This approach is not particularly suitable for modern distributed systems that consist of hundreds of microservices because it would be difficult to scale the manual creation of event schemas. The schema-based technique requires all events to be collected before the causality inference can be applied, so it is less scalable than other methods that allow sampling.

Metadata propagation

What if the instrumentation trace points could annotate the data they produce with a global identifier – let's call it an execution identifier – that is unique for each traced request? Then the tracing infrastructure receiving the annotated profiling data could easily reconstruct the full execution of the request, by grouping the records by the execution identifier. So, how do the trace points know which request is being executed when they are invoked, especially trace points in different components of a distributed application? The global execution identifier needs to be passed along the execution flow. This is achieved via a process known as metadata propagation or distributed context propagation.

Figure 3.2: Propagating the execution identifier as request metadata. The first service in the architecture (client) creates a unique execution identifier (Request ID) and passes it to the next service via metadata/context. The remaining services keep passing it along in the same way.

Metadata propagation in a distributed system consists of two parts: in-process and inter-process propagation. In-process propagation is responsible for making the metadata available to trace points inside a given program. It needs to be able to carry the context between the inbound and outbound network calls, dealing with possible thread switches or asynchronous behavior, which are common in modern applications. Inter-process propagation is responsible for transferring metadata over network calls when components of a distributed system communicate to each other during the execution of a given request.

Inter-process propagation is typically done by decorating communication frameworks with special tracing middleware that encodes metadata in the network messages, for example, in HTTP headers, Kafka record headers, and so on.
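
In OpenTracing-based instrumentation, for example, this is typically expressed via the tracer's Inject and Extract operations. The following is a minimal sketch for HTTP; the operation name and error handling are simplified for illustration:

import (
    "net/http"

    "github.com/opentracing/opentracing-go"
)

// Client side: inject the current span's metadata into the outbound HTTP headers.
func injectHeaders(span opentracing.Span, req *http.Request) {
    // The error is ignored here for brevity; real code should log it.
    _ = opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    )
}

// Server side: extract the metadata and start a span for the inbound request.
func extractAndStartSpan(req *http.Request) opentracing.Span {
    tracer := opentracing.GlobalTracer()
    carrier := opentracing.HTTPHeadersCarrier(req.Header)
    var opts []opentracing.StartSpanOption
    if parentCtx, err := tracer.Extract(opentracing.HTTPHeaders, carrier); err == nil {
        opts = append(opts, opentracing.ChildOf(parentCtx))
    }
    return tracer.StartSpan("handle-request", opts...)
}

In-process, the span started on the server side is then carried between the inbound handler and any outbound calls in a context object (in Go, a context.Context), which is the mechanism illustrated in Figure 3.3.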

Figure 3.3: Metadata propagation in a single service. (1) The Handler that processes the inbound request is wrapped into instrumentation that extracts metadata from the request and stores it in a Context object in memory. (2) Some in-process propagation mechanism, for example, based on thread-local variables. (3) Instrumentation wraps an RPC client and injects metadata into outbound (downstream) requests.

The key disadvantage of metadata propagation-based tracing is the expectation of a white-box system whose components can be modified accordingly. However, it is more scalable and provides much higher accuracy of the data compared to black-box techniques, since all trace points explicitly annotate the data with execution identifiers. In many programming languages, it is even possible to inject trace points automatically, without changes to the application itself, through a technique known as agent-based instrumentation (we will discuss this in more detail in Chapter 6, Tracing Standards and Ecosystem). Distributed tracing based on metadata propagation is by far the most popular approach and is used by virtually all industrial-grade tracing systems today, both commercial and open source. Throughout the rest of this book, we will focus exclusively on this type of tracing system. In Chapter 6, Tracing Standards and Ecosystem, we will see how new industry initiatives, such as the OpenTracing project [11], aim to reduce the cost of white-box instrumentation and make distributed tracing a standard practice in the development of modern cloud-native applications.

An astute reader may have noticed that the notion of propagating metadata alongside request execution is not limited to only passing the execution identifier for tracing purposes. Metadata propagation can be thought of as a prerequisite for distributed tracing, or distributed tracing can be thought of as an application built on top of distributed context propagation. In Chapter 10, Distributed Context Propagation, we will discuss a variety of other possible applications.

Black-box inference

Techniques that do not require modifying the monitored system are known as black-box monitoring. Several tracing infrastructures have been proposed that use statistical analysis or machine learning (for example, the Mystery Machine [2]) to infer causality and request correlation by consuming only the records of the events occurring in the programs, most often by reading their logs. These techniques are attractive because they do not require modifications to the traced applications, but they have difficulties attributing causality in the general case of highly concurrent and asynchronous executions, such as those observed in event-driven systems. Their reliance on "big data" processing also makes them more expensive and higher latency compared to the other methods.

Schema-based

Magpie [3] proposed a technique that relied on manually-written, application-specific event schemas that allowed it to extract causality relationships from the event logs of production systems. Similar to the black-box approach, this technique does not require the applications to be instrumented explicitly; however, it is less general, as each application requires its own schemas.

This approach is not particularly suitable for modern distributed systems that consist of hundreds of microservices because it would be difficult to scale the manual creation of event schemas. The schema-based technique requires all events to be collected before the causality inference can be applied, so it is less scalable than other methods that allow sampling.

Metadata propagation

What if the instrumentation trace points could annotate the data they produce with a global identifier – let's call it an execution identifier – that is unique for each traced request? Then the tracing infrastructure receiving the annotated profiling data could easily reconstruct the full execution of the request, by grouping the records by the execution identifier. So, how do the trace points know which request is being executed when they are invoked, especially trace points in different components of a distributed application? The global execution identifier needs to be passed along the execution flow. This is achieved via a process known as metadata propagation or distributed context propagation.

Metadata propagation

Figure 3.2: Propagating the execution identifier as request metadata. The first service in the architecture (client) creates a unique execution identifier (Request ID) and passes it to the next service via metadata/context. The remaining services keep passing it along in the same way.

Metadata propagation in a distributed system consists of two parts: in-process and inter-process propagation. In-process propagation is responsible for making the metadata available to trace points inside a given program. It needs to be able to carry the context between the inbound and outbound network calls, dealing with possible thread switches or asynchronous behavior, which are common in modern applications. Inter-process propagation is responsible for transferring metadata over network calls when components of a distributed system communicate to each other during the execution of a given request.

Inter-process propagation is typically done by decorating communication frameworks with special tracing middleware that encodes metadata in the network messages, for example, in HTTP headers, Kafka records headers, and so on.

Metadata propagation

Figure 3.3: Metadata propagation in a single service. (1) The Handler that processes the inbound request is wrapped into instrumentation that extracts metadata from the request and stores it in a Context object in memory. (2) Some in-process propagation mechanism, for example, based on thread-local variables. (3) Instrumentation wraps an RPC client and injects metadata into outbound (downstream) requests.

The key disadvantage of metadata propagation-based tracing is the expectation of a white-box system whose components can be modified accordingly. However, it is more scalable and provides much higher accuracy of the data compared to black-box techniques, since all trace points are explicitly annotating the data with execution identifiers. In many programming languages, it is even possible to inject trace points automatically, without changes to the application itself, through a technique known as agent-based instrumentation (we will discuss this in more detail in Chapter 6, Tracing Standards and Ecosystem). Distributed tracing based on metadata propagation is by far the most popular approach and used by virtually all industrial-grade tracing systems today, both commercial and open source. Throughout the rest of this book, we will focus exclusively on this type of tracing systems. In Chapter 6, Tracing Standards and Ecosystem, we will see how new industry initiatives, such as the OpenTracing project [11], aim to reduce the cost of white-box instrumentation and make distributed tracing a standard practice in the development of modern cloud-native applications.

An acute reader may have noticed that the notion of propagating metadata alongside request execution is not limited to only passing the execution identifier for tracing purposes. Metadata propagation can be thought of as a prerequisite for distributed tracing, or distributed tracing can be thought of as an application built on top of distributed context propagation. In Chapter 10, Distributed Context Propagation we will discuss a variety of other possible applications.

Schema-based

Magpie [3] proposed a technique that relied on manually-written, application-specific event schemas that allowed it to extract causality relationships from the event logs of production systems. Similar to the black-box approach, this technique does not require the applications to be instrumented explicitly; however, it is less general, as each application requires its own schemas.

This approach is not particularly suitable for modern distributed systems that consist of hundreds of microservices because it would be difficult to scale the manual creation of event schemas. The schema-based technique requires all events to be collected before the causality inference can be applied, so it is less scalable than other methods that allow sampling.

Metadata propagation

What if the instrumentation trace points could annotate the data they produce with a global identifier – let's call it an execution identifier – that is unique for each traced request? Then the tracing infrastructure receiving the annotated profiling data could easily reconstruct the full execution of the request, by grouping the records by the execution identifier. So, how do the trace points know which request is being executed when they are invoked, especially trace points in different components of a distributed application? The global execution identifier needs to be passed along the execution flow. This is achieved via a process known as metadata propagation or distributed context propagation.

Metadata propagation

Figure 3.2: Propagating the execution identifier as request metadata. The first service in the architecture (client) creates a unique execution identifier (Request ID) and passes it to the next service via metadata/context. The remaining services keep passing it along in the same way.

Metadata propagation in a distributed system consists of two parts: in-process and inter-process propagation. In-process propagation is responsible for making the metadata available to trace points inside a given program. It needs to be able to carry the context between the inbound and outbound network calls, dealing with possible thread switches or asynchronous behavior, which are common in modern applications. Inter-process propagation is responsible for transferring metadata over network calls when components of a distributed system communicate to each other during the execution of a given request.

Inter-process propagation is typically done by decorating communication frameworks with special tracing middleware that encodes metadata in the network messages, for example, in HTTP headers, Kafka record headers, and so on.
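
The sketch below illustrates both halves of this mechanism, under the assumption of a generic HTTP-like framework: a contextvars variable provides in-process propagation, and hypothetical extract/inject helpers read and write an illustrative x-execution-id header (real systems use headers such as the W3C traceparent or uber-trace-id):

```python
import contextvars
import uuid

# In-process propagation: a context variable carries the execution identifier
# across function calls (and across awaits in asyncio) within one service.
_execution_id = contextvars.ContextVar("execution_id", default=None)

# The header name is illustrative; real systems use headers such as
# X-B3-TraceId, uber-trace-id, or the W3C "traceparent" header.
TRACE_HEADER = "x-execution-id"

def handle_inbound_request(headers: dict) -> None:
    """Server-side middleware: extract the metadata or start a new execution."""
    execution_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _execution_id.set(execution_id)

def prepare_outbound_request(headers: dict) -> dict:
    """Client-side middleware: inject the current metadata into the outbound call."""
    execution_id = _execution_id.get()
    if execution_id is not None:
        headers = {**headers, TRACE_HEADER: execution_id}
    return headers

# Usage: an inbound request with no trace header starts a new execution,
# and the identifier is then injected into the downstream call.
handle_inbound_request({})
print(prepare_outbound_request({"accept": "application/json"}))
```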

Figure 3.3: Metadata propagation in a single service. (1) The Handler that processes the inbound request is wrapped into instrumentation that extracts metadata from the request and stores it in a Context object in memory. (2) Some in-process propagation mechanism, for example, based on thread-local variables. (3) Instrumentation wraps an RPC client and injects metadata into outbound (downstream) requests.

The key disadvantage of metadata propagation-based tracing is the expectation of a white-box system whose components can be modified accordingly. However, it is more scalable and provides much higher data accuracy than black-box techniques, since all trace points explicitly annotate the data with execution identifiers. In many programming languages, it is even possible to inject trace points automatically, without changes to the application itself, through a technique known as agent-based instrumentation (we will discuss this in more detail in Chapter 6, Tracing Standards and Ecosystem). Distributed tracing based on metadata propagation is by far the most popular approach and is used by virtually all industrial-grade tracing systems today, both commercial and open source. Throughout the rest of this book, we will focus exclusively on this type of tracing system. In Chapter 6, Tracing Standards and Ecosystem, we will see how new industry initiatives, such as the OpenTracing project [11], aim to reduce the cost of white-box instrumentation and make distributed tracing a standard practice in the development of modern cloud-native applications.

An astute reader may have noticed that the notion of propagating metadata alongside request execution is not limited to only passing the execution identifier for tracing purposes. Metadata propagation can be thought of as a prerequisite for distributed tracing, or distributed tracing can be thought of as an application built on top of distributed context propagation. In Chapter 10, Distributed Context Propagation, we will discuss a variety of other possible applications.

Anatomy of distributed tracing

The following diagram shows a typical organization of distributed tracing systems, built around metadata propagation. The microservices or components of a distributed application are instrumented with trace points that observe the execution of a request. The trace points record causality and profiling information about the request and pass it to the tracing system through calls to a Tracing API, which may depend on the specific tracing backend or be vendor neutral, like the OpenTracing API [11] that we will discuss in Chapter 4, Instrumentation Basics with OpenTracing.

Figure 3.4: Anatomy of distributed tracing

Special trace points at the edges of the microservice, which we can call inject and extract trace points, are also responsible for encoding and decoding metadata for passing it across process boundaries. In certain cases, the inject/extract trace points are used even between libraries and components, for example, when Python code makes a call to an extension written in C, which may not have direct access to the metadata represented in a Python data structure.

The Tracing API is implemented by a concrete tracing library that reports the collected data to the tracing backend, usually with some in-memory batching to reduce the communication overhead. Reporting is always done asynchronously in the background, off the critical path of the business requests. The tracing backend receives the tracing data, normalizes it to a common trace model representation, and saves it in persistent trace storage. Because tracing data for a single request usually arrives from many different hosts, the trace storage is often organized to store individual pieces incrementally, indexed by the execution identifier. This allows for later reconstruction of the whole trace for the purpose of visualization, or for additional processing through aggregations and data mining.
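
A minimal sketch of such an asynchronous, batching reporter might look as follows; this is not any particular tracer's implementation, only an illustration of keeping reporting off the critical path of the request:

```python
import queue
import threading
import time

class BatchReporter:
    """Buffers finished spans in memory and ships them in a background
    thread, off the critical path of business requests (simplified sketch)."""

    def __init__(self, flush_size: int = 100, flush_interval: float = 1.0):
        self._queue = queue.Queue()
        self._flush_size = flush_size
        self._flush_interval = flush_interval
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, span: dict) -> None:
        # Called from application threads; enqueueing is cheap and non-blocking.
        self._queue.put(span)

    def _run(self) -> None:
        batch = []
        while True:
            try:
                batch.append(self._queue.get(timeout=self._flush_interval))
            except queue.Empty:
                pass
            if batch and (len(batch) >= self._flush_size or self._queue.empty()):
                self._send(batch)
                batch = []

    def _send(self, batch: list) -> None:
        # In a real tracer this would send the batch to a collector endpoint.
        print(f"flushing {len(batch)} span(s) to the tracing backend")

reporter = BatchReporter(flush_size=2, flush_interval=0.1)
reporter.report({"operation": "GET /users", "duration_ms": 12})
reporter.report({"operation": "db-query", "duration_ms": 3})
time.sleep(0.5)   # give the daemon thread a chance to flush before exiting
```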

Sampling

Sampling affects which records produced by the trace points are captured by the tracing infrastructure. It is used to control the volume of data the tracing backend needs to store, as well as the performance overhead and impact on the applications from executing tracing instrumentation. We discuss sampling in detail in Chapter 8, All About Sampling.
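
For a flavor of the simplest strategy, here is a sketch of a head-based probabilistic sampler, where the decision is made once at the root of the trace and then propagated with the metadata so that all services agree on whether to record the request; real tracers offer several strategies, covered in Chapter 8:

```python
import random

class ProbabilisticSampler:
    """Keeps roughly `rate` of all traces (illustrative sketch)."""

    def __init__(self, rate: float):
        self.rate = rate

    def is_sampled(self) -> bool:
        # The result of this coin flip travels with the request metadata,
        # so downstream services do not flip their own coins.
        return random.random() < self.rate

sampler = ProbabilisticSampler(rate=0.001)  # keep ~0.1% of traces
print(sampler.is_sampled())
```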

Preserving causality

If we only pass the execution identifier as request metadata and tag tracing records with it, it is sufficient to reassemble that data into a single collection, but it is not sufficient to reconstruct the execution graph of causally-related activities. Tracing systems need to capture causality that allows assembling the data captured by the trace points in the correct sequence. Unfortunately, knowing which activities are truly causally-related is very difficult, even with very invasive instrumentation. Most tracing systems elect to preserve Lamport's happens-before relation [4], denoted as →, and formally defined as the least strict partial order on events, such that:

  • If events a and b occur in the same process, then a → b if the occurrence of event a preceded the occurrence of event b
  • If event a is the sending of a message and event b is the reception of the message sent in event a, then a → b

The happens-before relation can be too indiscriminate if applied liberally: "may have influenced" is not the same as "has influenced." Tracing infrastructures rely on additional domain knowledge about the systems being traced, and about the execution environment, to avoid capturing irrelevant causality. By threading the metadata along the individual executions, they establish the relationships between items with the same or related metadata (that is, metadata containing different trace point IDs but the same execution ID). The metadata can be static or dynamic throughout the execution.

Tracing infrastructures that use static metadata, such as a single unique execution identifier, throughout the life cycle of a request, must capture additional clues via trace points, in order to establish the happens-before relationships between the events. For example, if part of an execution is performed on a single thread, then using the local timestamps allows correct ordering of the events. Alternatively, in a client-server communication, the tracing system may infer that the sending of a network message by the client happens before the server receiving that message. Similar to black-box inference systems, this approach cannot always identify causality between events when additional clues are lost or not available from the instrumentation. It can, however, guarantee that all events for a given execution will be correctly identified.

Most of today's industrial-grade tracing infrastructures use dynamic metadata, which can be fixed-width or variable-width. For example, X-Trace [5], Dapper [6], and many similar tracing systems use fixed-width dynamic metadata, where, in addition to the execution identifier, they record a unique ID (for example, a random 64-bit value) of the event captured by the trace point. When the next trace point is executed, it stores the inbound event ID as part of its tracing data, and replaces it with its own ID.

In the following diagram, we see five trace points causally linked to a single execution. The metadata propagated after each trace point is a three-part tuple (execution ID, event ID, and parent ID). Each trace point stores the parent event ID from inbound metadata as part of its captured trace record. The fork at trace point b and join at trace point e illustrate how causal relationships forming a directed acyclic graph can be captured using this scheme.
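
A sketch of this scheme might look as follows; the record fields and the trace_point helper are hypothetical, but storing the inbound event ID as the parent and propagating a freshly generated event ID mirrors the description above:

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Metadata:
    execution_id: str            # constant for the entire trace
    event_id: int                # ID of the most recent trace point
    parent_id: Optional[int]     # ID of the trace point before that

trace_records = []

def trace_point(name: str, inbound: Metadata) -> Metadata:
    """Record an event and return the updated, fixed-width metadata."""
    my_event_id = random.getrandbits(64)
    trace_records.append({
        "execution_id": inbound.execution_id,
        "event_id": my_event_id,
        "parent_id": inbound.event_id,   # captures the happens-before edge
        "name": name,
    })
    return Metadata(inbound.execution_id, my_event_id, inbound.event_id)

# A fork at trace point b: both branches receive b's event ID as their parent.
root = Metadata("req-42", event_id=random.getrandbits(64), parent_id=None)
a = trace_point("a", root)
b = trace_point("b", a)
c = trace_point("c", b)   # branch 1 of the fork
d = trace_point("d", b)   # branch 2 of the fork
```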

Figure 3.5: Establishing causal relationships using dynamic, fixed-width metadata

Using fixed-width dynamic metadata, the tracing infrastructure can explicitly record happens-before relationships between trace events, which gives it an edge over the static metadata approach. However, it is also somewhat brittle: if some of the trace records are lost, the system may no longer be able to order the remaining events according to causality.

Some tracing systems use a variation of the fixed-width approach by introducing the notion of a trace segment, which is represented by another unique ID that is constant within a single process, and only changes when metadata is sent over the network to another process. It reduces the brittleness slightly, by making the system more tolerant to the loss of trace records within a single process, in particular when the tracing infrastructure is proactively trying to reduce the volume of trace data to control the overhead, by keeping the trace points only at the edges of a process and discarding all the internal trace points.

When using end-to-end tracing on distributed systems, where profiling data loss is a constant factor, some tracing infrastructures, for example, Azure Application Insights, use variable-width dynamic metadata, which grows as the execution travels further down the call graph from the request origin.

The following diagram illustrates this approach, where each next event ID is generated by appending a sequence number to the previous event ID. When a fork happens at event 1, two distinct sequence numbers are used to represent parallel events 1.1 and 1.2. The benefit of this scheme is higher tolerance to data loss; for example, if the record for event 1.2 is lost, it is still possible to infer the happens-before relationship 1 → 1.2.1.
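
The following sketch illustrates the idea with hierarchical IDs built by appending sequence numbers; it is not the actual format used by any specific product, only a demonstration of why a prefix relationship survives the loss of intermediate records:

```python
import itertools

class EventId:
    """Hierarchical, variable-width event identifiers: each child ID is formed
    by appending a sequence number to its parent's ID (for example, 1 -> 1.1, 1.2)."""

    def __init__(self, parts=(1,)):
        self.parts = tuple(parts)
        self._children = itertools.count(1)

    def next_child(self) -> "EventId":
        return EventId(self.parts + (next(self._children),))

    def happened_before(self, other: "EventId") -> bool:
        # A prefix relationship implies causality, even if intermediate
        # records (for example, 1.2) were lost.
        return other.parts[:len(self.parts)] == self.parts and other.parts != self.parts

    def __str__(self) -> str:
        return ".".join(str(p) for p in self.parts)

root = EventId()                                       # event 1
left, right = root.next_child(), root.next_child()     # fork: 1.1 and 1.2
grandchild = right.next_child()                        # 1.2.1
print(root.happened_before(grandchild))                # True, even if 1.2 is lost
```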

Figure 3.6: Establishing causal relationships using dynamic, variable-width metadata

Inter-request causality

Sambasivan and others [10] argue that another critical architectural decision that significantly affects the types of problems an end-to-end tracing infrastructure is able to address is how it attributes latent work. For example, a request may write data to a memory buffer that is flushed to disk at a later time, after the originating request has been completed. Such buffers are commonly implemented for performance reasons, and at the time the buffer is written out, it may contain data produced by many different requests. The question is: who is responsible for the use of resources and the time spent by the system on writing the buffer out?

The work can be attributed to the last request that made the buffer full and caused the write (trigger-preserving attribution), or it can be attributed proportionally to all requests that produced the data into the buffer before the flush (submitter-preserving attribution). Trigger-preserving attribution is easier to implement because it does not require access to the instrumentation data about the earlier executions that affected the latent work.

However, it disproportionately penalizes the last request, especially if the tracing infrastructure is used for monitoring and attributing resource consumption. Submitter-preserving attribution is fair in that regard, but it requires that the profiling data for all previous executions is available when the latent work happens. This can be quite expensive and does not work well with some forms of sampling usually applied by the tracing infrastructure (we will discuss sampling in Chapter 8, All About Sampling). The sketch after this paragraph contrasts the two policies on a toy buffer.
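
The toy buffer below contrasts the two attribution policies; the numbers and request names are purely illustrative:

```python
from collections import defaultdict

class WriteBuffer:
    """Toy buffer illustrating the two attribution policies for latent work."""

    def __init__(self):
        self.pending = []   # (request_id, bytes) pairs awaiting a flush

    def write(self, request_id: str, nbytes: int) -> None:
        self.pending.append((request_id, nbytes))

    def flush(self, flush_cost_ms: float) -> dict:
        total = sum(n for _, n in self.pending)
        # Trigger-preserving: bill the whole flush to the request that caused it.
        trigger = {self.pending[-1][0]: flush_cost_ms}
        # Submitter-preserving: split the cost proportionally to contributed bytes,
        # which requires remembering every contributing execution.
        submitter = defaultdict(float)
        for request_id, nbytes in self.pending:
            submitter[request_id] += flush_cost_ms * nbytes / total
        self.pending.clear()
        return {"trigger": trigger, "submitter": dict(submitter)}

buf = WriteBuffer()
buf.write("req-1", 800)
buf.write("req-2", 200)   # req-2 fills the buffer and triggers the flush
print(buf.flush(flush_cost_ms=10.0))
```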

Trace models

In Figure 3.4, we saw a component called "Collection/Normalization." The purpose of this component is to receive tracing data from the trace points in the applications and convert it to some normalized trace model, before saving it in the trace storage. Aside from the usual architectural advantages of having a façade on top of the trace storage, the normalization is especially important when we are faced with the diversity of instrumentation. It is quite common for many production environments to be using numerous versions of instrumentation libraries, from very recent ones to some that are several years old. It is also common for those versions to capture trace data in very different formats and models, both physical and conceptual. The normalization layer acts as an equalizer and translates all those varieties into a single logical trace model, which can later be uniformly processed by the trace visualization and analysis tools. In this section, we will focus on two of the most popular conceptual trace models: the event model and the span model.

Event model

So far, we have discussed tracing instrumentation taking the form of trace points that record events when the request execution passes through them. An event represents a single point in time in the end-to-end execution. Assuming that we also record the happens-before relationships between these events, we intuitively arrive at the model of a trace as a directed acyclic graph, with nodes representing the events and edges representing the causality.

Some tracing systems (for example, X-Trace [5]) use such an event model as the final form of the traces they surface to the user. The diagram in Figure 3.7 illustrates an event graph observed from the execution of an RPC request/response by a client-server application. It includes events collected at different layers of the stack, from application-level events (for example, "client send" and "server receive") to events in the TCP/IP stack.

The graph contains multiple forks used to model request execution at different layers, and multiple joins where these logical parallel executions converge to higher-level layers. Many developers find the event model difficult to work with because it is too low level and obscures useful higher-level primitives. For example, it is natural for the developer of the client application to think of the RPC request as a single operation that has start (client send) and end (client receive) events. However, in the event graph these two nodes are far apart.

Figure 3.7: Trace representation of an RPC request between client and server in the event model, with trace events recorded at application and TCP/IP layers

The next diagram (Figure 3.8) shows an even more extreme example, where a fairly simple workflow becomes hard to decipher when represented as an event graph. A frontend Spring application running on Tomcat is calling another application called remotesrv, which is running on JBoss. The remotesrv application is making two calls to a PostgreSQL database.

It is easy to notice that, aside from the "info" events shown in boxes with rounded corners, all other records come in pairs of entry and exit events. The info events are interesting in that they look almost like noise: they most likely contain useful information if we had to troubleshoot this particular workflow, but they do not add much to our understanding of the shape of the workflow itself. We can think of them as info logs, only captured via trace points. We also see an example of a fork and a join, because the info event from tomcat-jbossclient happens in parallel with the execution happening in the remotesrv application.

Figure 3.8: Event model-based graph of an RPC request between a Spring application running on Tomcat and a remotesrv application running on JBoss, and talking to a PostgreSQL database. The boxes with rounded corners represent simple point-in-time "info" events.

Span model

Having observed that, as in the preceding example, most execution graphs include well-defined pairs of entry/exit events representing certain operations performed by the application, Sigelman and others [6] proposed a simplified trace model, which made the trace graphs much easier to understand. In Dapper [6], which was designed for Google's RPC-heavy architecture, the traces are represented as trees, where tree nodes are basic units of work referred to as spans. The edges in the tree, as usual, indicate causal relationships between a span and its parent span. Each span is a simple log of timestamped records, including its start and end time, a human-readable operation name, and zero or more intermediary application-specific annotations in the form of (timestamp, description) pairs, which are equivalent to the info events in the previous example.

Figure 3.9: Using the span model to represent the same RPC execution as in Figure 3.8. Left: the resulting trace as a tree of spans. Right: the same trace shown as a Gantt chart. The info events are no longer included as separate nodes in the graph; instead they are modeled as timestamped annotations in the spans, shown as pills in the Gantt chart.

Each span is assigned a unique ID (for example, a random 64-bit value), which is propagated via metadata along with the execution ID. When a new span is started, it records the ID of the previous span as its parent ID, thus capturing the causality. In the preceding example, the remote server represents its main operation in the span with ID=6. When it makes a call to the database, it starts another span with ID=7 and parent ID=6.
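
A minimal span data structure capturing these ideas might look like the following sketch; the field names are illustrative and simplified compared to real tracers:

```python
import random
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Span:
    """A basic unit of work in the span model (simplified sketch)."""
    operation_name: str
    trace_id: int                       # the execution identifier
    span_id: int
    parent_id: Optional[int] = None     # captures causality: one parent per span
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    # Timestamped annotations play the role of the "info" events above.
    annotations: List[Tuple[float, str]] = field(default_factory=list)

    def annotate(self, description: str) -> None:
        self.annotations.append((time.time(), description))

    def start_child(self, operation_name: str) -> "Span":
        # The child records this span's ID as its parent ID.
        return Span(operation_name, self.trace_id, random.getrandbits(64),
                    parent_id=self.span_id)

    def finish(self) -> None:
        self.end_time = time.time()

# The remote server's main operation, and a child span for the database call.
server_span = Span("handle-request", trace_id=42, span_id=6)
db_span = server_span.start_child("db-query")   # new span_id, parent_id=6
db_span.annotate("acquired connection from pool")
db_span.finish()
server_span.finish()
```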

Dapper originally advocated for the model of multi-server spans, where a client application that makes an RPC call creates a new span ID and passes it as part of the call, and the server that receives the RPC logs its events using the same span ID. Unlike in the preceding figure, the multi-server span model resulted in fewer spans in the tree because each RPC call is represented by only one span, even though two services are involved in doing the work as part of that RPC. This multi-server span model was used by other tracing systems, such as Zipkin [7] (where spans were often called shared spans). It was later discovered that this model unnecessarily complicates the post-collection trace processing and analysis, so newer tracing systems like Jaeger [8] opted for a single-host span model, in which an RPC call is represented by two separate spans: one on the client and another on the server, with the client span being the parent.

The tree-like span model is easy to understand for programmers, whether they are instrumenting their applications or retrieving the traces from the tracing system for analysis. Because each span has only one parent, the causality is represented with a simple call-stack type view of the computation that is easy to implement and to reason about.

Effectively, traces in this model look like distributed stack traces, a concept very intuitive to all developers. This makes the span model the most popular trace model in the industry, supported by the majority of tracing infrastructures. Even tracing systems that collect instrumentation data in the form of single point-in-time events (for example, Canopy [9]) go to the extra effort of converting trace events into something very similar to the span model. Canopy's authors claim that "events are an inappropriate abstraction to expose to engineers adding instrumentation to systems," and propose another representation they call a modeled trace, which describes the requests in terms of execution units, blocks, points, and edges.

The original span model introduced in Dapper was only able to represent executions as trees. It struggled to represent other execution models, such as queues, asynchronous executions, and multi-parent causality (forks and joins). Canopy works around that by allowing instrumentation to record edges for non-obvious causal relationships between points. The OpenTracing API, on the other hand, sticks with the classic, simpler span model but allows spans to contain multiple "references" to other spans, in order to support joins and asynchronous execution.
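
Assuming the opentracing Python package is installed, such a reference can be expressed roughly as follows; a no-op tracer stands in here for a real implementation such as Jaeger:

```python
import opentracing
from opentracing import child_of, follows_from

# The base Tracer is a no-op implementation, useful for illustration.
tracer = opentracing.Tracer()

producer_span = tracer.start_span("enqueue-message")
producer_span.finish()

# The consumer's span is causally linked to the producer, but the producer
# does not wait for it: a follows_from reference instead of child_of.
consumer_span = tracer.start_span(
    "process-message",
    references=[follows_from(producer_span.context)],
)
consumer_span.finish()

# A join: a span may carry more than one reference to other spans.
other_span = tracer.start_span("other-input")
other_span.finish()
join_span = tracer.start_span(
    "join",
    references=[child_of(producer_span.context),
                follows_from(other_span.context)],
)
join_span.finish()
```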

Clock skew adjustment

Anyone working with distributed systems programming knows that there is no such thing as accurate time. Each computer has a hardware clock built in, but those clocks tend to drift, and even using synchronization protocols like NTP can only get the servers maybe within a millisecond of each other. Yet we have seen that end-to-end tracing instrumentation captures the timestamp with most tracing events. How can we trust those timestamps?

Clearly, we cannot trust the timestamps to be actually correct, but this is not what we often look for when we analyze distributed traces. It is more important that timestamps in the trace are correctly aligned relative to each other. When the timestamps are from the same process, such as the start of the server span and the extra info annotations in the following diagram, we can assume that their relative positions are correct. The timestamps from different processes on the same host are generally incomparable because even though they are not subject to the hardware clock skew, the accuracy of the timestamps depends on many other factors, such as what programming language is used for a given process and what time libraries it is using and how. The timestamps from different servers are definitely incomparable due to hardware clock drifts, but we can do something about that.

Figure 3.10: Clock skew adjustment. When we know the causality relationships between the events, such as "client-send must happen before server-receive", we can consistently adjust the timestamps for one of the two services, to make sure that the causality constraints are satisfied. The annotations within the span do not need to be adjusted, since we can assume their timestamps to be accurate relative to the beginning and end timestamps of the span.

Consider the client and server spans in the top diagram of Figure 3.10. Let's assume that we know from the instrumentation that this was a blocking RPC request, that is, the server could not have received the request before the client sent it, and the client could not have received the response before the server finished the execution (this reasoning only works if the client span is longer than the server span, which is not always the case). These basic causality rules allow us to detect that the server span is misaligned on the timeline based on its reported timestamps, as we can see in the example. However, we don't know by how much it is misaligned.

We can adjust the timestamps for all events originating from the server process by shifting them to the left until the span's start and end events fall within the time range of the larger client span, as shown at the bottom of the diagram. After this adjustment, we end up with two unknowns: the network delay of the request, t1, and the network delay of the response, t2. If there are no more occurrences of client and server interaction in the given trace, and no additional causality information, we can make an arbitrary decision on how to set these values, for example, by positioning the server span exactly in the middle of the client span:

t1 = t2 = (duration of the client span - duration of the server span) / 2

The values of t1 and t2 calculated this way provide us with an estimate of the time spent by the RPC in network communication. We are making an arbitrary assumption that both the request and the response took roughly the same time to be transmitted over the network. In other cases, we may have additional causality information from the trace; for example, the server may have called a database, and then another node in the trace graph called the same database server. That gives us two sets of constraints on the possible clock skew adjustment of the database spans. For example, from the first parent we may want to adjust the database span by -2.5 ms and from the second parent by -5.5 ms. Since it is the same database server, we only need one adjustment to its clock skew, and we can try to find a value that works for both calling nodes (maybe it is -3.5 ms), even though the child spans may not end up exactly in the middle of the parent spans, as we arbitrarily did in the preceding formula.
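
For the single client/server case above, the adjustment can be computed directly, as in this small sketch (the timestamps are illustrative, in milliseconds):

```python
def center_server_span(client_start, client_end, server_start, server_end):
    """Return the clock skew adjustment and implied per-leg network delay
    that place the server span exactly in the middle of the client span."""
    client_duration = client_end - client_start
    server_duration = server_end - server_start
    # Assume the request and the response took equally long on the network.
    network_delay = (client_duration - server_duration) / 2.0
    # Shift all server-side timestamps by this amount to align the spans.
    skew_adjustment = (client_start + network_delay) - server_start
    return skew_adjustment, network_delay

# The server clock runs ahead: its reported span falls outside the client span.
skew, t = center_server_span(100.0, 120.0, 115.0, 125.0)
print(skew, t)   # -10.0, 5.0: shift server events 10 ms left; ~5 ms per network leg
```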

In general, we can walk the trace and collect a large number of such constraints. Then we can solve them as a set of linear equations to obtain a full set of clock skew adjustments that we can apply to the trace to align the spans.

In the end, the clock skew adjustment process is always a heuristic, since we typically do not have other reliable signals to calculate it precisely. There are scenarios where this heuristic goes wrong and the resulting trace views make little sense to the users. Therefore, tracing systems are advised to provide both adjusted and unadjusted views of the traces, as well as to clearly indicate when the adjustments have been applied.

Trace analysis

Once the trace records are collected and normalized by the tracing infrastructure, they can be used for analysis, using visualizations or data mining algorithms. We will cover some of the data mining techniques in Chapter 12, Gathering Insights with Data Mining.

Tracing system implementers are always looking for new creative visualizations of the data, and end users often build their own views based on specific features they are looking for. Some of the most popular and easy-to-implement views include Gantt charts, service graphs, and request flow graphs.

We have seen examples of Gantt charts in this chapter. Gantt charts are mostly used to visualize individual traces. The x axis shows relative time, usually from the beginning of the request, and the y axis represents the different layers and components of the architecture participating in the execution of the request. Gantt charts are good for analyzing the latency of requests, as they easily show which spans in the trace take the longest time, and, combined with critical path analysis, they can zoom in on problematic areas. The overall shape of the chart can reveal other performance problems at a glance, such as a lack of parallelism among sub-requests or unexpected synchronization/blocking.

Service graphs are constructed from a large corpus of traces. Fan-outs from a node indicate calls to other components. This visualization can be used for analysis of service dependencies in large microservices-based applications. The edges can be decorated with additional information, such as the frequency of calls between two given components in the corpus of traces.
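
A sketch of how such a graph can be aggregated from a corpus of spans in the single-host span model (the span fields here are illustrative):

```python
from collections import Counter

# A corpus of spans in the single-host span model; fields are illustrative.
spans = [
    {"span_id": 1, "parent_id": None, "service": "frontend"},
    {"span_id": 2, "parent_id": 1, "service": "orders"},
    {"span_id": 3, "parent_id": 2, "service": "postgres"},
    {"span_id": 4, "parent_id": 1, "service": "orders"},
]

by_id = {span["span_id"]: span for span in spans}
edges = Counter()
for span in spans:
    parent = by_id.get(span["parent_id"])
    if parent and parent["service"] != span["service"]:
        edges[(parent["service"], span["service"])] += 1

# The edge weights can decorate the service graph with call frequencies.
for (caller, callee), count in edges.items():
    print(f"{caller} -> {callee}: {count} call(s)")
```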

Request flow graphs represent the execution of individual requests, as we have seen in the examples in the section on the event model. When using the event model, the fan-outs in the flow graph represent parallel execution and fan-ins are joins in the execution. With the span model, the flow graphs can be shown differently; for example, fan-outs can simply represent the calls to other components similar to the service graph, rather than implying concurrency.

Summary

This chapter introduced the fundamental principles underlying most open source, commercial, and academic distributed tracing systems, and the anatomy of a typical implementation. Metadata propagation is the most popular and most frequently implemented approach to correlating tracing records with a particular execution and capturing causal relationships. The event model and the span model are the two competing trace representations, trading expressiveness for ease of use.

We briefly mentioned a few visualization techniques; more examples of visualizations, as well as data mining use cases, will be discussed in subsequent chapters.

In the next chapter, we will go through an exercise to instrument a simple "Hello, World!" application for distributed tracing, using the OpenTracing API.

References

  1. Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. Dynamic Instrumentation of Production Systems. Proceedings of the 2004 USENIX Annual Technical Conference, June 27-July 2, 2004.
  2. Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. October 6–8, 2014.
  3. Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. Using Magpie for request extraction and workload modelling. OSDI '04: Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, 2004.
  4. Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21 (7), July 1978.
  5. Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: a pervasive network tracing framework. In NSDI '07: Proceedings of the 4th USENIX Symposium on Networked Systems Design and Implementation, 2007.
  6. Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed system tracing infrastructure. Technical Report dapper-2010-1, Google, April 2010.
  7. Chris Aniszczyk. Distributed Systems Tracing with Zipkin. Twitter Engineering blog, June 2012: https://blog.twitter.com/engineering/en_us/a/2012/distributed-systems-tracing-with-zipkin.html.
  8. Yuri Shkuro. Evolving Distributed Tracing at Uber Engineering. Uber Engineering blog, February 2017: https://eng.uber.com/distributed-tracing/.
  9. Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, and Yee Jiun Song. Canopy: An End-to-End Performance Tracing and Analysis System. Symposium on Operating Systems Principles, October 2017.
  10. Raja R. Sambasivan, Rodrigo Fonseca, Ilari Shafer, and Gregory R. Ganger. So, You Want To Trace Your Distributed System? Key Design Insights from Years of Practical Experience. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102, April 2014.
  11. The OpenTracing Project: http://opentracing.io/.
Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • A thorough conceptual introduction to distributed tracing
  • An exploration of the most important open standards in the space
  • A how-to guide for code instrumentation and operating a tracing infrastructure

Description

Mastering Distributed Tracing will equip you to operate and enhance your own tracing infrastructure. Through practical exercises and code examples, you will learn how end-to-end tracing can be used as a powerful application performance management and comprehension tool. The rise of Internet-scale companies, like Google and Amazon, ushered in a new era of distributed systems operating on thousands of nodes across multiple data centers. Microservices increased that complexity, often exponentially. It is harder to debug these systems, track down failures, detect bottlenecks, or even simply understand what is going on. Distributed tracing focuses on solving these problems for complex distributed systems. Today, tracing standards have developed and we have much faster systems, making instrumentation less intrusive and data more valuable. Yuri Shkuro, the creator of Jaeger, a popular open-source distributed tracing system, delivers end-to-end coverage of the field in Mastering Distributed Tracing. Review the history and theoretical foundations of tracing; solve the data gathering problem through code instrumentation, with open standards like OpenTracing, W3C Trace Context, and OpenCensus; and discuss the benefits and applications of a distributed tracing infrastructure for understanding, and profiling, complex systems.

Who is this book for?

Any developer interested in testing large systems will find this book very revealing and in places, surprising. Every microservice architect and developer should have an insight into distributed tracing, and the book will help them on their way. System administrators with some development skills will also benefit. No particular programming language skills are required, although an ability to read Java, while non-essential, will help with the core chapters.

What you will learn

  • How to get started with using a distributed tracing system
  • How to get the most value out of end-to-end tracing
  • Learn about open standards in the space
  • Learn about code instrumentation and operating a tracing infrastructure
  • Learn where distributed tracing fits into microservices as a core function
Estimated delivery fee Deliver to United States

Economy delivery 10 - 13 business days

Free $6.95

Premium delivery 6 - 9 business days

$21.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Feb 28, 2019
Length: 444 pages
Edition : 1st
Language : English
ISBN-13 : 9781788628464
Tools :

What do you get with Print?

Product feature icon Instant access to your digital copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Redeem a companion digital copy on all Print orders
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to United States

Economy delivery 10 - 13 business days

Free $6.95

Premium delivery 6 - 9 business days

$21.95
(Includes tracking information)

Product Details

Publication date : Feb 28, 2019
Length: 444 pages
Edition : 1st
Language : English
ISBN-13 : 9781788628464
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 137.97
Hands-On Microservices with Kubernetes
$43.99
Mastering Distributed Tracing
$54.99
Hands-On Infrastructure Monitoring with Prometheus
$38.99
Total $ 137.97 Stars icon

Table of Contents

7 Chapters
I. Introduction Chevron down icon Chevron up icon
II. Data Gathering Problem Chevron down icon Chevron up icon
III. Getting Value from Tracing Chevron down icon Chevron up icon
IV. Deploying and Operating Tracing Infrastructure Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon
Leave a review - let other readers know what you think Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Full star icon 5
(3 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Anonymous Reader Nov 21, 2019
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The chapters range from concepts to practical, a must read for those into the topic.
Amazon Verified review Amazon
Richard Marques Oct 13, 2019
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Very good book for everyone that wants to understand the problems that tracing will be solved using those technologies: open tracing and jaeger.
Amazon Verified review Amazon
X Feb 17, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This is the rare book that is more helpful than existing online documentation. I was tasked with implementing distributed tracing in a Go application; after several days of struggling to piece together an understanding of distributed tracing theory and how to use Jaeger and OpenTracing, I gave in and purchased this book. After reading the first few chapters, I was able to get a proof of concept implemented and understand it well enough to explain it to a coworker. It's an optimal combination of theory and application.While all of the information in this book can be found online; it was far more helpful to have it all in one place, organized by topic. I really appreciated the language-specific examples (fwiw, the languages included are Java, Python, and Go), as they provided a starting point that made it easier to understand the docs for the Go Jaeger library.For context, I am a junior level SRE at a FAANG; prior to this I was a full-stack developer for 3-4 years. I've worked with microservices for the past several years.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the digital copy I get with my Print order? Chevron down icon Chevron up icon

When you buy any Print edition of our Books, you can redeem (for free) the eBook edition of the Print Book you’ve purchased. This gives you instant access to your book when you make an order via PDF, EPUB or our online Reader experience.

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is a customs duty/charge?

Customs duties are charges levied on goods when they cross international borders. It is a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to the countries listed under the EU27 will not incur customs charges; these are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

For shipments to countries outside the EU27, a customs duty or localized taxes may apply. These are charged by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin, and several other factors, such as the total invoice amount, dimensions like weight, and other criteria applicable in your country.

For example:

  • If you live in Mexico and the declared value of your ordered items is over $50, you will have to pay an additional import tax of 19%, which is $9.50, to the courier service in order to receive the package.
  • Whereas if you live in Turkey and the declared value of your ordered items is over €22, you will have to pay an additional import tax of 18%, which is €3.96, to the courier service in order to receive the package.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing it. Simply contact customercare@packt.com with your order details or payment transaction ID. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on its way to you, then once you receive it, you can contact us at customercare@packt.com to use the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., where Packt Publishing agrees to replace your printed book because it arrived damaged or with a material defect); outside of these cases, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work, or is unacceptably late, please contact our Customer Relations Team at customercare@packt.com with the order number and issue details, as explained below:

  1. If you ordered an item (eBook, Video, or Print Book) incorrectly or accidentally, please contact the Customer Relations Team at customercare@packt.com within one hour of placing the order and we will replace or refund the item cost.
  2. Sadly, if your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you (i.e. during download), you should contact the Customer Relations Team at customercare@packt.com within 14 days of purchase, and they will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund for the problem items (damaged, defective, or incorrect).
  4. Once the Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple-item order, we will refund you for that single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receipt of the book, with appropriate evidence of the damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on applicable laws and regulations). A localized VAT fee is charged only to our European and UK customers on the eBooks, videos, and subscriptions they buy. GST is charged to Indian customers for eBook and video purchases.

What payment methods can I use?

You can pay using the following methods:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal