Ensuring consistency and structure
As we already defined, spans are structured events describing interesting operations.
A span’s start time, duration, status, kind, and context are strongly typed; they enable correlation and establish causation, allowing us to visualize traces and detect failures or latency issues.
The span’s name and attributes describe an operation but are not strongly typed or strictly defined. If we don’t populate them in a meaningful way, we can detect an issue but have no knowledge of what actually happened.
For example, for client HTTP calls, beyond the generic properties, we want to capture at least the URL, HTTP method, and response code (or exception); if any of these is missing, we’re partially blind. Once we populate them (one way to do so is sketched after the note below), we can run queries over such spans to answer the following common questions:
- Which dependency calls were made in the scope of this request? Which of them failed? What was the latency of each of them?
- Does my application make independent dependency calls in parallel or sequentially? Does it make unnecessary requests that could be done lazily or skipped altogether?
- Are dependency endpoints configured correctly?
- What are the success or error rates and latency per dependency API?
Note
This analysis relies on the application using the same attributes for all HTTP dependencies. Otherwise, whoever writes these queries will have a hard time creating and maintaining them.
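To make this concrete, here is a minimal sketch of populating these attributes by hand on a client span with .NET’s Activity API. In practice, HTTP client instrumentation libraries set such attributes for you; the class, the ActivitySource name, and the exact attribute names here are illustrative and follow one version of the OpenTelemetry HTTP semantic conventions, which we’ll return to in Chapter 9:

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

public class MemeStorageClient
{
    // The source name is a made-up example.
    private static readonly ActivitySource Source = new("Memes.StorageClient");

    public async Task<HttpResponseMessage> GetAsync(HttpClient http, Uri uri)
    {
        // A client span describing the outgoing HTTP call
        using var activity = Source.StartActivity("GET", ActivityKind.Client);
        activity?.SetTag("http.request.method", "GET");
        activity?.SetTag("url.full", uri.ToString());
        activity?.SetTag("server.address", uri.Host);

        try
        {
            HttpResponseMessage response = await http.GetAsync(uri);
            activity?.SetTag("http.response.status_code", (int)response.StatusCode);
            if (!response.IsSuccessStatusCode)
            {
                activity?.SetStatus(ActivityStatusCode.Error);
            }
            return response;
        }
        catch (Exception ex)
        {
            // Record the failure so queries over such spans can surface error rates
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

Because every HTTP client span now carries the same method, URL, host, and status attributes, the questions above boil down to filtering client spans and grouping by those attributes.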
With unified, community-driven telemetry collection taken off observability vendors’ plates, they can now focus fully on (semi-)automating analysis and giving us powerful performance and fault analysis tools.
OpenTelemetry defines a set of semantic conventions for spans, traces, and resources, which we’ll talk more about in Chapter 9, Best Practices.
Building application topology
Distributed tracing, combined with semantic conventions, allows us to build visualizations such as an application map (also known as a service map), as shown in Figure 1.11: you can see your whole system along with key health metrics. It’s an entry point to any investigation.
Figure 1.11 – An Azure Monitor application map for a meme service is an up-to-date system diagram with all the basic health metrics
Observability vendors rely on trace and metric semantics to build service maps. For example, the presence of HTTP attributes on a client span indicates an outgoing HTTP call, so we draw an outgoing edge to a new dependency node and name that node after the span’s host attribute.
If we see the corresponding server span, we can now merge the server node with the dependency node, based on span context and causation. There are other visualizations or automation tools that you might find useful – for example, critical path analysis, or finding common attributes that correspond to higher latency or error rates. Each of these relies on span properties and attributes following common semantics or at least being consistent across services.
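As an illustration only, and not any vendor’s actual algorithm, here is a hypothetical sketch of deriving service-map edges from a batch of spans. The Span record, the server.address attribute fallback, and the merging rule are simplified assumptions:

```csharp
using System.Collections.Generic;
using System.Linq;

public enum SpanKind { Internal, Client, Server }

// A simplified stand-in for exported span data
public record Span(
    string TraceId, string SpanId, string? ParentSpanId,
    SpanKind Kind, string ServiceName,
    IReadOnlyDictionary<string, string> Attributes);

public static class ServiceMap
{
    public static IEnumerable<(string From, string To)> BuildEdges(IReadOnlyList<Span> spans)
    {
        // Index server spans by (trace id, parent span id) so a client span can find
        // the server span it caused; assumes at most one server span per client span.
        var serverByParent = spans
            .Where(s => s.Kind == SpanKind.Server && s.ParentSpanId is not null)
            .ToDictionary(s => (s.TraceId, s.ParentSpanId!));

        foreach (var client in spans.Where(s => s.Kind == SpanKind.Client))
        {
            // HTTP attributes on a client span mean an outgoing call, so we draw an edge.
            // If the callee reported a matching server span, merge the dependency node
            // into that service's node; otherwise, name it after the host attribute.
            var target = serverByParent.TryGetValue((client.TraceId, client.SpanId), out var server)
                ? server.ServiceName
                : client.Attributes.GetValueOrDefault("server.address", "unknown");

            yield return (client.ServiceName, target);
        }
    }
}
```

The sketch only works because all services populate the same attributes; if one service used a different host attribute name, its dependencies would show up as unknown nodes.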
Resource attributes
Resource attributes describe the process, host, service, and environment, and are the same for all spans reported by the service instance – for example, the service name, version, unique service instance ID, cloud provider account ID, region, availability zone, and K8s metadata.
These attributes allow us to detect anomalies specific to certain environments or instances – for example, an error rate increase only on instances that have a new version of code, an instance that goes into a restart loop, or a cloud service in a region and availability zone that experiences issues.
Based on standard attributes, observability vendors can write generic queries to perform this analysis or build common dashboards. It also enables the community to create vendor-agnostic tools and solutions for popular technologies.
Such attributes describe a service instance and don’t have to appear on every span – OTLP, for example, passes resource attributes once per batch of spans.
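For instance, here is a minimal sketch of declaring resource attributes once, when configuring tracing with the OpenTelemetry .NET SDK and the OTLP exporter package; the service name, version, and extra attribute values are placeholders:

```csharp
using System;
using System.Collections.Generic;
using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Every span exported by this provider shares these resource attributes,
// so generic queries and dashboards can group by them without per-span work.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: "memes-frontend",            // placeholder values
            serviceVersion: "1.2.3",
            serviceInstanceId: Guid.NewGuid().ToString())
        .AddAttributes(new Dictionary<string, object>
        {
            ["deployment.environment"] = "production",
            ["cloud.region"] = "westus2"
        }))
    .AddSource("Memes.StorageClient")   // hypothetical ActivitySource from the earlier sketch
    .AddOtlpExporter()
    .Build();
```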