Preface
If you have worked on distributed applications, infrastructure, or client libraries, you’ve likely encountered numerous ways in which distributed systems can break.
For example, a default retry policy on your service can bring it down along with all its dependencies. Race conditions can lead to deadlocks under certain loads or result in data leaking between user accounts. User operations that usually take milliseconds can slow down significantly while service dashboards show no other signs of trouble. Functional problems can cause obscure and inexplicable effects on the user’s side.
When working with distributed applications, we rely on telemetry to assess their performance and functionality. We need even more telemetry to identify and mitigate issues.
In the past, we relied on custom logs and metrics collected using vendor-specific SDKs. We built custom parsers, processing pipelines, and reporting tools to make telemetry usable.
However, as applications have become more complex, we require better and more user-friendly approaches to understand what is happening in our systems. Personally, I find it unproductive to read through megabytes of logs or visually detect anomalies in metrics.
Distributed tracing is a technique that allows us to trace operations throughout the entire system. It adds correlation and causality to our telemetry, enabling us to retrieve all the relevant data describing a specific operation or to find all operations that share a context, such as a requested resource or a user identifier.
Distributed tracing alone is not enough; we need other telemetry signals such as metrics, events, logs, and profiles, as well as libraries to collect and export them to observability backends. Fortunately, we have OpenTelemetry for this purpose. OpenTelemetry is a cloud-native, vendor-neutral telemetry platform available in multiple programming languages. It offers the core components necessary to collect custom data along with instrumentation libraries for common technologies. OpenTelemetry standardizes telemetry formats for different signals, ensuring correlation, consistency, and structure in the collected data.
By leveraging consistent and structured telemetry, different observability vendors can provide tools such as service maps, trace visualizations, error classification, and detection of common properties contributing to failures. This essentially allows us to automate the error-prone and tedious parts of performance analysis that humans struggle with. Monitoring and debugging techniques can now become standardized practices across the industry, no longer relying on tribal knowledge, runbooks, or outdated documentation.
Modern Distributed Tracing in .NET explores all aspects of telemetry collection in .NET applications, with a focus on distributed tracing and performance analysis. It begins with an overview of observability challenges and solutions and then delves into the built-in monitoring capabilities offered by modern .NET. These capabilities become even more impressive when used alongside OpenTelemetry. While shared OpenTelemetry instrumentation libraries can take us a long way, sometimes we still need to write custom instrumentation. The book shows how to collect custom traces, metrics, and logs while considering performance impact and verbosity. It also covers the instrumentation of common cloud patterns such as network calls, messaging, and database interactions. Finally, it discusses the organizational and technical aspects of implementing and evolving observability in existing systems.
The observability field is still relatively new and rapidly evolving, which means there are often multiple solutions available for almost any problem. This book aims to explain fundamental observability concepts and to provide several possible solutions to common problems while highlighting the associated trade-offs. It will also help you gain practical skills to implement and leverage tracing and observability.
I hope you find the provided examples useful and use them as a playground for experimentation. I encourage you to explore new and creative approaches to making distributed systems more observable and to share your findings with the community!