Reviewing the history of observability
In many ways, understanding what a computer is doing is both the fun and the challenge of working with software. The ability to understand how systems behave has gone through quite a few iterations since the early 2000s. Several markets have emerged to address this need, such as systems monitoring, log management, and application performance monitoring. As is often the case, when new challenges come knocking, the doors of opportunity open to those willing to tackle them. Over the same period, countless vendors and open source projects have sprung up to help the people who build and operate services manage their systems. The term observability, however, is a recent addition to the software industry and comes from control theory.
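Wikipedia (https://en.wikipedia.org/wiki/Observability) defines observability as:
"In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."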
Observability is an evolution of its predecessors, built on lessons learned through years of experience and trial and error. To better understand where observability is today, it's important to understand where some of the methods used today by cloud-native application developers come from, and how they have changed over time. We'll start by looking at the following:
- Centralized logging
- Metrics and dashboards
- Tracing and analysis
Centralized logging
One of the first pieces of software a programmer writes when learning a new language is a form of observability: "Hello, World!". Printing some text to the terminal is usually one of the quickest ways to provide users with feedback that things are working, and that's why "Hello, World!" has been a tradition in computing since at least the early 1970s.
One of my favorite methods for debugging is still to add print statements across the code when things aren't working. I've even used this method to troubleshoot an application distributed across multiple servers before, although I can't say it was my proudest moment, as it caused one of our services to go down temporarily because of a typo in an unfamiliar editor. Print statements are great for simple debugging, but unfortunately, this only scales so far.
Once an application is large enough or distributed across enough systems, searching through the logs on individual machines is not practical. Applications can also run on ephemeral machines that may no longer be present when we need those logs. Combined, all of this created a need to make the logs available in a central location for persistent storage and searchability, and thus centralized logging was born.
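To make the idea concrete, the following is a minimal sketch of how an application might emit logs in a form that a log shipper could forward to a central store. It assumes a one-JSON-object-per-line format; the field names, the logger name, and the use of an in-memory stream in place of standard output are illustrative assumptions, not a format required by any particular tool.

```python
import io
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, tagged with its origin host."""

    def format(self, record):
        return json.dumps({
            "timestamp": record.created,      # seconds since the epoch
            "host": socket.gethostname(),     # lets a central store tell machines apart
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        })


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Hypothetical setup: the StringIO stands in for stdout or a file that a
# log shipper would read and forward to central storage.
stream = io.StringIO()
log = build_logger("checkout", stream)
log.info("order %s placed", 42)
print(stream.getvalue(), end="")
```

Because every line carries the host and logger name, the records stay meaningful after they leave the machine that produced them, which matters when that machine is ephemeral.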
Many vendors provide a destination for logs, along with features for searching those logs and alerting based on them. There are also many open source projects that have tried to tackle the challenges of standardizing log formats, providing mechanisms for transport, and storing the logs. The following are some of these projects:
- Fluentd – https://www.fluentd.org
- Logstash – https://github.com/elastic/logstash
- Apache Flume – https://flume.apache.org
Centralized logging additionally provides the opportunity to produce metrics about the data across the entire system.
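As a rough illustration of producing metrics from centralized logs, the sketch below counts records per service and level across several hosts. The log lines, host names, and field names are made up for the example, and it assumes the logs arrive as one JSON object per line.

```python
import json
from collections import Counter

# Hypothetical log lines, as they might look once collected from several hosts.
LOG_LINES = [
    '{"host": "web-1", "service": "checkout", "level": "INFO", "message": "order placed"}',
    '{"host": "web-2", "service": "checkout", "level": "ERROR", "message": "payment timeout"}',
    '{"host": "web-1", "service": "cart", "level": "INFO", "message": "item added"}',
    '{"host": "web-3", "service": "checkout", "level": "ERROR", "message": "payment timeout"}',
]


def count_by_service_and_level(lines):
    """Count log records per (service, level) pair, regardless of host."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        counts[(record["service"], record["level"])] += 1
    return counts


counts = count_by_service_and_level(LOG_LINES)
for (service, level), total in sorted(counts.items()):
    print(service, level, total)
```

The count of checkout errors here spans three machines, which is exactly the kind of system-wide view that is hard to get by reading logs on individual hosts.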
Using metrics and dashboards
Metrics are possibly the most well-known of the tools available in the observability space. Think of the temperature on a thermometer, the speed on the speedometer of a car, or the time on a watch. We humans love measuring and quantifying things. From the early days of computing, being able to keep track of how resources were utilized was critical in ensuring that multi-user environments provided a good experience for all users of the system.
Nowadays, measuring application and system performance via the collection of metrics is common practice in software development. This data is converted into graphs to generate meaningful visualizations for those in charge of monitoring the health of a system.
These metrics can also be used to configure alerting when certain thresholds have been reached, such as when an error rate exceeds an acceptable percentage. In certain environments, metrics are used to automate workflows in reaction to changes in the system, such as increasing the number of application instances or rolling back a bad deployment. As with logging, many vendors and projects have, over time, provided their own solutions for metrics, dashboards, monitoring, and alerting. Some of the open source projects that focus on metrics are as follows:
- Prometheus – https://prometheus.io
- StatsD – https://github.com/statsd/statsd
- Graphite – https://graphiteapp.org
- Grafana – https://github.com/grafana/grafana
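The threshold-based alerting described earlier can be sketched in a few lines. This is a simplified illustration rather than how any of the listed projects implement it: the counter class, the 5% threshold, and the simulated status codes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestCounters:
    """Two counters of the kind a metrics library would typically expose."""
    total: int = 0
    errors: int = 0

    def record(self, ok):
        """Count one request, and count it as an error if it did not succeed."""
        self.total += 1
        if not ok:
            self.errors += 1

    def error_rate(self):
        """Fraction of requests that failed, or 0.0 before any traffic."""
        return self.errors / self.total if self.total else 0.0


# Hypothetical threshold: alert when more than 5% of requests fail.
ERROR_RATE_THRESHOLD = 0.05


def should_alert(counters, threshold=ERROR_RATE_THRESHOLD):
    """Return True when the observed error rate exceeds the threshold."""
    return counters.error_rate() > threshold


counters = RequestCounters()
# Simulated traffic: 18 successful responses and 2 server errors.
for status in [200] * 18 + [500] * 2:
    counters.record(ok=status < 500)

print(counters.error_rate(), should_alert(counters))
```

In practice, the same comparison could just as well trigger an automated reaction, such as scaling out or rolling back, instead of notifying a person.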
Let's now look at tracing and analysis.
Applying tracing and analysis
Tracing an application means being able to follow its code as it executes and confirm that it's doing what is expected. This can often, but not always, be achieved in development using a debugger such as GDB (https://www.gnu.org/software/gdb/) or, in Python, PDB (https://docs.python.org/3/library/pdb.html). This approach becomes impractical when debugging an application that is spread across multiple services on different hosts across a network. In 2010, researchers at Google published a white paper on a large-scale distributed tracing system built internally: Dapper (https://research.google/pubs/pub36356/). In this paper, they describe the challenges of distributed systems, as well as the approach that was taken to address the problem. This research is the basis of distributed tracing as it exists today. After the paper was published, several open source projects sprang up to provide users with the tools to trace and visualize applications using distributed tracing:
- OpenTracing – https://opentracing.io
- OpenCensus – https://opencensus.io
- Zipkin – https://zipkin.io
- Jaeger – https://www.jaegertracing.io
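To give a feel for what distributed tracing involves, here is a heavily simplified sketch of the core idea: each unit of work is recorded as a span, and a shared trace ID plus a parent span ID are passed along with each request so the spans can later be stitched into one trace. The header names, the Span fields, and the two "services" are hypothetical and do not reflect the API or wire format of any of the projects listed above.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed unit of work within a trace."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self):
        self.end = time.time()


FINISHED = []  # stands in for a tracing backend that collects spans


def start_span(name, headers=None):
    """Continue the trace found in the headers, or start a new one."""
    headers = headers or {}
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    return Span(name=name, trace_id=trace_id,
                parent_id=headers.get("x-parent-span-id"))


def inject(span):
    """Build the (hypothetical) headers that carry trace context downstream."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def backend_service(headers):
    span = start_span("backend: query", headers)
    span.finish()
    FINISHED.append(span)


def frontend_service():
    span = start_span("frontend: handle request")
    backend_service(inject(span))  # stands in for a call across the network
    span.finish()
    FINISHED.append(span)


frontend_service()
for s in FINISHED:
    print(s.trace_id, s.span_id, s.parent_id, s.name)
```

Both spans share a trace ID, and the backend span points at the frontend span as its parent, which is what lets a tracing tool reconstruct the path of a single request across services.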
As you can imagine, with so many tools, it can be daunting to even know where to begin on the journey to making a system observable. Users and organizations must spend time and effort upfront to even get started. This can be challenging when other deadlines are looming. Not only that, but the time investment needed to instrument an application can be significant depending on the complexity of the application, and the return on that investment sometimes isn't made clear until much later. The time and money invested, as well as the expertise required, can make it difficult to change from one tool to another if the initial implementation no longer fits your needs as the system evolves.
Such a wide array of methods, tools, libraries, and standards has also caused fragmentation in the industry and the open source community. Libraries end up supporting one format or another, which leaves it up to users to fill any gaps within their own environments. It also means that effort is required to maintain feature parity across different projects. All of this could be addressed by bringing the people working in these communities together.
With a better understanding of the different tools at the disposal of application developers, their evolution, and their role, we can start to better appreciate the scope of what OpenTelemetry is trying to solve.