Observability in a nutshell
The term observability is borrowed from control theory. In IT systems, it is often used interchangeably with the term monitoring because the two concepts are closely related. Monitoring is the ability to raise an alarm when something is wrong, while observability is the ability to understand a system and determine whether something is wrong, and why.
Control theory was formalized in the late 1800s through the study of centrifugal governors in steam engines. The following diagram shows a simplified view of such a system:
Figure 1.1 – James Watt’s steam engine flyweight governor (source: https://www.mpoweruk.com)
Steam engines use a boiler to boil water in a pressure vessel. The steam pushes a piston backward and forward, which converts heat energy into reciprocating (back-and-forth) motion. In steam engines that use centrifugal governors, this reciprocating motion is converted to rotational motion via a wheel connected to the piston. The centrifugal governor provides a physical link backward through the system to the throttle. This means that the speed of rotation controls the throttle, which, in turn, controls the speed of rotation. Physically, this is observed by the balls on the governor flying outward and dropping inward until the system reaches equilibrium.
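The feedback loop described above can be sketched as a small simulation. This is only a toy model, not a physical description of a real engine: the function name `simulate_governor`, the gain, the engine constants, and the units are all assumptions made for illustration. It shows how a speed measurement that adjusts the throttle, which in turn changes the speed, settles at an equilibrium.

```python
def simulate_governor(setpoint_rpm, load, gain=0.002, steps=200):
    """Toy proportional governor; every constant here is illustrative."""
    speed = 0.0
    history = []
    for _ in range(steps):
        # Governor: the faster the shaft spins, the further the throttle closes.
        throttle = min(1.0, max(0.0, 0.5 + gain * (setpoint_rpm - speed)))
        # Engine: speed drifts toward a value set by the throttle and the load.
        target_speed = 200.0 * throttle - load
        speed += 0.1 * (target_speed - speed)
        history.append(speed)
    return history


unloaded = simulate_governor(setpoint_rpm=100.0, load=0.0)
loaded = simulate_governor(setpoint_rpm=100.0, load=20.0)
print(f"unloaded equilibrium: {unloaded[-1]:.1f} rpm")
print(f"loaded equilibrium:   {loaded[-1]:.1f} rpm")
```

In this sketch, the loaded engine settles at a lower speed than the unloaded one. The governor reduces the effect of the extra load but does not remove it completely, which is typical of purely proportional feedback.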
Monitoring defines the metrics or events that are of interest in advance. For instance, the governor measures the pre-defined metric of drive shaft revolutions. The controllability of the throttle is then provided by the pivot and actuator rod assembly. Assuming the actuator rod is adjusted correctly, the governor should control the throttle from fully open to fully closed.
In contrast, observability is achieved by allowing the internal state of the system to be inferred from its external outputs. If the operating point adjustment is incorrectly set, the governor may spin too fast or too slowly, rendering the throttle control ineffective. A governor spinning too fast or too slowly could also indicate that the sliding ring is stuck in place and needs oiling. Importantly, this insight can be gained without defining in advance what too fast or too slow means. The insight that the governor is spinning too fast or too slowly also needs very little knowledge of the full steam engine.
Fundamentally, both monitoring and observability are used to improve the reliability and performance of the system in question.
Now that we have introduced the high-level concepts, let’s explore a practical example outside of the world of software services.
Case study – A ship passing through the Panama Canal
Let’s imagine a ship traversing the Agua Clara locks on the Panama Canal. This can be illustrated using the following figure:
Figure 1.2 – The Agua Clara locks on the Panama Canal
There are a few aspects of these locks that we might want to monitor:
- The successful opening and closing of each gate
- The water level inside each lock
- How long it takes for a ship to traverse the locks
Monitoring these aspects may highlight situations that we need to be alerted about:
- A gate is stuck open because of a mechanical failure
- The water level is dropping rapidly because of a leak
- A ship is taking too long to exit the locks because it is stuck
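The alert conditions listed above could be expressed as checks against limits that are defined in advance. The following is a minimal sketch under stated assumptions: the `LockReading` fields, the `check_alerts` function, and both limit values are hypothetical and do not describe how the canal is actually instrumented.

```python
from dataclasses import dataclass

# Hypothetical alert limits, chosen only for illustration.
MAX_LEVEL_DROP_M_PER_MIN = 0.05
MAX_TRAVERSAL_MINUTES = 150.0


@dataclass
class LockReading:
    gate_commanded_closed: bool   # the operator asked for the gate to close
    gate_reports_closed: bool     # the gate sensor confirms it is closed
    level_drop_m_per_min: float   # measured while the lock should hold its level
    minutes_in_locks: float       # time since the ship entered the locks


def check_alerts(reading: LockReading) -> list[str]:
    """Return an alert message for every pre-defined limit that is breached."""
    alerts = []
    if reading.gate_commanded_closed and not reading.gate_reports_closed:
        alerts.append("gate stuck open")
    if reading.level_drop_m_per_min > MAX_LEVEL_DROP_M_PER_MIN:
        alerts.append("water level dropping rapidly")
    if reading.minutes_in_locks > MAX_TRAVERSAL_MINUTES:
        alerts.append("ship taking too long to exit")
    return alerts


print(check_alerts(LockReading(True, False, 0.2, 200.0)))
```

Each check only fires when a limit chosen ahead of time is crossed, which is the defining trait of monitoring in this example.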
There may be situations where the data we are monitoring are within acceptable limits, but we can still observe a deviation from what is considered normal, which should prompt further action:
- A small leak has formed near the top of the lock wall:
  - We would see the water level drop but only when it is above the leak
  - This could prompt maintenance work on the lock wall
- A gate in one lock is opening more slowly because it needs maintenance:
  - We would see the time between opening and closing the gate increase
  - This could prompt maintenance on the lock gate
- Ships take longer to traverse the locks when the wind is coming from a particular direction:
  - We could compare hourly average traversal rates
  - This could prompt work to reduce the impact of wind from one direction
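The wind example above can be sketched as a comparison between groups rather than a check against a fixed limit. The code below is illustrative only: the function `slow_wind_directions`, the 10% tolerance, and the sample traversal times are assumptions, not real canal data. Every sample value could sit within an alerting limit and the pattern would still be visible.

```python
from collections import defaultdict
from statistics import mean


def slow_wind_directions(traversals, tolerance=0.10):
    """Return wind directions whose mean traversal time exceeds the overall mean.

    traversals: iterable of (wind_direction, traversal_minutes) pairs.
    tolerance:  fraction above the overall mean that counts as a deviation.
    """
    rows = list(traversals)
    by_direction = defaultdict(list)
    for direction, minutes in rows:
        by_direction[direction].append(minutes)

    overall_mean = mean([minutes for _, minutes in rows])
    return {
        direction: mean(values)
        for direction, values in by_direction.items()
        if mean(values) > overall_mean * (1 + tolerance)
    }


# Made-up sample data: (wind direction, minutes to traverse the locks).
sample = [("N", 100), ("N", 102), ("E", 98), ("E", 100), ("W", 130), ("W", 126)]
print(slow_wind_directions(sample))
```

No threshold for "too slow" is defined in advance here. The deviation is inferred by comparing the system's outputs with each other, which is closer to how the text describes observability.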
Now that we’ve seen an example of measuring a real-world system, we can group these types of measurements into different data types to best suit the application. Let’s introduce those now.