Failing forward fast
Observability is a multi-faceted topic. It can mean different things to different teams. Traditionally, observability has focused on monitoring infrastructure. However, with our serverless systems, we delegate that responsibility to the cloud provider. This means that we can apply observability at a higher level and use it to drive innovation and deliver business value. In other words, observability can help us drive down lead time and move faster.
In Chapter 1, Architecting for Innovation, we dissected lead time so that we could understand what causes it to increase. For example, teams will naturally put on the brakes and slow down when they fear that a change could inadvertently break another part of the system. This is why we build bulkheads throughout our systems.
Teams will also slow down when they do not have enough information about the health and performance of the system. No process is perfect. We cannot eliminate honest human error. Teams will make mistakes. Therefore, when a team pushes a change to production, they need to be confident that they can find and fix problems fast. Otherwise, they will be apprehensive about moving forward and innovating.
This is where observability comes into play at the application level. We need to put working software into the end users’ hands in production so that we can get the feedback we need to discover the real requirements. If a deployment has an unintended side effect, then we want the system to alert the team, preferably before the end user notices. The team can then jump into action at a moment’s notice and use the detailed observability information to find the root cause and minimize the mean time to recovery (MTTR). In other words, observability helps us fail forward fast.
Now, let’s see how serverless makes observability easier.