The need for observability in a distributed application environment

Let’s suppose you want to create the definitive Hello World program so that no other developer will need to implement it again. But you want to add a minor new feature: the users can give their names, and the application should remember them, all based on modern REST APIs. So, you implement something as follows:

from flask import Flask, request
import os.path

app = Flask(__name__)

@app.route("/")
def hello_world():
    # Store the name if it arrives as a query parameter
    name = request.args.get('name')
    if name:
        with open("name.txt", "w") as text_file:
            text_file.write(name)
    # Read back whatever name was stored previously, if any
    name_file = None
    if os.path.exists("name.txt"):
        with open("name.txt") as text_file:
            name_file = text_file.read()
    # Greet the stored name, or fall back to the generic greeting
    if name_file:
        return {
            "msg": f"Hello, {name_file}!"
        }
    return {
        "msg": "Hello, World!"
    }

In this small example, written in the Python (https://www.python.org) language and using the Flask (https://flask.palletsprojects.com/en/2.0.x/) web framework, we have an optional name query parameter; if we receive it, we store it in a file. We then always read from the file, and if there's something in it, we return a friendly hello to our old, returning friend. Otherwise, we return an equally friendly but generic Hello, World! message.

We can see an example of user interaction with our REST API here:

> curl http://127.0.0.1:5000/
{"msg":"Hello, World!"}
> curl http://127.0.0.1:5000/?name=User
{"msg":"Hello, User!"}
> curl http://127.0.0.1:5000/
{"msg":"Hello, User!"}

Our local tests show the implementation works as intended, so we are ready to shock and revolutionize the world. Our organization follows best practices, so we need to define and monitor key application metrics before we deploy our application to production. After years of deploying and monitoring applications, we, as software engineers, start to understand what can go wrong and what to keep an eye on. Usually, applications are CPU-, memory-, or I/O-intensive. Given that our application writes and reads data to/from a file, we decide a key metric is input/output operations per second (IOPS). We add the necessary tools to monitor IOPS, plus CPU and memory just in case. We also create dashboards to give us visual clues about our current state, and we implement alarms to notify us when we think we are approaching any system limits. This all looks good, so let's open the gates for our beloved users!
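
For instance, a small background publisher could push the host-level resource metrics we just described to Amazon CloudWatch. The following is only a minimal sketch, assuming the boto3 and psutil libraries are installed and AWS credentials are configured; the namespace and metric names are illustrative, not prescribed:

import time
import boto3
import psutil

cloudwatch = boto3.client("cloudwatch")

def publish_host_metrics(namespace="HelloWorldApp"):
    # Cumulative disk counters; a real agent would publish deltas or rates (IOPS)
    io = psutil.disk_io_counters()
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {"MetricName": "CPUUtilization", "Value": psutil.cpu_percent(), "Unit": "Percent"},
            {"MetricName": "MemoryUtilization", "Value": psutil.virtual_memory().percent, "Unit": "Percent"},
            {"MetricName": "DiskReadOps", "Value": io.read_count, "Unit": "Count"},
            {"MetricName": "DiskWriteOps", "Value": io.write_count, "Unit": "Count"},
        ],
    )

if __name__ == "__main__":
    while True:
        publish_host_metrics()
        time.sleep(60)  # publish once a minute

In practice, an agent such as the CloudWatch agent would collect these host metrics for us; the point is simply that each metric has to be defined, emitted, and wired to a dashboard or alarm ahead of time.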

But after a few users start to use our application, reports of unexpected behavior begin to pour into our issue tracking system. Some users sent their names, but the application failed to store them. Even worse, some users received the names of other users, a significant data privacy leak. Nobody wants to be in the news because of that.

What happened to our perfect, simple, little application? During the deployment, our operations teams used a typical deployment pattern to increase the application’s scalability and availability, as shown in the following diagram:

Figure 1.1 – Load balancing requests to multiple servers

Many of you may recognize the pattern described in this diagram. For many years, even on-premises operations teams have deployed multiple nodes of the same application behind a load balancer, which distributes incoming requests to all of them in a round-robin fashion. In this way, you can quickly scale the number of requests the application can handle by adding nodes, and if a node fails, the load balancer automatically redirects new requests to the remaining healthy nodes.
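
To make the distribution strategy concrete, here is a toy round-robin dispatcher. It is only a sketch to illustrate the pattern (a real load balancer does far more), and the node addresses are made up for the example:

from itertools import cycle
import requests

# Hypothetical backend nodes sitting behind our toy "load balancer"
NODES = cycle([
    "http://10.0.1.10:5000",
    "http://10.0.1.11:5000",
    "http://10.0.1.12:5000",
])

def forward(path="/", params=None):
    # Each incoming request goes to the next node in the rotation,
    # regardless of which node handled the previous request
    node = next(NODES)
    return requests.get(f"{node}{path}", params=params).json()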

We look at our configured metrics and we are clueless. None of our metrics helps us solve the problem. We deploy new metrics. We watch the problem occur a couple more times (with new, angry users). And after debugging a bit, we find that the users who could not see their names after sending them received responses from servers that did not have their names in local storage. Even worse, the users who received other people's names got responses from servers that had stored someone else's name. What a mess!
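
One simple way to confirm which node answered each request during this kind of debugging is to tag every response with the host that produced it. This is not part of the original application, and the header name below is just something we made up for the investigation:

import socket
from flask import Flask

app = Flask(__name__)

@app.after_request
def tag_responding_node(response):
    # Expose which node produced each response; comparing this header across
    # requests makes the "name stored on a different node" pattern obvious
    response.headers["X-Served-By"] = socket.gethostname()
    return response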

Postmortem time: what happened, and how can we prevent it from happening again? When our operations team deployed our application behind a load balancer, we no longer had just one node but many. New nodes could appear and disappear at any time. This churn of nodes, combined with the fact that we keep the application state on each individual node, caused the issue.

This is a simplistic, even silly, example of the jump in complexity from the local, single-user development environment to a distributed, multi-node, auto-scaling production environment. Our code is simple, and because of that, we thought nothing could go wrong. But there are many things outside our application code that we don't entirely understand, yet take for granted: the CPU run queue, kernel multi-threading, the language virtual machine, the network stack, the load-balancing strategy, and many more. They all hold part of the application state and are potential root causes of an issue.

This simple example shows that an application that was initially observable when deployed as a standalone process, as many monoliths are, no longer remains observable as soon as we adopt modern techniques such as multiple nodes and load balancing. Those components added complexity and issues we didn't expect. As our user base grows and we split our monolithic application into many related services, what was the right observability tool before may not be the right tool now. This mismatch can catch us off guard because the jump in complexity is exponential. As a terrifying example, see the following graph:

Figure 1.2 – Real-time graph of microservice dependencies at http://amazon.com in 2008

In our small example, we applied the usual techniques under the monitoring umbrella. The practice of monitoring is good enough for monolithic and small-scale distributed applications, and in this book, we will start with them and progress, showing you the right tools for the job. With some experience, operations teams can reduce the potential failure space from hundreds, maybe thousands, of possibilities to a few. But we expect our businesses to grow, and the supporting applications with them. The number of possible application and error states grows exponentially. As soon as our application reaches a certain size, at any moment, a call in the middle of the night can quickly become a sleepless night while we try to navigate the maze of our metrics to find the set of inputs that caused a new, unforeseen issue.

Modern applications have gotten good at accounting for failures that can be caught by tests, and they use established techniques such as autoscaling and failover to become more resilient. As we catch up on the known variables and take action to monitor them, what remains are the unknown unknowns. The issues we often see in modern applications are emergent failure modes, which happen when many unlikely events line up to degrade the performance of the system or even take it down. These scenarios are challenging to debug, and that is precisely where observability becomes necessary.

If we want to understand any application state without deploying new code, we need to collect as much context as possible and store it all. We need mechanisms to query, slice, and summarize this data in new ways. Some of this complexity may not fit in our human brains anymore, so the support of machine learning tools is a must. Dashboards and alarms continue to be necessary for the well-known failure states, but to reach the next step, we need new tools in our tool belt.
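
As a small illustration of what "collecting as much context as possible" can look like in code, we could emit one structured JSON event per request, carrying every attribute we might later want to query, slice, or summarize. This is only a sketch (it drops the file persistence for brevity), and the field names are examples, not a standard:

import json
import logging
import socket
import time
from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")

@app.route("/")
def hello_world():
    start = time.time()
    name = request.args.get("name")
    msg = f"Hello, {name}!" if name else "Hello, World!"
    # One wide, structured event per request: the more context we attach now,
    # the more questions we can answer later without deploying new code
    logging.info(json.dumps({
        "event": "request_handled",
        "path": request.path,
        "query_name": name,
        "client_ip": request.remote_addr,
        "node": socket.gethostname(),
        "duration_ms": round((time.time() - start) * 1000, 2),
    }))
    return {"msg": msg}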

So far, we have seen what observability is and how it evolved from more traditional monitoring practices to support more complex systems. We saw the need to collect more data and to answer questions we didn't know we should ask. In the next section, we will see the basic observability components and how they relate to each other.
