Chapter 16: Designing for Chaos
Writing software that works in perfect conditions is easy. It would be nice if you never had to worry about network latency, service timeouts, storage outages, misbehaving applications, users sending bad arguments, security issues, or any of the other real-life scenarios we find ourselves dealing with.
In my experience, things tend to fail in the following three ways:
- Immediately
- Gradually
- Spectacularly
Immediately is usually the result of a change to application code that causes a service to die on startup or when an endpoint receives traffic. Most development test environments or canary rollouts catch these failures before any real problems occur in production. This type of failure is generally trivial to fix and prevent.
Gradually is usually the result of a memory leak, a thread/goroutine leak, or ignored design limitations. These issues build up over time until services begin crashing or latency grows at...