Resiliency
Security is about preventing fraudulent activities, the theft of data, and other improper behavior that could lead to service disruptions. However, our application can go down or provide degraded service for several other reasons. This could be due to a traffic spike causing an overload, a software bug, or a hardware failure.
The core concept behind the resiliency of a system, and one that is sometimes underestimated, is the Service Level Agreement (SLA).
An SLA is an attempt to quantify, and usually to enforce through a contract, the core metrics that our service should respect.
Uptime
The most widely used SLA metric is uptime, which measures the availability of the system. It is a basic metric, and it is usually most meaningful for services that provide essential components, such as connectivity or access to storage. However, if we consider more complex systems (such as an entire application, or a set of different applications, as in microservices architectures), uptime becomes harder to define...
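To make the uptime metric more concrete, the following is a minimal sketch in Python of how an availability percentage translates into a downtime budget, and how a measured availability can be derived from observed downtime. The function names (allowed_downtime, measured_uptime_pct) and the 30-day measurement period are assumptions chosen for illustration; a real SLA defines its own measurement window and its own rules for what counts as downtime.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime that still meets the availability target.

    availability_pct is a percentage, for example 99.9 (hypothetical target).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The downtime budget is the fraction of the period not covered by the target.
    return period * (1 - availability_pct / 100)


def measured_uptime_pct(downtime: timedelta, period: timedelta) -> float:
    """Return the availability percentage observed over the period."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    # Dividing two timedelta objects yields a plain float ratio.
    return 100 * (1 - downtime / period)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed measurement window
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days allows {allowed_downtime(target, month)} of downtime")
```

With these assumptions, a 99.9% target over a 30-day period leaves a budget of roughly 43 minutes of downtime, which shows how quickly each additional "nine" shrinks the margin for failures and maintenance.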