Self-healing is a very important property of large-scale systems made up of a myriad of physical and virtual components. Microservice-based systems running on large Kubernetes clusters are a prime example. Components can fail in multiple ways. The premise of self-healing is that the overall system will not fail and will be able to automatically heal itself, even if this causes it to operate in a reduced capacity temporarily.
The building blocks of such reliable systems are as follows:
- Redundancy
- Observability
- Auto-recovery
The basic premise is that every component might fail – machines crash, disks get corrupted, network connections drop, configuration may get out of sync, new software releases have bugs, third-party services have outages, and so on. Redundancy means there are no single point of failures (SPOFs). You can run multiple replicas...