Chaos engineering
While it’s good to think through which combinations of systems and services you can disable and restart, there is always the possibility that you have missed something. For greater confidence in your application’s resilience, you can practice chaos engineering: automatically and deliberately causing failures throughout your system.
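As a rough illustration of automatically causing failures, the following minimal Python sketch picks one service at random, disables it, and then checks whether the system as a whole still reports healthy. Everything here is an assumption made for illustration: the function names, the service names, and the `disable` and `is_healthy` callables are hypothetical stand-ins for whatever your own orchestration and monitoring tools provide, and this is not how Chaos Monkey itself is implemented.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(services: Sequence[str], rng: random.Random) -> str:
    """Choose one service at random to disable."""
    if not services:
        raise ValueError("no services to choose from")
    return rng.choice(list(services))


def run_chaos_round(
    services: Sequence[str],
    disable: Callable[[str], None],
    is_healthy: Callable[[], bool],
    rng: Optional[random.Random] = None,
) -> dict:
    """Disable one random service, then report whether the system survived.

    `disable` and `is_healthy` are hypothetical hooks: in a real system they
    might stop a container or query a health-check endpoint.
    """
    rng = rng or random.Random()
    victim = pick_victim(services, rng)
    disable(victim)
    return {"victim": victim, "healthy": is_healthy()}


if __name__ == "__main__":
    # Hypothetical service names; replace with your own inventory.
    services = ["gateway-1", "gateway-2", "core-1", "core-2", "database-1"]
    result = run_chaos_round(
        services,
        disable=lambda name: print(f"disabling {name} (dry run)"),
        is_healthy=lambda: True,  # placeholder health check
    )
    print(result)
```

Passing the failure action and the health check in as callables keeps the random selection logic separate from the tooling, so a sketch like this can be exercised in a dry run before it is pointed at anything real.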
The idea of deliberately injecting failures was made famous by Netflix, which introduced a program called Chaos Monkey to disable parts of its system. The best way to ensure you are resilient to an outage and can recover from it is to cause one routinely, for real. Recall the options listed in the Classes of redundancy section:
Figure 11.7 – Different failover strategies at different application layers
The gateways represent systems that run in live-standby mode, with only a single server taking traffic, but a backup ready to come online. The core servers run live-live, with both taking traffic, and the database is also live-standby, but...