In the meantime, start breaking things. Not in your user-facing environment. We have the ability to spin up copies of our service landscape using Terraform. By experimenting with failure in a low-risk place, it is possible to pinpoint potential deficiencies while minimizing blast radius. Netflix pioneered the emerging science of chaos engineering on AWS—more information can be found on the Chaos Monkey GitHub page (https://netflix.github.io/chaosmonkey/).
Fault injection
Reliability testing
In our earlier work with Terraform, we have seen how it requires a plan before an apply. This test-driven approach can be applied to our services too. Turn off an instance. No problem because our load balancer shifts all traffic...