Improving the architecture
We need robust approaches to handle failed components if our architecture is to scale. But that'll only take us so far into the future—because handling the same failures over and over again doesn't scale. Eliminating the possibility of failure, where possible, does scale. Adding new components introduces new failure modes that we need to account for, and we need to offset these by eliminating old failure modes from the equation.
This is done through design; in particular, revised design. The change can be something minor, or it could be a radical shift in direction. It really depends on the frequency and severity of the failures, and on the rate of growth. Factor all of these together, and we'll come up with design trade-offs that enable us to move forward.
There are a number of techniques that can help get us there. For example, when we encounter new failure scenarios, we need a means to consistently document them. We also need to better classify our components into critical versus non-critical...
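As a rough illustration of the two techniques just mentioned, here is a minimal sketch of what a consistent failure record and a critical versus non-critical classification might look like. Everything in it is an assumption made for illustration: the names `Criticality`, `FailureRecord`, and `summarize`, the component names, and the occurrence counts are all hypothetical, and a real project would pick fields that suit its own failure scenarios.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    """Hypothetical two-way classification of components."""
    CRITICAL = "critical"          # the application cannot run without it
    NON_CRITICAL = "non-critical"  # the application can carry on without it


@dataclass(frozen=True)
class FailureRecord:
    """One documented failure scenario, always recorded in the same shape."""
    component: str
    scenario: str
    criticality: Criticality
    occurrences: int = 1


def summarize(records):
    """Return the total number of occurrences per criticality class."""
    totals = Counter()
    for record in records:
        totals[record.criticality] += record.occurrences
    return totals


# Illustrative data only; these components and counts are made up.
records = [
    FailureRecord("router", "handler throws on unknown path", Criticality.CRITICAL, 3),
    FailureRecord("tooltip", "content fails to load", Criticality.NON_CRITICAL, 12),
    FailureRecord("router", "navigation fires before init", Criticality.CRITICAL, 2),
]

totals = summarize(records)
```

Recording every scenario in the same shape is what makes the frequency and severity comparable later on, when deciding which failure modes are worth designing out.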