Our goal is within reach. We adopted schedulers (Docker Swarm in this case) that provide self-healing applied to services. We saw how Docker For AWS accomplishes a similar goal but on the infrastructure level. We used Prometheus, Alertmanager, and Jenkins to build a system that automatically adapts services to ever-changing conditions. The metrics we're storing in Prometheus are a combination of those gathered through exporters and those we added to our services through instrumentation. The only thing we're missing is self-adaptation applied to infrastructure. If we manage to build it, we'll close the circle and witness a self-sufficient system capable of running without (almost) any human intervention.
The logic behind self-adaptation applied to infrastructure is not much different from the one we used with services. We...