Mixing good and bad – tricks for wrapping bad code and making it resilient
As a rockstar SRE, I’ll be the first to admit that I’ve put my fair share of duct tape in place to hold production systems up. You may think the job of an SRE means we build perfection, only do what is absolutely right, or put together solutions that fix the root cause of issues, but you would be wrong. Our job, first and foremost, is to reduce revenue impact and protect the customer experience.
Alerting that fires actions
One of the simplest ways to provide corrective action is to build alerts that detect the issue, then call scripts or actions to remediate it. The best example of this is actually built into most container orchestrators, including Kubernetes. When the infamous liveness probe that Kubernetes runs against a container fails, Kubernetes simply kills the container and starts a new one (subject to the pod’s restart policy). In short, if it doesn’t respond when you poke it with a stick, it’s dead...
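The same poke-it-with-a-stick pattern is easy to sketch outside of Kubernetes. The example below is a minimal illustration, not a production tool: the `Watchdog` class and its `probe`, `remediate`, and `failure_threshold` parameters are hypothetical names chosen for this sketch. The threshold loosely mirrors the idea behind the `failureThreshold` setting on a Kubernetes probe, so that a single blip does not trigger a restart.

```python
from typing import Callable


class Watchdog:
    """Run a remediation action after a probe fails several times in a row.

    All names here are illustrative assumptions, not part of any real library.
    """

    def __init__(
        self,
        probe: Callable[[], bool],
        remediate: Callable[[], None],
        failure_threshold: int = 3,
    ) -> None:
        self.probe = probe                  # returns True when the target is healthy
        self.remediate = remediate          # e.g. restart a service, clear a cache
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def tick(self) -> bool:
        """Run one check; return True if remediation was triggered."""
        try:
            healthy = self.probe()
        except Exception:
            healthy = False  # a probe that raises counts as a failed check

        if healthy:
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.remediate()
            self.consecutive_failures = 0  # start counting again after the fix
            return True
        return False
```

In practice `tick()` would be called on a timer or from whatever alerting hook you already have, with `probe` wrapping an HTTP or TCP check and `remediate` wrapping the restart script. Keeping both as plain callables makes the duct tape easy to test before it goes anywhere near production.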