Eliminating toil
Site reliability engineering disciplines fill the systems management gaps left by the growing complexity of solutions in hybrid multi-cloud infrastructure environments. Complexity inherently hinders the scalability and reliability of systems by adding unnecessary burdens to every operation. One of the fundamental purposes of SREs is to keep things simple by eliminating repetitive tasks. To understand how SREs accomplish this mission, we’ll divide this section into three parts:
- Toil redefined
- Why toil is bad
- Handling toil the right way
Next, we’ll redefine what toil is in the site reliability engineering context.
Toil redefined
Google defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” For a long time, we used this definition to target...