Operational Framework – Managing Infrastructure and Systems
There is some confusion about the operational nature of site reliability engineering. For instance, we hear that site reliability engineers (SREs) work exclusively on automating toil, or that they only manage the available observability platforms. Neither statement can be true, as both defeat the very reason we need SREs. SREs need to do operational work to handle system weaknesses, single points of failure, technical debt, performance issues, and risks. Furthermore, it is through operational work that they get to know these issues well enough to fix them. Gene Brown, a distinguished engineer and global site reliability engineering leader at Kyndryl, once said that “SREs need to do operational work so they can get frustrated enough by the toil they face and automate such.”
The following diagram depicts the types of work SREs do on a daily basis. They undertake operational and engineering work...