An overview of the daily activities of an SRE
Now that we have examined SRE responsibilities, it’s time to look at what you, as an SRE, will be doing on a regular basis. There’s no better way to understand a profession than by asking what someone in it actually does. When you go to a job interview, you probably want to know which activities a person in that position will carry out. SREs typically have a long list of assignments, much like a wall of sticky notes around their displays. We have separated the notable activities into two sections:
- Reactive work activities
- Proactive work activities
We’ll start by understanding reactive activities.
Reactive work activities
SREs execute many tasks that don’t directly raise system reliability; these are usually operational types of work. Nevertheless, those activities either reduce service downtime or mitigate risks. Examples of jobs that SREs perform daily in this category are as follows:
- Repair or restore a system or multiple services to their original state
- Follow and execute instructions from a runbook (standard operating procedure) during an incident to diagnose the application
- Implement a change request to apply a patch to a software component
- Attend a meeting to run a postmortem with system administrators and developers about the recent service or system outage
- Install a new Kubernetes cluster for a new application according to the development team’s specifications and enable monitoring of it
- Configure a new cloud-based service for a new application following the architecture design and include it in cloud monitoring
- Deploy a new software release to VMs and execute the testing scripts
Proactive work activities
SREs also carry out jobs that improve the quality, scalability, observability, manageability, resiliency, or availability of a system or service. Since those tasks increase the reliability levels of specific systems or services, they are considered proactive and are mostly engineering work. Such assignments typically reduce toil and technical debt. Examples of this category are as follows:
- Maintain a runbook on how to diagnose problems with a specific application
- Design and develop automation that executes procedures previously documented in a runbook
- Establish, together with the DevOps team, the release strategy, such as a canary release, A/B testing, or blue-green deployment
- Work with the SWEs to add management code to the application so SREs can instruct the application to perform self-administration or self-healing operations
- Work with the development team to adopt an immutable infrastructure philosophy into the application-building process
- Instrument the application code to increase its observability with logs and traces
- Design and implement observability to obtain good metrics, events, logs, and traces from a critical application
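To make the runbook automation item above more concrete, here is a minimal sketch in Python. It assumes that a runbook can be modeled as an ordered list of steps, each with a health check and a remediation action. The names `RunbookStep`, `run_runbook`, and the in-memory `service` dictionary are hypothetical and exist only for illustration; a real automation would call monitoring and orchestration tools instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunbookStep:
    """One documented runbook step (hypothetical model for illustration)."""
    name: str
    check: Callable[[], bool]      # returns True when this aspect is healthy
    remediate: Callable[[], None]  # action to run when the check fails


def run_runbook(steps: List[RunbookStep]) -> List[str]:
    """Run each step in order and return a human-readable log."""
    log = []
    for step in steps:
        if step.check():
            log.append(f"{step.name}: ok")
            continue
        # The check failed: apply the documented fix, then verify it worked.
        step.remediate()
        status = "remediated" if step.check() else "escalate to on-call"
        log.append(f"{step.name}: {status}")
    return log


# Assumption: an in-memory dictionary stands in for a real system's state.
service = {"disk_free_pct": 5, "process_running": True}


def free_disk() -> None:
    # Placeholder for a real cleanup action, such as rotating logs.
    service["disk_free_pct"] = 40


steps = [
    RunbookStep("process is running",
                lambda: service["process_running"],
                lambda: None),
    RunbookStep("disk has free space",
                lambda: service["disk_free_pct"] > 10,
                free_disk),
]

for line in run_runbook(steps):
    print(line)
```

Each step re-runs its check after remediation, so the automation only reports success when the fix is verified; anything it cannot fix is flagged for a human, which mirrors how an SRE would follow the same runbook by hand.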
Note
Site reliability engineers perform many more activities than the ones listed here. This is not a comprehensive list; the only intention is to show you how SREs work across multiple dimensions and aspects of systems and services.
We have listed what an SRE does frequently to give you a good sense of their day-to-day activities and how they differ from those of other roles. Again, this is not a complete or closed list. We want to close this chapter by telling you who our SRE rockstars are.