Operational excellence
As we look at the operational excellence pillar, especially in the context of DevOps, this is one of the most important service pillars, if not the most important, for your day-to-day responsibilities. We will start by thinking about how our teams are organized; after all, the DevOps movement came about from breaking down silos between Development and Operations teams.
Question – How does your team determine what its priorities are?
* Does it talk to customers (whether they're internal or external)?
* Does it get its direction from product owners who have drawn out a roadmap?
Amazon outlines five design principles that incorporate operational excellence in the cloud:
- Performing Operations as Code
- Refining operations frequently
- Making small, frequent, and reversible changes
- Anticipating failure
- Learning from all operational failures
Let's take a look at each of these operational design principles in detail to see how they relate to your world as a DevOps engineer. As you go through the design principles of not only this pillar but all the service pillars, you will find that the best practices are spelled out, along with different services, to help you complete the objective.
Performing Operations as Code
With the advent of Infrastructure as Code, the cloud allows teams to create their applications using code alone, without the need to interact with a graphical interface. Moreover, it allows the underlying networking, services, datastores, and anything else required to run your applications and workloads to be defined in code as well. Moving most, if not all, of the operations to code does quite a few things for a team:
- Distributes knowledge quickly and avoids the situation where only one person on the team is able to perform an operation
- Allows for a peer review of the environment to be conducted, along with quick iterations
- Allows changes and improvements to be tested quickly, without the production environment being disrupted
In AWS, you can perform Operations as Code using a few different services and tools, such as CloudFormation, the Cloud Development Kit (CDK), language-specific software development kits (SDKs), or the command-line interface (CLI).
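As a minimal sketch of Operations as Code, the following Python builds a small CloudFormation template as plain data and renders it as JSON. The logical ID `ArtifactBucket` and the choice of a versioned S3 bucket are illustrative assumptions, not something the pillar prescribes.

```python
import json


def build_template():
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: one versioned S3 bucket.",
        "Resources": {
            # "ArtifactBucket" is a hypothetical logical ID chosen for this sketch.
            "ArtifactBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"}
                },
            }
        },
    }


def render_template(template):
    """Serialize the template so it can be committed and peer reviewed."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_template(build_template()))
```

Because the rendered template is an ordinary file, it can live in version control, go through peer review, and be deployed to a test stack before production, for example with `aws cloudformation deploy --template-file template.json --stack-name <your-stack-name>`.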
Refining operations frequently
As you run your workload in the cloud, you should be in a continual improvement process for not only your application and infrastructure but also your methods of operation. Teams that run in an agile process are familiar with having a retrospective meeting after each sprint to ask three questions: what went well, what didn't go well, and what has room for improvement?
Operating a workload in the cloud presents the same opportunities to reflect and to ask those same three questions. It doesn't have to be after a sprint, but it should occur after events such as the following:
- Automated, manual, or hybrid deployments
- Automated, manual, or hybrid testing
- A production issue
- A game day simulation
After each of these situations, you should be able to look at your current operational setup and see what could be better. If you have step-by-step runbooks that have been created for incidents or deployments, ask yourself and your team whether there were any missing steps or steps that are no longer needed. If you had a production issue, did you have the correct monitoring in place to troubleshoot that issue?
Making small, frequent, and reversible changes
As we build and move workloads into the cloud, instead of placing multiple systems on a single server, the best design practice is to break any large monolithic design into smaller, decoupled pieces. With the pieces being smaller, decoupled, and more manageable, you can work with smaller changes that are easier to reverse, should a problem arise.
The ability to reverse changes can also come in the form of good coding practices. AWS CodeCommit allows Git tags in code repositories. By tagging each release once it has been deployed, you can quickly redeploy a previous version of your working code, should a problem arise in the code base. Lambda has a similar feature called versions.
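The Lambda versions mentioned above can be combined with an alias to make a release reversible. The sketch below assumes a boto3-style Lambda client is passed in and that an alias (here hypothetically named `live`) already exists on the function; the helper names `release` and `roll_back` are illustrative.

```python
def release(lambda_client, function_name, alias="live"):
    """Publish an immutable version of the function and point the alias at it.

    Assumes the alias already exists; update_alias does not create it.
    """
    published = lambda_client.publish_version(FunctionName=function_name)
    new_version = published["Version"]
    lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=new_version,
    )
    return new_version


def roll_back(lambda_client, function_name, previous_version, alias="live"):
    """Reverse a release by pointing the alias back at an earlier version."""
    lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=previous_version,
    )
    return previous_version
```

With boto3 installed, `lambda_client` would typically be `boto3.client("lambda")`. Because callers invoke the alias rather than a specific version, rolling back is a single alias update and needs no redeployment of code.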
Anticipating failure
Don't assume that you no longer need to worry about failures just because you are moving to the cloud and the service that your application relies on is labeled as a managed service. Failures happen, maybe not often; however, when running a business, any sort of downtime can translate into lost revenue. Having a plan to mitigate risks (and testing that plan) can genuinely mean the difference between keeping your service-level agreement (SLA) and having to apologize or, even worse, having to give customers credits or refunds.
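To make the SLA stakes concrete, the following sketch converts an availability target into a downtime budget. The 99.9% target and 30-day period used in the usage note are illustrative assumptions, not figures from any particular AWS SLA.

```python
def allowed_downtime_minutes(availability_target, days=30):
    """Return the minutes of downtime a target allows over the given period.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be in the range (0, 1]")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_target)


def sla_met(observed_downtime_minutes, availability_target, days=30):
    """Return True if the observed downtime fits within the budget."""
    budget = allowed_downtime_minutes(availability_target, days)
    return observed_downtime_minutes <= budget
```

For example, a 99.9% target over 30 days leaves a budget of roughly 43.2 minutes, so `sla_met(50, 0.999)` is false while `sla_met(30, 0.999)` is true. A single untested failover that takes an hour would consume the whole month's budget.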
Learning from all operational failures
Things fail from time to time, but when they do, it's important not to dwell on the failures. Instead, perform post-mortem analysis and find the lessons that can make the team and the workloads stronger and more resilient for the future. Sharing learning across teams helps bring everyone's perspective into focus. One of the main questions that should be asked and answered after a failure is, "Could the issue be resolved with automatic remediation?"
One of the significant issues in larger organizations today is that in their quest to be great, they stop being good. Sometimes, you need to be good at the things you do, especially on a daily basis. It can be a stepping stone to greatness. However, the eternal quest for excellence, without a retrospective look at what is preventing you from becoming good, can sometimes be an exercise in spinning your wheels and not gaining traction.
Example – operational excellence
Let's take a look at the following relevant example, which shows the implementation of automated patching for the instances in an environment:
If you have instances in your environment that you are self-managing and that need to be kept up to date with patches, then you can use Systems Manager Patch Manager to help automate the task of keeping your operating systems up to date. This can be done on a regular basis using a Systems Manager maintenance window task.
The initial step would be to make sure that the SSM Agent (SSM comes from Simple Systems Manager, the former name of the service) is installed on the machines that you want to stay up to date with patching.
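One way to check this first step is to ask Systems Manager which instances it currently sees. The sketch below assumes a boto3-style SSM client is passed in; the helper name `unmanaged_instances` is hypothetical.

```python
def unmanaged_instances(ssm_client, instance_ids):
    """Return the IDs that Systems Manager does not report as online.

    An instance missing from the result of DescribeInstanceInformation
    usually has no running agent, no network path to the service, or
    no suitable IAM role.
    """
    online = set()
    paginator = ssm_client.get_paginator("describe_instance_information")
    for page in paginator.paginate():
        for info in page["InstanceInformationList"]:
            if info.get("PingStatus") == "Online":
                online.add(info["InstanceId"])
    return sorted(set(instance_ids) - online)
```

With boto3 installed, `ssm_client` would typically be `boto3.client("ssm")`. Any ID returned by the function needs attention before Patch Manager can reach that instance.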
Next, you would create a patching baseline, which includes rules for auto-approving patches within days of their release, as well as a list of both approved and rejected patches.
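A patch baseline of this kind could be expressed as a request to the Systems Manager `CreatePatchBaseline` API. The sketch below builds that request as plain data; the operating system (Amazon Linux 2), the seven-day approval delay, and the classification and severity filters are all illustrative assumptions, as are the helper names.

```python
def build_patch_baseline_request(name, approve_after_days=7,
                                 approved_patches=(), rejected_patches=()):
    """Build keyword arguments for the SSM create_patch_baseline call."""
    return {
        "Name": name,
        # Assumed operating system for this sketch.
        "OperatingSystem": "AMAZON_LINUX_2",
        "ApprovalRules": {
            "PatchRules": [
                {
                    "PatchFilterGroup": {
                        "PatchFilters": [
                            {"Key": "CLASSIFICATION", "Values": ["Security"]},
                            {"Key": "SEVERITY",
                             "Values": ["Critical", "Important"]},
                        ]
                    },
                    # Auto-approve matching patches this many days after release.
                    "ApproveAfterDays": approve_after_days,
                }
            ]
        },
        # Explicit allow and deny lists, independent of the rule above.
        "ApprovedPatches": list(approved_patches),
        "RejectedPatches": list(rejected_patches),
    }


def create_patch_baseline(ssm_client, request):
    """Send the request and return the new baseline's ID."""
    return ssm_client.create_patch_baseline(**request)["BaselineId"]
```

Keeping the request as data separates the policy decision (which patches, how quickly) from the API call, so the rules can be reviewed in a pull request like any other code.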
After that, you may need to modify the IAM role attached to the instance to make sure that the instance has the correct permissions to communicate with the Systems Manager service.
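The permission step might look like the following sketch, which attaches the AWS managed policy `AmazonSSMManagedInstanceCore` to the instance's role if it is not already present. It assumes a boto3-style IAM client is passed in; the helper name `ensure_ssm_policy` is hypothetical, and your organization may prefer a narrower custom policy.

```python
SSM_CORE_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"


def ensure_ssm_policy(iam_client, role_name):
    """Attach the SSM core policy to the role; return True if it was added.

    For brevity this only inspects the first page of attached policies.
    """
    response = iam_client.list_attached_role_policies(RoleName=role_name)
    attached = response["AttachedPolicies"]
    if any(p["PolicyArn"] == SSM_CORE_POLICY_ARN for p in attached):
        return False  # Nothing to do; the role already has the policy.
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn=SSM_CORE_POLICY_ARN,
    )
    return True
```

Checking before attaching keeps the function safe to run repeatedly, which is useful when the same script is applied to many roles.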
Optionally, you can set up patch groups. In the preceding diagram, we can see that we have two different types of servers, and they are both running on the same operating system. However, they serve different functions, so we would want to set up one patch group for the Linux servers and one for the database servers. The database servers may only get critical patches, whereas the Linux servers may get the critical patches as well as the update patches.
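Patch groups are driven by an instance tag, so the grouping can be sketched as two calls: tag the instances, then register a baseline for that group. The sketch below assumes boto3-style EC2 and SSM clients are passed in and that a baseline already exists; the helper name and the group names in the usage note are hypothetical.

```python
def assign_patch_group(ec2_client, ssm_client, instance_ids,
                       patch_group, baseline_id):
    """Tag instances into a patch group and bind a baseline to that group."""
    # Patch Manager identifies group members by the "Patch Group" tag key.
    ec2_client.create_tags(
        Resources=list(instance_ids),
        Tags=[{"Key": "Patch Group", "Value": patch_group}],
    )
    # Instances in this group will be evaluated against this baseline.
    ssm_client.register_patch_baseline_for_patch_group(
        BaselineId=baseline_id,
        PatchGroup=patch_group,
    )
    return patch_group
```

For the scenario above, this would be called twice: once for the general Linux servers with a broader baseline, and once for the database servers with a critical-only baseline, for example with group names such as `linux-servers` and `database-servers`.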