You're reading from AWS Certified DevOps Engineer - Professional Certification and Beyond Pass the DOP-C01 exam and prepare for the real world using case studies and real-life examples

Product type Paperback

Published in Nov 2021

Publisher Packt

ISBN-13 9781801074452

Length 638 pages

Edition 1st Edition

Tools

AWS

Concepts

DevOps

Author (1):

Adam Book


Operational excellence

As we look at the operational excellence pillar, especially in the context of DevOps, this is one of the most important pillars, if not the most important, for your day-to-day responsibilities. We will start by thinking about how our teams are organized; after all, the DevOps movement came about from breaking down silos between Development and Operations teams.

Question – How does your team determine what its priorities are?

* Does it talk to customers (whether they're internal or external)?

* Does it get its direction from product owners who have drawn out a roadmap?

Amazon outlines five design principles that incorporate operational excellence in the cloud:

* Performing Operations as Code
* Refining operations frequently
* Making small, frequent, and reversible changes
* Anticipating failure
* Learning from all operational failures

Let's take a look at each of these operational design principles in detail to see how they relate to your world as a DevOps engineer. As you go through the design principles of not only this pillar but all the service pillars, you will find that the best practices are spelled out, along with the different services that help you complete the objective.

Performing Operations as Code

With the advent of Infrastructure as Code, the cloud allows teams to create their applications using code alone, without the need to interact with a graphical interface. Moreover, it allows any of the underlying networking, services, datastores, and more that are required to run your applications and workloads to be created in the same way. Moving most, if not all, of the operations to code does quite a few things for a team:

* Distributes knowledge quickly and prevents only one person on the team from being able to perform an operation
* Allows for a peer review of the environment to be conducted, along with quick iterations
* Allows changes and improvements to be tested quickly, without the production environment being disrupted

In AWS, you can perform Operations as Code using a few different tools, such as CloudFormation, the Cloud Development Kit (CDK), language-specific software development kits (SDKs), or the command-line interface (CLI).
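As a rough illustration of what Operations as Code can look like, the following sketch builds a minimal CloudFormation template as a Python dictionary and renders it as JSON, so the infrastructure definition can be peer reviewed and versioned like any other code. The logical ID `ArtifactBucket` and the helper names `build_template` and `render` are assumptions made for this example; they do not come from the text.

```python
import json


def build_template(bucket_logical_id: str = "ArtifactBucket") -> dict:
    """Return a minimal CloudFormation template as a plain dictionary.

    The logical ID is a hypothetical name chosen for this sketch.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: one versioned S3 bucket",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Versioning keeps earlier object versions recoverable.
                    "VersioningConfiguration": {"Status": "Enabled"}
                },
            }
        },
    }


def render(template: dict) -> str:
    """Serialize the template to JSON text that can be committed and reviewed."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_template()))
```

The rendered JSON could then be stored in a repository and deployed with CloudFormation, which is what makes the change reviewable and repeatable.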

Refining operations frequently

As you run your workload in the cloud, you should be in a continual improvement process for not only your application and infrastructure but also your methods of operation. Teams that run in an agile process are familiar with having a retrospective meeting after each sprint to ask three questions: what went well, what didn't go well, and what has room for improvement?

Operating a workload in the cloud presents the same opportunities for retrospection and to ask those same three questions. It doesn't have to be after a sprint, but it should occur after events such as the following:

* Automated, manual, or hybrid deployments
* Automated, manual, or hybrid testing
* Production issues
* Game day simulations

After each of these situations, you should be able to look at your current operational setup and see what could be better. If you have step-by-step runbooks that have been created for incidents or deployments, ask yourself and your team whether there were any missing steps or steps that are no longer needed. If you had a production issue, did you have the correct monitoring in place to troubleshoot that issue?

Making small, frequent, and reversible changes

As we build and move workloads into the cloud, instead of placing multiple systems on a single server, the best design practice is to break any large monolithic designs into smaller, decoupled pieces. With the pieces being smaller, decoupled, and more manageable, you can work with smaller changes that are easier to reverse, should a problem arise.

The ability to reverse changes can also come in the form of good coding practices. AWS CodeCommit allows Git tags in code repositories. By tagging each release once it has been deployed, you can quickly redeploy a previous version of your working code, should a problem arise in the code base. Lambda has a similar feature called versions.
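To make the tagging idea concrete, here is a small sketch of how a rollback target might be chosen from a list of release tags. It assumes tags follow a `vMAJOR.MINOR.PATCH` naming scheme, which is an assumption of this example and not something the text prescribes; the function names are hypothetical as well.

```python
def parse_version(tag: str) -> tuple:
    """Turn a tag such as 'v1.10.0' into (1, 10, 0) so tags sort numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def previous_release(tags: list, current: str) -> str:
    """Return the release tag that immediately precedes `current`."""
    ordered = sorted(tags, key=parse_version)
    index = ordered.index(current)  # raises ValueError if the tag is unknown
    if index == 0:
        raise ValueError(f"{current} is the oldest release; nothing to roll back to")
    return ordered[index - 1]


if __name__ == "__main__":
    release_tags = ["v1.0.0", "v1.10.0", "v1.2.0"]
    print(previous_release(release_tags, "v1.10.0"))
```

Sorting on the parsed tuple rather than on the raw string matters here, because a plain string sort would place `v1.10.0` before `v1.2.0`.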

Anticipating failure

Don't assume that just because you are moving to the cloud, and the service that your application relies on is labeled as a managed service, you no longer need to worry about failures. Failures happen, maybe not often; however, when running a business, any sort of downtime can translate into lost revenue. Having a plan to mitigate risks (and also testing that plan) can genuinely mean the difference between keeping your service-level agreement (SLA) and having to apologize or, even worse, having to give customers credits or refunds.
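One way to reason about what an SLA leaves room for is to convert the availability percentage into a downtime budget. The sketch below does that arithmetic; the 99.9% figure and the 30-day period are illustrative assumptions, not values taken from the text or from any particular AWS service agreement.

```python
def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime a given availability target allows."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in the range (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    # Hypothetical target: 99.9% availability over a 30-day period,
    # which works out to roughly 43.2 minutes of allowed downtime.
    print(round(downtime_budget_minutes(99.9), 1))
```

Seen this way, a single untested failover that takes an hour would already exceed the monthly budget for such a target.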

Learning from all operational failures

Things fail from time to time, but when they do, it's important not to dwell on the failures. Instead, perform a post-mortem analysis and find the lessons that can make the team and the workloads stronger and more resilient for the future. Sharing what was learned across teams helps bring everyone's perspective into focus. One of the main questions that should be asked and answered after a failure is this: could the issue be resolved with automatic remediation?

One of the significant issues in larger organizations today is that in their quest to be great, they stop being good. Sometimes, you need to be good at the things you do, especially on a daily basis. It can be a stepping stone to greatness. However, the eternal quest for excellence, without a retrospective look at what is preventing you from becoming good, can sometimes be an exercise in spinning your wheels and not gaining traction.

Example – operational excellence

Let's take a look at the following relevant example, which shows the implementation of automated patching for the instances in an environment:

Figure 1.1 – Operational excellence – automated patching groups

If you have instances in your environment that you manage yourself and that need to be kept patched, then you can use Systems Manager Patch Manager to help automate the task of keeping your operating systems up to date. This can be done on a regular basis using a Systems Manager maintenance window task.

The initial step would be to make sure that the SSM Agent is installed on the machines that you want to keep up to date with patching (SSM comes from the service's former name, Simple Systems Manager).

Next, you would create a patch baseline, which includes rules for auto-approving patches within a set number of days of their release, as well as a list of both approved and rejected patches.
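The approval logic of a baseline can be pictured with a simplified model like the one below. This is only a sketch of the idea and not the Patch Manager API: the `PatchBaseline` class, its field names, and the patch IDs are hypothetical, and the choice to let the rejected list override everything else is an assumption of this example.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class PatchBaseline:
    """Simplified, hypothetical model of a patch baseline."""
    auto_approve_after_days: int
    approved: set = field(default_factory=set)
    rejected: set = field(default_factory=set)

    def is_approved(self, patch_id: str, released: date, today: date) -> bool:
        # Assumption in this sketch: an explicit rejection always wins.
        if patch_id in self.rejected:
            return False
        # Explicitly approved patches skip the waiting period.
        if patch_id in self.approved:
            return True
        # Otherwise the patch is auto-approved once it is old enough.
        return today - released >= timedelta(days=self.auto_approve_after_days)


if __name__ == "__main__":
    baseline = PatchBaseline(auto_approve_after_days=7, rejected={"patch-bad"})
    today = date(2021, 11, 15)
    print(baseline.is_approved("patch-new", date(2021, 11, 12), today))
    print(baseline.is_approved("patch-old", date(2021, 11, 1), today))
```

Delaying auto-approval by a few days gives a team time to spot a problematic patch and add it to the rejected list before it reaches any instance.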

After that, you may need to modify the IAM role on the instance to make sure that the SSM service has the correct permissions.

Optionally, you can set up patch groups. In the preceding diagram, we can see that we have two different types of servers, both running the same operating system. However, they perform different functions, so we would want to set up one patch group for the Linux servers and one for the database servers. The database servers may only get critical patches, whereas the Linux servers may get both the critical patches and the update patches.
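The split between the two groups can be sketched as a simple mapping from patch group to the kinds of patches it accepts. The group names, the `critical` and `update` labels, and the patch IDs below are all hypothetical names chosen for this illustration; they are not values defined by Patch Manager.

```python
# Hypothetical patch groups and the patch kinds each one accepts.
PATCH_GROUPS = {
    "linux-servers": {"critical", "update"},
    "database-servers": {"critical"},
}


def patches_for_group(group: str, patches: list) -> list:
    """Return the IDs of the patches that the given patch group should receive."""
    allowed_kinds = PATCH_GROUPS[group]  # raises KeyError for an unknown group
    return [patch["id"] for patch in patches if patch["kind"] in allowed_kinds]


if __name__ == "__main__":
    available = [
        {"id": "patch-001", "kind": "critical"},
        {"id": "patch-002", "kind": "update"},
    ]
    for group_name in PATCH_GROUPS:
        print(group_name, patches_for_group(group_name, available))
```

Keeping the database group on a narrower set of patches reduces how often those servers change, which is the trade-off the example describes.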