Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
AWS Certified DevOps Engineer - Professional Certification and Beyond

You're reading from   AWS Certified DevOps Engineer - Professional Certification and Beyond Pass the DOP-C01 exam and prepare for the real world using case studies and real-life examples

Arrow left icon
Product type Paperback
Published in Nov 2021
Publisher Packt
ISBN-13 9781801074452
Length 638 pages
Edition 1st Edition
Tools
Concepts
Arrow right icon
Author (1):
Arrow left icon
Adam Book Adam Book
Author Profile Icon Adam Book
Adam Book
Arrow right icon
View More author details
Toc

Table of Contents (31) Chapters Close

Preface 1. Section 1: Establishing the Fundamentals
2. Chapter 1: Amazon Web Service Pillars FREE CHAPTER 3. Chapter 2: Fundamental AWS Services 4. Chapter 3: Identity and Access Management and Working with Secrets in AWS 5. Chapter 4: Amazon S3 Blob Storage 6. Chapter 5: Amazon DynamoDB 7. Section 2: Developing, Deploying, and Using Infrastructure as Code
8. Chapter 6: Understanding CI/CD and the SDLC 9. Chapter 7: Using CloudFormation Templates to Deploy Workloads 10. Chapter 8: Creating Workloads with CodeCommit and CodeBuild 11. Chapter 9: Deploying Workloads with CodeDeploy and CodePipeline 12. Chapter 10: Using AWS Opsworks to Manage and Deploy your Application Stack 13. Chapter 11: Using Elastic Beanstalk to Deploy your Application 14. Chapter 12: Lambda Deployments and Versioning 15. Chapter 13: Blue Green Deployments 16. Section 3: Monitoring and Logging Your Environment and Workloads
17. Chapter 14: CloudWatch and X-Ray's Role in DevOps 18. Chapter 15: CloudWatch Metrics and Amazon EventBridge 19. Chapter 16: Various Logs Generated (VPC Flow Logs, Load Balancer Logs, CloudTrail Logs) 20. Chapter 17: Advanced and Enterprise Logging Scenarios 21. Section 4: Enabling Highly Available Workloads, Fault Tolerance, and Implementing Standards and Policies
22. Chapter 18: Autoscaling and Lifecycle Hooks 23. Chapter 19: Protecting Data in Flight and at Rest 24. Chapter 20: Enforcing Standards and Compliance with System Manger's Role and AWS Config 25. Chapter 21: Using Amazon Inspector to Check your Environment 26. Chapter 22: Other Policy and Standards Services to Know 27. Section 5: Exam Tips and Tricks
28. Chapter 23: Overview of the DevOps Professional Certification Test 29. Chapter 24: Practice Exam 1 30. Other Books You May Enjoy

Operational excellence

As we look at the operational excellence pillar, especially in the context of DevOps, this is one – if not the most – important service pillar for your day-to-day responsibilities. We will start by thinking about how our teams are organized; after all, the DevOps movement came about from breaking down silos between Development and Operations teams.

Question – How does your team determine what its priorities are?

* Does it talk to customers (whether they're internal or external)?

* Does it get its direction from product owners who have drawn out a roadmap?

Amazon outlines five design principles that incorporate operational excellence in the cloud:

  • Performing Operations as Code
  • Refining operations frequently
  • Making small, frequent, and reversible changes
  • Anticipating failure
  • Learning from all operational failures

Let's take a look at each of these operational design principals in detail to see how they relate to your world as a DevOps engineer. As you go through the design principles of not only this pillar but all the service pillars, you will find that the best practices are spelled out, along with different services, to help you complete the objective.

Performing Operations as Code

With the contrivance of Infrastructure as Code, the cloud allows teams to create their applications using code alone, without the need to interact with a graphical interface. Moreover, it allows any the underlying networking, services, datastores, and more that's required to run your applications and workloads. Moving most, if not all, the operations to code does quite a few things for a team:

  • Distributes knowledge quickly and prevents only one person on the team from being able to perform an operation
  • Allows for a peer review of the environment to be conducted, along with quick iterations
  • Allows changes and improvements to be tested quickly, without the production environment being disrupted

In AWS, you can perform Operations as Code using a few different services, such as CloudFormation, the Cloud Development Kit (CDK), language-specific software development kits (SDK), or by using the command-line interface (CLI).

Refining operations frequently

As you run your workload in the cloud, you should be in a continual improvement process for not only your application and infrastructure but also your methods of operation. Teams that run in an agile process are familiar with having a retrospective meeting after each sprint to ask three questions: what went well, what didn't go well, and what has room for improvement?

Operating a workload in the cloud presents the same opportunities for retrospection and to ask those same three questions. It doesn't have to be after a sprint, but it should occur after events such as the following:

  • Automated, manual, or hybrid deployments
  • Automated, manual, or hybrid testing
  • After a production issue
  • Running a game day simulation

After each of these situations, you should be able to look at your current operational setup and see what could be better. If you have step-by-step runbooks that have been created for incidents or deployments, ask yourself and your team whether there were any missing steps or steps that are no longer needed. If you had a production issue, did you have the correct monitoring in place to troubleshoot that issue?

Making small, frequent, and reversible changes

As we build and move workloads into the cloud, instead of placing multiple systems on a single server, the best design practices are to break any large monolith designs into smaller, decoupled pieces. With the pieces being smaller, decoupled, and more manageable, you can work with smaller changes that are more reversible, should a problem arise.

The ability to reverse changes can also come in the form of good coding practices. AWS CodeCommit allows Git tags in code repositories. By tagging each release once it has been deployed, you can quickly redeploy a previous version of your working code, should a problem arise in the code base. Lambda has a similar feature called versions.

Anticipating failure

Don't expect that just because you are moving to the cloud and the service that your application is relying on is labeled as a managed service, that you no longer need to worry about failures. Failures happen, maybe not often; however, when running a business, any sort of downtime can translate into lost revenue. Having a plan to mitigate risks (and also test that plan) can genuinely mean the difference in keeping your service-level agreement (SLA) or having to apologize or, even worse, having to give customers credits or refunds.

Learning from failure

Things fail from time to time, but when they do, it's important not to dwell on the failures. Instead, perform post-mortem analysis and find the lessons that can make the team and the workloads stronger and more resilient for the future. Sharing learning across teams helps bring everyone's perspective into focus. One of the main questions that should be asked and answered after failure is, Could the issue be resolved with automatic remediation?

One of the significant issues in larger organizations today is that in their quest of trying to be great, they stop being good. Sometimes, you need to be good at the things you do, especially on a daily basis. It can be a steppingstone to greatness. However, the eternal quest for excellence without the retrospective of what is preventing you from becoming good can sometimes be an exercise in spinning your wheels, and not gaining traction.

Example – operational excellence

Let's take a look at the following relevant example, which shows the implementation of automated patching for the instances in an environment:

Figure 1.1 – Operational excellence – automated patching groups

Figure 1.1 – Operational excellence – automated patching groups

If you have instances in your environment that you are self-managing and need to be updated with patch updates, then you can use System Manager Patch Manager to help automate the task of keeping your operating systems up to date. This can be done on a regular basis using a Systems Manager Maintenance Task.

The initial step would be to make sure that the SSM agent (formally known as Simple Systems Manager) is installed on the machines that you want to stay up to date with patching.

Next, you would create a patching baseline, which includes rules for auto-approving patches within days of their release, as well as a list of both approved and rejected patches.

After that, you may need to modify the IAM role on the instance to make sure that the SSM service has the correct permissions.

Optionally, you can set up patch management groups. In the preceding diagram, we can see that we have two different types of servers, and they are both running on the same operating system. However, they are running different functions, so we would want to set up one patching group for the Linux servers and one group for the Database servers. The Database servers may only get critical patches, whereas the Linux servers may get the critical patches as well as the update patches.

You have been reading a chapter from
AWS Certified DevOps Engineer - Professional Certification and Beyond
Published in: Nov 2021
Publisher: Packt
ISBN-13: 9781801074452
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime