So, let us examine the various principal characteristics of a DevOps environment.
What follows is a series of generally accepted definitions, invariably mixed with personal opinions - you have been warned.
Aligning effort toward increasing system performance and stability, reducing the time it takes to deploy, or improving the overall quality of the product will result in happier customers and proud engineers.
The goal needs to be repeated, clarified, and simplified until it is fully understood, argued against, and eventually committed to by everybody.
DevOps shifts focus away from self-interest and toward that goal. It directs praise at group achievements rather than those of the individual; KPIs and Employee of the Month initiatives perhaps not so much.
Allow people to look at the bigger picture past the realm of their cubicle. Trust them.
Shared knowledge (no silos)
The chances are you have already heard stories or read books about the notorious organizational silos.
In the worst case, the silo is a person who refuses to let go and often becomes the main bottleneck in the development life cycle. They can be fiercely territorial, safeguarding whatever exclusive knowledge they might have in a given field, likely (I speculate) because this provides them with a sense of importance, further catering to their ego.
On the other hand, there are also examples of people who find themselves in a silo purely due to unfortunate circumstances. My respect goes out to the many engineers stuck with supporting inherited legacy systems all by themselves.
Fortunately, DevOps blurs such borders of expertise with concepts like cross-functional teams and full-stack engineers. It is important to note here that this does not translate into an opportunity to cut costs by expecting people to be tech ninja experts at every single thing (which, in real life, equates to being average at best at each of them). Rather, as in one of those Venn diagrams, it is the cross-over between a Dev and an Ops set of skills.
Silos are avoided by encouraging knowledge sharing. Peer reviews, demo stand-ups, or shared documentation are a few ways to ensure that no task or piece of know-how is limited to a specific person. You could even adopt Pair Programming. It seems a bit heavy, but it evidently works!
Trust and shared responsibility
Should developers be given production access?
There are good reasons for maintaining strict role-based permissions; one of them is security, another is integrity. This standpoint remains valid for as long as we maintain the stereotype of the developer who is so used to working in devlocal that, to them, concepts such as passphrase-protected SSH keys or not manually editing all of the files take a back seat.
In the era of DevOps, this is no longer the case. Shared knowledge and responsibility means operations engineers can rely on their developer colleagues to follow the same code of conduct when working in critical, production environments.
Dev and Ops teams have access to the same set of tools and environments. Deployments are no longer a special task reserved for the Ops team and scheduled days in advance.
In a team with such knowledge-sharing habits, I, as an operations engineer, can be confident about my Dev colleague's ability to perform my tasks, and vice versa.
Developers participate in the on-call rota, supporting the software they produce.
This is not to be seen as an additional burden, but as a sign of trust and an opportunity to increase collaboration. Nobody is throwing code over the wall anymore. Responsibility and a sense of autonomy motivate people to do more than is expected of them.
As we spend more time talking to each other about the challenges we face and the problems we are trying to solve, our mutual respect grows.
This manifests itself in developers seeking input from the Operations team from the early stages of the software development process or in Ops tools being built to meet developers' needs.
Ops who think like Devs. Devs who think like Ops.
--John Allspaw and Paul Hammond, Velocity
A DevOps environment is built on such respect. It is a place where every opinion matters, where people can and do openly question decisions in the interest of the best solution to a problem. This is a powerful indicator of one's commitment toward the common goal I mentioned earlier.
To draw an overly simplified conclusion from A. Maslow's Theory of Motivation, you are less likely to think about poetry when hungry. In other words, a team with basic needs will be solving basic problems.
Automating routine and mundane tasks allows engineers to concentrate on the more complex, higher-value ones. Also, people get bored, cut corners, and make mistakes – computers tend not to do so.
Reproducible infrastructure
Describing infrastructure as code has the following advantages:
- It can be kept under version control
- It is easily shared with others to re-use or reproduce
- It serves as a very useful diary of what you did and how exactly you did it
- Provisioning cloud environments becomes trivial (for example, with Terraform or CloudFormation)
- It makes modern Configuration Management possible
At any rate, I suspect anybody managing more than 10 servers is already codifying infrastructure in some way or another.
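To make the idea more tangible, here is a minimal sketch in Python with boto3 and CloudFormation; the stack name, the single S3 bucket, and the choice to inline the template are purely illustrative, and it assumes AWS credentials are already configured. The point is that the whole environment definition is plain text you can review, version, and reproduce on demand.

```python
import json

import boto3  # AWS SDK for Python; assumes credentials are configured locally

# A deliberately tiny CloudFormation template: a single S3 bucket.
# In practice this would live in its own file, under version control.
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Demo stack: one S3 bucket, provisioned as code",
    "Resources": {
        "DemoBucket": {"Type": "AWS::S3::Bucket"},
    },
}


def provision(stack_name="devops-demo-stack"):
    """Create a CloudFormation stack from the template above (names are hypothetical)."""
    cfn = boto3.client("cloudformation")
    cfn.create_stack(StackName=stack_name, TemplateBody=json.dumps(TEMPLATE))
    # Block until AWS reports the stack as created, so the script behaves
    # predictably when run from a pipeline rather than by hand.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)


if __name__ == "__main__":
    provision()
```

Terraform's HCL expresses the same thing declaratively; the particular tool matters far less than the habit of keeping the definition in version control alongside the rest of your code.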
Measure All The Things!
--Actual DevOps slogan
Storage is cheap. Develop the habit of gathering copious amounts of measurements and making them easily accessible across your organization. The more visibility engineers have into the performance of their infrastructure and applications, the better their decisions will be in critical situations.
Graphs convey a great deal of information, look rather cool on big screens, and play to the human mind's proven knack for recognizing patterns.
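To give a taste of how little effort the gathering part takes, here is a small sketch using the Python prometheus_client library; the metric names and the queue_depth() helper are invented for illustration, and it assumes a Prometheus server (or anything else that can scrape HTTP) will collect and graph the numbers.

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metrics for a worker process; name them after what they measure.
JOBS_PROCESSED = Counter("jobs_processed_total", "Jobs processed since startup")
QUEUE_DEPTH = Gauge("job_queue_depth", "Jobs currently waiting in the queue")


def queue_depth():
    """Stand-in for a real queue lookup; here we simply invent a number."""
    return random.randint(0, 50)


if __name__ == "__main__":
    # Expose the metrics at http://localhost:8000/metrics for scraping.
    start_http_server(8000)
    while True:
        QUEUE_DEPTH.set(queue_depth())
        JOBS_PROCESSED.inc()
        time.sleep(5)
```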
Another important role of metrics data is in performance optimization.
The trickiest part of speeding up a program is not doing it, but deciding whether it's worth doing at all... Part of the problem is that optimization is hard to do well. It's frighteningly easy to devolve into superstitious ritual and rationalization.
--Mature Optimization, Carlos Bueno
To avoid falling prey to confirmation bias, you need an objective method of assessing your systems before and after attempting any optimization. Use those metrics; it is hard to argue with (valid) data.
On the subject of validity, please do calibrate your instruments regularly, sanity-check output and make sure what you think you are showing is what your colleagues think they are seeing (ref: https://mars.jpl.nasa.gov/msp98/news/mco990930.html).
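The same discipline scales all the way down to a single function: measure before, measure after, with the same method, and let the numbers decide. Here is a minimal sketch with Python's standard timeit module, where both functions are placeholders for your own current and proposed code paths:

```python
import timeit

DATA = list(range(10_000, 0, -1))


def current_implementation(data):
    # Placeholder for the code path you suspect is slow.
    return sorted(data)


def optimized_candidate(data):
    # Placeholder for the proposed optimization.
    return sorted(data, reverse=True)[::-1]


def measure(func):
    """Best of several repeats, to reduce noise from the rest of the system."""
    return min(timeit.repeat(lambda: func(DATA), number=100, repeat=5))


if __name__ == "__main__":
    before = measure(current_implementation)
    after = measure(optimized_candidate)
    print(f"before: {before:.4f}s  after: {after:.4f}s  speed-up: {before / after:.2f}x")
```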
Continuous Integration, Delivery, and Deployment
The Observe, Orient, Decide, and Act (OODA) loop is a concept developed by Col. J. Boyd that shows the value in one's ability to adapt to ever-changing circumstances.
Faced with unforgiving (and productive) competition, organizations should be able to rapidly react to dynamic market conditions.
This is probably best illustrated with the old Kodak and Netflix tales. The former, after having been wildly successful, is said to have failed to adapt to the new trends in its sector, causing the brand to gradually fade away. In contrast, Netflix keeps on skillfully molding their product to match the new ways in which we consume digital content. They completely transformed their infrastructure and shared some clever, new, and somewhat controversial practices, plus a ton of great software. Be like Netflix.
Continuous Integration and Delivery is essentially OODA in practice. Teams continuously integrate relatively small code changes and deliver releases more often, thus getting feedback from their users much quicker: precisely the type of feedback an organization needs to respond adequately to an ever-changing market.
None of the preceding suggests, however, that one should aim to become a release hero, rushing things into production and setting it on fire twice a week. A CI/CD framework still implies the usual strict code review and test processes, regardless of how often you deploy. Code reviews and testing do, however, require much less time and effort, as the more frequent the deployments, the smaller the code changes tend to be.
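For illustration only, the gate itself can be as small as the sketch below. The deploy.sh script is a hypothetical stand-in for your delivery tooling, and most teams would express this in their CI system's own pipeline syntax rather than a standalone script; the shape, however, stays the same: no green tests, no deployment.

```python
import subprocess
import sys


def main():
    # Run the test suite; any failure aborts the release.
    tests = subprocess.run(["pytest", "-q"])
    if tests.returncode != 0:
        print("Tests failed; refusing to deploy.", file=sys.stderr)
        return tests.returncode

    # Hypothetical deployment step; substitute your actual CD tooling here.
    deploy = subprocess.run(["./deploy.sh", "production"])
    return deploy.returncode


if __name__ == "__main__":
    sys.exit(main())
```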
Naturally, more experimentation is likely to increase the probability of error.
I doubt this fact comes as a surprise to anybody; what might surprise you, however, is the advice to accept an additional, positive angle to failure.
Recall the video nerds from the previous section. Well, they didn't exactly breeze through all that change without casualties. I hereby spare you the Edison quotes; however, trial and error is indeed a valid form of the scientific method, and the DevOps processes serve as a great enabler to those who would agree.
In other words, an organization should encourage people to keep on challenging and improving the current state of affairs while also allowing them to openly talk about the times when things went wrong.
But dealing with experimentation failures is possibly the more romantic side of the story compared to the cold, harsh reality of day-to-day operations.
Systems fail. I would like to think most of us have come to accept that fact along with the chain of thought it provokes:
- We do not always know as much as we think we do:
| "Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case... After an accident, practitioner actions may be regarded as errors or violations, but these evaluations are heavily biased by hindsight and ignore the other driving forces..." | |
| --How Complex Systems Fail, R.I. Cook |
Excelsior! Or how, in our long-standing pursuit of social dominance, we seem to have developed the convenient belief that, following an event, we not only know exactly what happened and how, but also why. This peculiar phenomenon has already been explained rather well by D. Kahneman in Thinking, Fast and Slow; I will just add that indeed one often hears of overconfident characters who point fingers at their colleagues based on what appears to them as a coherent storyline.
The truth of the matter is this: we were not there. And keeping the details we now know and those known at the time separated is not an easy task.
- Blaming is of zero value:
Etsy and the like in our community have shared enough observations to suggest that negative reinforcement as a strategy for reducing human error is less than optimal.
With the adoption of DevOps, we accept that people generally come to work every day with the intention to perform to the best of their abilities and in the interest of the organization. After an outage, we begin our analysis with the assumption that the operator has acted in the best possible way given the circumstances and information available to them at the time. We focus on what could have led to them making the given decisions, their thought process, the flow of events, and whether any of these can be improved.
- Resilience can be accumulated:
"What does not kill us..."mithridatism or Nassim Taleb's concept of antifragility are all expressive of the idea that we get better at dealing with negative experiences as we encounter them, and what's more, we should look for them every now and again.
We can train ourselves and our systems to recover from errors faster or even better to continue operating despite them. One way to achieve this is with controlled (and with practice, less controlled) outages.
With the right monitoring and auditing tools in place, every abnormal activity offers us a more intimate view of our applications and infrastructure.
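As promised, here is a toy sketch of a controlled outage in the chaos-engineering spirit, written with boto3. The chaos=optin tag, the region, and the idea of stopping a single EC2 instance are assumptions made purely for illustration; nothing like this should run anywhere whose monitoring and recovery you do not already trust.

```python
import random

import boto3  # assumes AWS credentials and suitably restrictive IAM permissions


def opted_in_instances(ec2):
    """Return the IDs of running instances explicitly tagged for chaos testing."""
    reply = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["optin"]},  # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in reply["Reservations"]
        for instance in reservation["Instances"]
    ]


if __name__ == "__main__":
    ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an assumption
    candidates = opted_in_instances(ec2)
    if not candidates:
        print("Nothing has opted in to break today.")
    else:
        victim = random.choice(candidates)
        print(f"Stopping {victim}; watch the dashboards.")
        ec2.stop_instances(InstanceIds=[victim])
```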
Now that I have bestowed upon you, my dear reader, the secret to a better life through DevOps, let us concern ourselves with the latter part of the title of this chapter.