A few years ago, I worked at a company where the production rollout steps were written in a Microsoft Word document, command by command, along with an explanation of each:
- Copy this file there: `cp a.tar b.tar`
- Restart the server xyz with the command: `sudo service my-server restart`
These commands were part of a long list of actions required to release a new version. This happened because it was a fairly big company that had commoditized its IT department: even though its business was built on an IT product, IT was not embedded in the core of the business.
As you can see, this is a very risky situation. Even though the developer who created the version and the deployment document was present, someone else was deploying a new WAR (a Java web application packaged into a single file) onto a production machine, following the instructions blindly. I remember asking one day: if this person is executing the commands without questioning them, why don't we just write a script and run that in production? It was too risky, they said.
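For what it's worth, such a script could have started as simply as the sketch below. It only mirrors the two example commands from the Word document, so the file names and the service name are hypothetical; the key addition is `set -euo pipefail`, which makes Bash abort at the first failed command instead of blindly carrying on:

```bash
#!/usr/bin/env bash
# deploy.sh -- a minimal sketch of the manual rollout from the Word document.
# File names and the service name are hypothetical examples.
set -euo pipefail  # abort on the first error, unset variable, or failed pipe

# Copy the new artifact into place (the document's "cp a.tar b.tar" step)
cp a.tar b.tar

# Restart the server so it picks up the new version
sudo service my-server restart

echo "Deployment finished"
```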
They were right about that: risk is something we want to reduce when deploying a new version of software that is used by a few hundred thousand people in a single day. In fairness, risk is what pushed us to deploy at 4 A.M. instead of during business hours.
The problem I see with this is that the mitigation itself (deploying at 4 A.M., when no one is buying our product) creates what we call, in IT, a single point of failure: the deployment becomes an all-or-nothing event that is massively constrained by time. At 8 A.M., traffic in the app usually went from two visits per hour to thousands per minute, with 9 A.M. being the busiest period of the day.
That said, there were only two possible outcomes of the rollout: either the new software got deployed or it didn't. This causes stress for the people involved, and the last thing you want is stressed people playing with the systems of a multi-million-dollar business.
Let's take a look at the math behind a manual deployment such as the one described earlier:
| Description | Success Rate |
| --- | --- |
| Detach server 1 from the cluster | 99.5% |
| Stop Tomcat on server 1 | 99.5% |
| Remove the old version of the app (the WAR file) | 98% |
| Copy the new version of the app (the WAR file) | 98% |
| Update properties in configuration files | 95% |
| Start Tomcat | 95% |
| Attach server 1 to the cluster | 99.5% |
This table describes the steps involved in releasing a new version of the software on a single machine. The full company system ran on several machines, so the process had to be repeated a number of times, but let's keep it simple and assume that we are rolling out to a single server.
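Scripted, those seven steps might look like the following sketch. The `cluster-ctl` command, the paths, and the service name are all hypothetical placeholders; the point is that, with `set -euo pipefail`, a failure in any step aborts the chain immediately and visibly instead of leaving the server in a half-deployed state:

```bash
#!/usr/bin/env bash
# rollout.sh -- a sketch of the seven manual steps from the preceding table.
# cluster-ctl, all paths, and the service name are hypothetical placeholders.
set -euo pipefail

NEW_WAR="/tmp/releases/myapp.war"                 # new version, staged beforehand
DEPLOYED_WAR="/var/lib/tomcat/webapps/myapp.war"
CONFIG="/etc/myapp/app.properties"

cluster-ctl detach server1                        # 1. detach server 1 from the cluster
sudo service tomcat stop                          # 2. stop Tomcat on server 1
rm -f "$DEPLOYED_WAR"                             # 3. remove the old version (the WAR file)
cp "$NEW_WAR" "$DEPLOYED_WAR"                     # 4. copy the new version (the WAR file)
sed -i 's/^app.version=.*/app.version=2.0.0/' "$CONFIG"  # 5. update configuration properties
sudo service tomcat start                         # 6. start Tomcat
cluster-ctl attach server1                        # 7. attach server 1 to the cluster
```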
Now a simple question: what is the overall failure rate in the process?
We naturally tend to think that the probability of failure in a chained process such as the preceding list of instructions equals the probability of the riskiest step in the chain: 5%. That is not true; in fact, it is a very dangerous cognitive bias. We often take very risky decisions because of a false perception of low risk.
Let's do the math and calculate the real probability of failure.
The preceding list is a chain of dependent events: we cannot execute step 6 if step 4 has failed, so the probability of the whole chain succeeding is the product of the success probabilities of the individual steps:
P(T) = P(A1) × P(A2) × … × P(An)
This leads to the following calculation:
P(T) = (99.5/100) × (99.5/100) × (98/100) × (98/100) × (95/100) × (95/100) × (99.5/100) ≈ 0.8538
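A one-liner is enough to sanity-check that arithmetic (the rates are the per-step success rates from the table above):

```bash
# Multiply the per-step success rates and print the overall
# success and failure probabilities.
awk 'BEGIN {
    n = split("0.995 0.995 0.98 0.98 0.95 0.95 0.995", rate, " ")
    p = 1
    for (i = 1; i <= n; i++) p *= rate[i]
    printf "success: %.4f  failure: %.4f\n", p, 1 - p
}'
# prints: success: 0.8538  failure: 0.1462
```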
We are going to be successful only 85.38% of the time. Translated into deployments, this means that roughly 1 out of every 7 times we wake up at 4 A.M. to release a new version of our application, we are going to have problems. But there is a bigger problem: what if we have a bug that no one noticed during the production testing that happened just after the release? The answer to this question is simple and painful: the company would need to take down the full system to roll back to the previous version, which could lead to a loss of revenue and customers.