10 p.m. on Friday was when everything started falling apart. The Q2 release party started a few hours ago with the entire operations team and a few members of the infrastructure and network team in attendance. Routing customer traffic away from the initial test server to allow for the upgrade went as expected. This process was recently automated through some network management scripts that the systems administration and network team worked on. The idea was that all new traffic should be routed away from the initial server while allowing the customer sessions that were currently using the server to continue until they disconnected. After all the user sessions were completed, the server was removed from the load balancer and the release process could start.
The infrastructure team had a bootstrap script already built out to automatically configure the server. Sometimes this process involved tearing down the whole server and rebuilding it, while other times the release required some simple software updates to be completed before the hardware was ready for the application release. The new release wouldn't require an entire rebuild of the server this time. However, since the last release was 3 months ago, they did have to patch the server, add a new application stack version, and make sure that other configuration requirements were set accordingly. The entirety of the infrastructure process took about an hour for the first server, which would then be repeated for the other servers so that the bootstrapping time would be reduced for the rest of the fleet. As more customers were acquired, the total number of servers in production had grown. To avoid downtime for the production environment these servers were grouped together into pools, which could then be individually targeted for stopping, upgrading, and restarting as needed.
After the initial server was bootstrapped by the infrastructure team and validated through some basic quality and security tests on the server, the operations team would then start the application release process. It was just after 7 p.m. when the operations team started the release process, also known as a deployment, by copying the ZIP file from the production network share to the server. The file was then expanded, into a mess of files and folders which contained system services, application files, and a rather daunting INSTALL_README.txt
file. This README file detailed all of the required install steps and validation checks that the engineering team documented for the operations team to execute.
With the install instruction file open on one screen and the terminal open on another, the install process could start. That is when everything went wrong.
Although the deployment testing in the staging environment had some issues because of missing requirements, those were documented and added to the install process. But what the operations team didn't know was that the server bootstrapping script had reset all of the network configuration files and all of the aplication traffic heading out of the server was being redirected back to itself. As the deployment went through, the application ZIP file was able to get pushed to the server, the filesystem was set up as needed, and the required system services began running. The script used to test the health of the application showed all successful log messages. However, when the script to test the interaction between the application and the database was run, the terminal output showed only connection errors. It took the team over an hour to get everything copied over, stood up, and tested before the network errors were discovered. The release party had come to a grinding halt.
The operations team was in full-on panic mode and the first RCA process had started. If they could not figure out why the server was not able to talk to external machines within the next hour, they would need to tear down the whole server and start over again. While one person from the operations team collaborated with the network and infrastructure team, another operations team member would retrace every action taken since the infrastructure team had finished their tasks. After 30 minutes of the network team analyzing all traffic related to the new server and the desired databases, they could not find any reason as to why the server could not reach the database. The Infrastructure team was checking to see if the server had been properly added to the domain and that no other machines were using the same hostname or IP address. The operations team had engaged the on-call engineering team and started a troubleshooting conference bridge for the data center support team to join.
It wasn't until a few minutes after midnight that the network team found the networking loopback issue on the server. The outcome of the RCA process found that the server bootstrapping script was the culprit, which was then altered to avoid the issue in the future. The server was now passing all health checks and the operations team could move on to the next server in the pool. Within an hour, the rest of the server pool had been fully upgraded without an issue. Almost two hours later, all server pools were upgraded and reporting healthy. The post-mortem process could begin now that the new application version was out in production and operating as expected.
A release party would always start with an initial test release into production, known as a deployment. At a high level, a deployment process solely concerned with is copying an artifact from a designated location to some endpoint or host. In the case of the quarterly release party, a deployment consisted of pushing or pulling the artifact to a designated test server in the production fleet. This was a common method used to avoid production downtime by preventing unknown production-specific nuances from negatively affecting an environment-wide deployment. The log output and application metrics for the test deployment were heavily scrutinized in an attempt to catch any hint of an issue.
The deployment on the test production server would typically require some bootstrapping process to enforce a common starting point for all future deployments. If all deployments started with the same server configuration starting point, then theoretically, nothing should be different as the deployments moved from one server to the next.
Once the deployment was complete, before traffic would be allowed on the application, a set of tests would be executed.These health checks would validate many different requirements, such as external service connectivity and buisness-critical functionality. After that initial production server was completed and the validation tests had passed, the rest of the initial server pool would be put through the same process. After the initial server pool was upgraded, it would be added to a load balancer with a set of load and smoke tests validating that the application, servers, and networking were operational.
Finally, after that initial server pool was completed, production traffic would then be routed appropriately. The next server pool in the queue would then follow the same process to test for deployment consistency. Once the release team had confidence in the process, they could upgrade multiple server pools in.
Even with a release party as intense as the one described happening on a quarterly basis, it was not the only release party for the engineering teams. Two other major deployment events would take place throughout the quarter: one for deployments of patches and hotfixes across the different architecture layers, , and the other for all non-production environments.
The first deployment that would happen throughout the quarter was when the engineering teams would need to test their code. The testing of the code started by packaging the code into an artifact and uploading it to a network share. Once the artifact was on the network share, the engineering team could then deploy their artifact to a designated server in the development environment. Usually, the engineer would have to run a bootstrapping script to reset the server to a desired state.
The process of deploying an artifact to a development environment had to be relatively repeatable , since the deployment frequency was either daily or weekly.. When the engineers believed that they had an artifact ready to be released, they would then deploy it to a test server in the Quality Assurance (QA) environment for further evaluation.
Along with all of the QA processes and testsneeding to be run, the team would need to start building out the INSTALL_README.txt
file for the production release. The QA team would send testing feedback to the developers for any required fixes or improvements. After a few rounds of feedback between the teams, the most recent artifact version would be promoted to a release candidate. The teams would then focus on the deployment process for the next release party. The handoff of the artifact from the developers to the operations team would happen about a month before the release party. Often described as "throwing it over the wall", the developers would have little to no interaction with the artifact once it was passed on the operations. The operations team would then spend the next month practicing the deployment for the release party.
The other major deployment events taking place throughout the quarter were the patching and hotfix releases.
Similar to the development deployment process, the patching deployment process would be executed against lower-level environments for both testing and repeatability. The major difference was that these deployments would take place outside of typical maintenance windows.
The initial set of releases would start with the development environment, allowing for significant testing to take place. This would prevent regressions from affecting higher-level environments, such as QA or production. Once the deployments to the lower-level environments were repeatable, the teams would designate an evening or weekendfor the deployment to take place. Similar to the release party, one server would be removed, patched, restarted checked and then made available for users. Assuming everything behaved as expected with the patched application, the rest of the servers would be put through the same set of tasks.
A deployment is an essential process focused on getting an artifact into an environment. The more a deployment process can be automated, the more repeatable, reliable, and scalable the deployment process becomes.