22:52 UTC, 21st Oct | Orchestrator began a process of leadership deselection, as per Raft consensus. Once Orchestrator had reorganized the US West Coast database cluster topologies and connectivity was restored, write traffic began directing to the new primaries in the West Coast site.
The database servers in the US East Coast data center contained writes that had not been replicated to the US West Coast facility. As a result, the database clusters in both data centers included writes that were not present in the other. This is why the GitHub team was unable to safely fail over (failover being the procedure by which a system transfers control to a redundant system when it detects a failure) the primaries back to the US East Coast data center.
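What the team describes here is a split-brain condition: each site’s primary has committed writes the other has never seen. As a rough illustration of how that divergence can be detected with MySQL’s GTID machinery, consider the sketch below; the hostnames, credentials, and the assumption that the clusters run with GTIDs enabled are illustrative and not taken from GitHub’s report.

```python
# Illustrative only: hostnames, credentials and the GTID-based setup are assumptions,
# not details from GitHub's report.
import mysql.connector

def gtid_executed(host):
    """Return the set of transactions (GTIDs) this server has committed."""
    conn = mysql.connector.connect(host=host, user="repl_check", password="secret")
    cur = conn.cursor()
    cur.execute("SELECT @@GLOBAL.gtid_executed")
    (gtids,) = cur.fetchone()
    conn.close()
    return gtids

east = gtid_executed("mysql-primary.east.example")  # hypothetical East Coast primary
west = gtid_executed("mysql-primary.west.example")  # hypothetical West Coast primary

# GTID_SUBSET(a, b) is a built-in MySQL function returning 1 if every GTID in a
# is also contained in b. If neither set contains the other, both sides hold
# writes the other is missing and a clean failback is impossible.
conn = mysql.connector.connect(host="mysql-primary.west.example",
                               user="repl_check", password="secret")
cur = conn.cursor()
cur.execute("SELECT GTID_SUBSET(%s, %s), GTID_SUBSET(%s, %s)",
            (east, west, west, east))
east_in_west, west_in_east = cur.fetchone()
if not east_in_west and not west_in_east:
    print("Split brain: both data centers hold writes the other lacks.")
```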
22:54 UTC, 21st Oct | GitHub’s internal monitoring systems began generating alerts indicating that systems were experiencing numerous faults. By 23:02 UTC, GitHub engineers had determined that the topologies for numerous database clusters were in an unexpected state. Querying the Orchestrator API displayed a database replication topology containing only servers from the US West Coast data center.
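Orchestrator serves its view of each cluster’s topology over an HTTP API, so a check like the one the engineers describe might look roughly like the following sketch. The base URL, cluster alias, and JSON field names are assumptions about a typical Orchestrator deployment and may differ by version; none of them come from GitHub’s report.

```python
# A rough sketch of inspecting Orchestrator's view of a cluster. The base URL,
# cluster alias and JSON field names here are assumptions, not GitHub internals.
import requests

ORCHESTRATOR = "http://orchestrator.internal.example:3000"  # hypothetical endpoint

resp = requests.get(f"{ORCHESTRATOR}/api/cluster/alias/mysql-main")
resp.raise_for_status()
instances = resp.json()  # assumed: a list of instance records, one per server

datacenters = {inst.get("DataCenter", "unknown") for inst in instances}
print("Data centers present in the reported topology:", datacenters)
if datacenters == {"us-west"}:  # only West Coast servers visible: the unexpected state
    print("Topology contains only US West Coast servers.")
```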
23:07 UTC, 21st Oct | The responding team then manually locked the deployment tooling to prevent any additional changes from being introduced. At 23:09 UTC, the site was placed into yellow status. At 23:11 UTC, the incident coordinator changed the site status to red.
23:13 UTC, 21st Oct | As the issue had affected multiple clusters, additional engineers from GitHub’s database engineering team joined the investigation to determine what actions were needed to manually configure a US East Coast database as the primary for each cluster and rebuild the replication topology (a sketch of that repointing step follows this entry). This was difficult because the West Coast database clusters had, by this point, been ingesting writes from GitHub’s application tier for nearly 40 minutes.
To preserve this data, engineers decided that the 30+ minutes of writes accepted by the US West Coast data center had to be kept. This ruled out any option other than failing forward in order to keep user data safe, so they further extended the outage to ensure the consistency of users’ data.
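Rebuilding a replication topology around a chosen primary essentially means repointing each replica at that primary and letting it catch up. The sketch below shows that repointing step with plain MySQL replication commands and GTID auto-positioning; it is an illustration only, not GitHub’s tooling, which normally automates this through Orchestrator.

```python
# Illustrative repointing of replicas at a designated primary using GTID
# auto-positioning. Hostnames and credentials are invented.
import mysql.connector

NEW_PRIMARY = "mysql-primary.east.example"  # hypothetical East Coast primary

def repoint_replica(replica_host):
    conn = mysql.connector.connect(host=replica_host, user="admin", password="secret")
    cur = conn.cursor()
    cur.execute("STOP SLAVE")
    cur.execute(
        "CHANGE MASTER TO MASTER_HOST = %s, MASTER_USER = 'repl', "
        "MASTER_PASSWORD = 'repl_secret', MASTER_AUTO_POSITION = 1",
        (NEW_PRIMARY,)
    )
    cur.execute("START SLAVE")  # replica now catches up from the new primary
    conn.close()

for host in ["mysql-replica-1.east.example", "mysql-replica-2.east.example"]:
    repoint_replica(host)
```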
23:19 UTC, 21st Oct | After querying the state of the database clusters, GitHub paused the jobs that write metadata about things such as pushes. This led to partially degraded site usability, as webhook delivery and GitHub Pages builds were paused.
“Our strategy was to prioritize data integrity over site usability and time to recovery,” said the GitHub team.
00:05 UTC, 22nd Oct | Engineers began resolving data inconsistencies and implementing failover procedures for MySQL. The recovery plan was to fail forward, synchronize, fall back, and then churn through the backlogs before returning to green.
Restoring multiple terabytes of backup data stretched the process out over hours: the large backup files had to be decompressed, checksummed, prepared, and loaded onto newly provisioned MySQL servers.
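The restore pipeline described here (decompress, checksum, prepare, load) maps closely onto a typical Percona XtraBackup workflow, although the report does not name the tooling GitHub uses. The sketch below assumes XtraBackup-style backups; the paths, filenames, and checksum convention are invented.

```python
# A sketch of a decompress / checksum / prepare / load restore pipeline, assuming
# Percona XtraBackup-style backups. Paths and filenames are invented.
import hashlib
import subprocess

BACKUP_FILE = "/backups/cluster1/base.xbstream.gz"   # hypothetical backup artifact
RESTORE_DIR = "/restore/cluster1"
DATA_DIR = "/var/lib/mysql"

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# 1. Verify the downloaded backup before spending hours restoring it
#    (assumes a .sha256 sidecar file is published alongside the backup).
expected = open(BACKUP_FILE + ".sha256").read().split()[0]
if sha256(BACKUP_FILE) != expected:
    raise SystemExit("backup checksum mismatch")

# 2. Decompress / unpack the streamed backup into the restore directory.
subprocess.run(f"gunzip -c {BACKUP_FILE} | xbstream -x -C {RESTORE_DIR}",
               shell=True, check=True)

# 3. Prepare (apply redo logs) so the data files are consistent.
subprocess.run(["xtrabackup", "--prepare", f"--target-dir={RESTORE_DIR}"], check=True)

# 4. Load the prepared files onto the freshly provisioned server's data directory.
subprocess.run(["xtrabackup", "--copy-back", f"--target-dir={RESTORE_DIR}",
                f"--datadir={DATA_DIR}"], check=True)
```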
00:41 UTC, 22nd Oct | A backup process was initiated for all affected MySQL clusters. Multiple teams of engineers began investigating ways to speed up the transfer and recovery time.
06:51 UTC, 22nd Oct | Several clusters completed restoration from backups in the US East Coast data center and started replicating new data from the West Coast. This resulted in slow load times for pages that executed a write operation over the cross-country link.
The GitHub team identified ways to restore directly from the West Coast in order to overcome the throughput restrictions caused by downloading from off-site storage. The status page was also updated to set an expectation of two hours as the estimated recovery time.
07:46 UTC, 22nd Oct | GitHub published a blog post with more information. “We apologize for the delay. We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints,” said the GitHub team.
11:12 UTC, 22nd Oct | All database primaries were established in the US East Coast again. The site became far more responsive, as writes were now directed to a database server located in the same physical data center as GitHub’s application tier. While this improved performance substantially, there were still dozens of database read replicas lagging behind the primary, which caused users to see inconsistent data on GitHub.
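The inconsistency users saw comes from reading off replicas that have not yet applied the primary’s latest writes. The sketch below shows one common way to measure replica lag and keep lagging replicas out of a read pool; the hostnames and the lag threshold are illustrative, not GitHub’s values.

```python
# Illustrative replica-lag check; hostnames and the 30-second threshold are invented.
import mysql.connector

REPLICAS = ["mysql-replica-1.east.example", "mysql-replica-2.east.example"]
MAX_LAG_SECONDS = 30

def replica_lag(host):
    conn = mysql.connector.connect(host=host, user="monitor", password="secret")
    cur = conn.cursor(dictionary=True)
    cur.execute("SHOW SLAVE STATUS")
    row = cur.fetchone()
    conn.close()
    # Seconds_Behind_Master is NULL (None) while replication is broken or stopped.
    return row["Seconds_Behind_Master"] if row else None

healthy = [h for h in REPLICAS
           if (lag := replica_lag(h)) is not None and lag <= MAX_LAG_SECONDS]
print("Replicas eligible for read traffic:", healthy)
```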
13:15 UTC, 22nd Oct | GitHub.com started to experience peak traffic load, so engineers brought into service the additional MySQL read replicas that had been provisioned in the US East Coast public cloud earlier in the incident.
16:24 UTC, 22nd Oct | Once the replicas were in sync, a failover to the original topology was conducted, addressing the immediate latency and availability concerns. To prioritize data integrity, the service status was kept red while GitHub began processing the accumulated backlog of data.
16:45 UTC, 22nd Oct | At this point, engineers had to balance processing the increased load represented by the backlog against the risk of overwhelming GitHub’s ecosystem partners with notifications. There were over five million hook events and 80 thousand Pages builds queued.
“As we re-enabled processing of this data, we processed ~200,000 webhook payloads that had outlived an internal TTL and were dropped. Upon discovering this, we paused that processing and pushed a change to increase that TTL for the time being,” said the GitHub team. To avoid degrading the reliability of their status updates, GitHub remained in degraded status until the entire backlog of data had been processed.
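The TTL behaviour described in the quote is common in delivery queues: payloads older than a cutoff are treated as stale and skipped, which during a long outage silently discards the very backlog being drained. The hypothetical sketch below illustrates that logic and the mitigation of temporarily raising the TTL; the queue shape, field names, and TTL values are all invented.

```python
# Hypothetical webhook worker illustrating the TTL behaviour described above.
# The queue structure, field names, and TTL values are invented for illustration.
import time

NORMAL_TTL_SECONDS = 4 * 60 * 60        # e.g. drop payloads older than 4 hours
EXTENDED_TTL_SECONDS = 48 * 60 * 60     # temporary bump while draining the backlog

def deliver(payload, ttl_seconds):
    age = time.time() - payload["enqueued_at"]
    if age > ttl_seconds:
        # Under the normal TTL, everything queued during a long outage lands here
        # and is silently dropped instead of delivered.
        return "dropped"
    # ... send the webhook to the subscriber here ...
    return "delivered"

# A payload that sat in the queue for ten hours during the outage.
backlog = [{"enqueued_at": time.time() - 10 * 60 * 60,
            "url": "https://partner.example/hook"}]

print([deliver(p, NORMAL_TTL_SECONDS) for p in backlog])    # ['dropped']
print([deliver(p, EXTENDED_TTL_SECONDS) for p in backlog])  # ['delivered']
```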
23:03 UTC, 22nd Oct | At this point, all pending webhooks and Pages builds had been processed, and the integrity and proper operation of all systems had been confirmed. The site status was updated to green.
Apart from this, GitHub has identified a number of technical initiatives and continues to work through an extensive post-incident analysis process internally.
“All of us at GitHub would like to sincerely apologize for the impact this caused to each and every one of you. We’re aware of the trust you place in GitHub and take pride in building resilient systems that enable our platform to remain highly available. With this incident, we failed you, and we are deeply sorry,” said the GitHub team.
For more information, check out the official GitHub Blog post.