GitHub services experienced a 41-minute disruption yesterday

Update: Yesterday the GitHub team in a blog post stated what they have uncovered in their initial investigation, “On Monday at 3:46 pm UTC, several services on GitHub.com experienced a 41-minute disruption, and as a result, some services were degraded for a longer period. Our initial investigation suggests a logic error introduced into our deployment pipeline manifested during a subsequent and unrelated deployment of the GitHub.com website. This chain of events destabilized a number of internal systems, complicated our recovery efforts, and resulted in an interruption of service.”

It was not a very productive Monday for many developers when GitHub started showing 500 and 422 error code on their repositories. This was because several services on GitHub were down yesterday from around 15:46 UTC for 41 minutes. Soon GitHub engineers began their investigation and all the services were back to normal by 19:47 UTC.

https://twitter.com/githubstatus/status/1153391172167114752

The outage affected GitHub services including Git operations, API requests, Gist, among others. The experiences that developers reported were quite inconsistent. Some developers said that though they were able to open the main repo page, they could not see commit log or PRs. Others reported that all the git commands that required interaction with GitHub’s remotes failed.

A developer commented on Hacker News, “Git is fine, and the outage does not affect you and your team if you already have the source tree anywhere. What it does affect is the ability to do code reviews, work with issues, maybe even do releases. All the non-DVCS stuff.”

GitHub is yet to share the cause and impact of the downtime. However, developers took to different discussion forums to share what they think the reason behind GitHub outage could be. While some speculated that it might be its increasing user base, others believed it was because GitHub might be still moving “stuff to Azure after the acquisition.”

Developers also discussed what steps they can take so that such outages do not affect their workflow in the future. One developer suggested not to rely on a single point of failure by setting two different URLs for the same remote so that a single push command will push to both.

You can do something like this, a developer suggested:

git remote set-url --add --push origin git@github.com:Foo/bar.git

git remote set-url --add --push origin git@gitlab.com:Foo/bar.git

Another developer suggested, “I highly recommend running at least a local, self-hosted git mirror at any tech company, just in these cases. Gitolite + cgit is extremely low maintenance, especially if you host them next to your other production services. Not to mention, if you get the self-hosted route you can use Gerrit, which is still miles better for code review than GitHub, Gitlab, bitbucket and co.”

Others joked that this was a good opportunity to take a few hours of break and relax. “This is the perfect time to take a break. Kick back, have a coffee, contemplate your life choices. That commit can wait, that PR (i was about to merge) can wait too. It's not the end of the world,” a developer commented.

Lately, we are seeing many cases of outages. Earlier this month, almost all of Apple’s iCloud services were down for some users. On July 2, Cloudflare suffered a major outage due to a massive spike in CPU utilization in the network. Last month, Google Calendar was down for nearly three hours around the world. In May, Facebook and its family of apps Whatsapp, Messenger, and Instagram faced another outage in a row. Last year, Github faced issues due to a failure in its data storage system which left the site broken for a complete day.

Several developers took to Twitter to kill their time and vent out frustration:

https://twitter.com/jameskbride/status/1153332862587944960

https://twitter.com/BobString/status/1153329356284055552

https://twitter.com/pikesley/status/1153332278774439941

https://twitter.com/francesc/status/1153336190390550528

Cloudflare RCA: Major outage was a lot more than “a regular expression went bad”

EU’s satellite navigation system, Galileo, suffers major outage; nears 100 hours of downtime

Twitter experienced major outage yesterday due to an internal configuration issue