Update: In a blog post yesterday, the GitHub team shared what its initial investigation has uncovered: “On Monday at 3:46 pm UTC, several services on GitHub.com experienced a 41-minute disruption, and as a result, some services were degraded for a longer period. Our initial investigation suggests a logic error introduced into our deployment pipeline manifested during a subsequent and unrelated deployment of the GitHub.com website. This chain of events destabilized a number of internal systems, complicated our recovery efforts, and resulted in an interruption of service.”
It was not a very productive Monday for many developers when GitHub started showing 500 and 422 error codes on their repositories. Several GitHub services went down yesterday at around 15:46 UTC for 41 minutes. GitHub engineers soon began investigating, and all services were back to normal by 19:47 UTC.
https://twitter.com/githubstatus/status/1153391172167114752
The outage affected GitHub services including Git operations, API requests, and Gist, among others. The experiences developers reported were quite inconsistent. Some said that although they could open the main repo page, they could not see the commit log or PRs. Others reported that every git command that required interaction with GitHub’s remotes failed.
A developer commented on Hacker News, “Git is fine, and the outage does not affect you and your team if you already have the source tree anywhere. What it does affect is the ability to do code reviews, work with issues, maybe even do releases. All the non-DVCS stuff.”
GitHub is yet to share the cause and impact of the downtime. However, developers took to different discussion forums to share what they think the reason behind the GitHub outage could be. While some speculated that it might be its growing user base, others believed it was because GitHub might still be moving “stuff to Azure after the acquisition.”
Developers also discussed what steps they can take so that such outages do not affect their workflow in the future. One developer suggested avoiding a single point of failure by setting two different push URLs for the same remote, so that a single push command pushes to both.
You can do something like this, a developer suggested:
git remote set-url --add --push origin git@github.com:Foo/bar.git
git remote set-url --add --push origin git@gitlab.com:Foo/bar.git
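To see how the two push URLs behave without touching a real hosting service, the same commands can be tried against local stand-ins. The sketch below is illustrative only: the temporary directory and the two bare repositories (primary.git and backup.git) are assumptions that take the place of the GitHub and GitLab remotes, and the branch name main is arbitrary.

```shell
#!/bin/sh
set -eu

# Two local bare repositories stand in for the GitHub and GitLab remotes
# (hypothetical paths; in practice these would be the real SSH URLs).
work=$(mktemp -d)
git init --quiet --bare "$work/primary.git"
git init --quiet --bare "$work/backup.git"

# A working repository whose "origin" initially has a single URL.
git init --quiet "$work/repo"
cd "$work/repo"
git remote add origin "$work/primary.git"

# The first --add --push makes the push URL explicit; the second appends to it.
git remote set-url --add --push origin "$work/primary.git"
git remote set-url --add --push origin "$work/backup.git"

# Create a commit and push once; both bare repositories should receive it.
git -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "test commit"
git push --quiet origin HEAD:refs/heads/main

# List the configured push URLs for origin (two are expected).
git remote get-url --all --push origin
```

Fetches still come from the remote's single fetch URL, so this setup keeps a second copy of pushed commits up to date but does not mirror issues, pull requests, or other hosted data.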
Another developer suggested, “I highly recommend running at least a local, self-hosted git mirror at any tech company, just in these cases. Gitolite + cgit is extremely low maintenance, especially if you host them next to your other production services. Not to mention, if you get the self-hosted route you can use Gerrit, which is still miles better for code review than GitHub, Gitlab, bitbucket and co.”
Others joked that this was a good opportunity to take a break for a few hours and relax. “This is the perfect time to take a break. Kick back, have a coffee, contemplate your life choices. That commit can wait, that PR (i was about to merge) can wait too. It's not the end of the world,” a developer commented.
There has been a string of outages lately. Earlier this month, almost all of Apple’s iCloud services were down for some users. On July 2, Cloudflare suffered a major outage due to a massive spike in CPU utilization in the network. Last month, Google Calendar was down for nearly three hours around the world. In May, Facebook and its family of apps, WhatsApp, Messenger, and Instagram, faced yet another outage. Last year, GitHub faced issues due to a failure in its data storage system, which left the site broken for a complete day.
Several developers took to Twitter to kill time and vent their frustration:
https://twitter.com/jameskbride/status/1153332862587944960
https://twitter.com/BobString/status/1153329356284055552
https://twitter.com/pikesley/status/1153332278774439941
https://twitter.com/francesc/status/1153336190390550528