
Why did last week’s Azure cloud outage happen? Here’s Microsoft’s Root Cause Analysis Summary.

  • 3 min read
  • 12 Sep 2018


Earlier this month, Microsoft Azure experienced problems that left users unable to access its cloud services. The outage in the South Central US region affected several Azure services and took them offline for U.S. users. The reason for the outage was stated as “severe weather”. Microsoft is currently conducting a root cause analysis (RCA) to find out the exact reason.

Many services went offline because a cooling system failure caused servers to overheat and shut themselves down.

What did the RCA reveal about the Azure outage?


High-energy storms associated with Hurricane Gordon hit the southern area of Texas near Microsoft Azure’s data centers for South Central US. Many data centers were affected and experienced voltage fluctuations. Increased electrical activity from lightning caused significant voltage swells. The rise in voltage, in turn, caused a portion of one data center to switch to generator power.

The power swells also shut down the mechanical cooling systems despite surge suppressors being in place. With the cooling systems offline, temperatures rose beyond the thermal buffer within the cooling system. Once temperatures exceeded the safe operational threshold, an automated shutdown of devices was initiated.

The shutdown mechanism is installed to preserve infrastructure and data integrity. But in this incident, temperatures rose so quickly in some areas of the data center that hardware was damaged before a shutdown could be initiated. Many storage servers, along with some network devices and power units, were damaged.
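
The race between rising temperature and the automated shutdown can be illustrated with a small sketch. It is not Microsoft's implementation: the thresholds, the shutdown duration, and the `outcome` function are hypothetical values and names chosen only to show why a fast enough temperature rise can cause damage even when the shutdown triggers correctly.

```python
# Hypothetical values for illustration only; none come from the RCA.
SHUTDOWN_THRESHOLD_C = 35.0   # assumed temperature that triggers the automated shutdown
DAMAGE_THRESHOLD_C = 45.0     # assumed temperature at which hardware is damaged
SHUTDOWN_DURATION_MIN = 5.0   # assumed time from trigger until devices are powered off


def outcome(rise_c_per_min: float) -> str:
    """Classify a zone by how fast it heats up once cooling is offline."""
    # Temperature keeps climbing while the shutdown is still in progress.
    temp_when_off = SHUTDOWN_THRESHOLD_C + rise_c_per_min * SHUTDOWN_DURATION_MIN
    if temp_when_off >= DAMAGE_THRESHOLD_C:
        return "hardware damage"
    return "safe shutdown"


# A slowly heating zone shuts down in time; a rapidly heating one does not.
for rate in (1.0, 3.0):
    print(f"{rate} C/min -> {outcome(rate)}")
```

Under these assumed numbers, a zone warming by 1 °C per minute powers off safely, while one warming by 3 °C per minute crosses the damage threshold before the shutdown completes.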

Microsoft is taking steps to prevent further damage as the storms are still active in the area. It is switching the remaining data centers to generator power to stabilize the power supply. To recover the damaged units, the first step was to restore the Azure Software Load Balancers (SLBs) for storage scale units. The next step was to recover the storage servers and the data on them by replacing failed components and migrating data to healthy storage units, while validating that no data was corrupted.

The Azure website also states that “Impacted customers will receive a credit pursuant to the Microsoft Azure Service Level Agreement, in their October billing statement.”

A detailed analysis will be available on their website in the coming weeks. For more details on the RCA and customer impact, visit the Azure website.

Real clouds take out Microsoft’s Azure Cloud; users, developers suffer indefinite Azure outage

Microsoft Azure’s new governance DApp: An enterprise blockchain without mining

Microsoft Azure now supports NVIDIA GPU Cloud (NGC)
