
Cloudflare suffers 2nd major internet outage in a week. This time due to globally deploying a rogue regex rule.

  • 03 Jul 2019

For the second time in less than a week, Cloudflare was at the center of a major internet outage. Yesterday, a software glitch affected many websites for about an hour. Last week, Cloudflare users faced a major outage when Verizon accidentally rerouted IP packets after it wrongly accepted a network misconfiguration from a small ISP in Pennsylvania, USA.

Cloudflare’s CTO John Graham-Cumming wrote that yesterday’s outage was due to a massive spike in CPU utilization across its network.

[Image. Source: Cloudflare]


Many users complained of seeing "502 errors" in their browsers when they tried to visit sites that use Cloudflare. Downdetector, a website that tracks ongoing outages and service interruptions, also displayed a 502 error message.

https://twitter.com/t_husoy/status/1146058460141772802

Graham-Cumming wrote, “This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels”.

A single misconfigured rule, the actual cause of the outage


The cause of the outage was a single misconfigured rule within the Cloudflare Web Application Firewall (WAF), pushed out during a routine deployment of new Cloudflare WAF Managed Rules. Although the company has automated systems to run test suites and a procedure for deploying progressively to prevent incidents, these WAF rules were deployed globally in one go, which caused yesterday’s outage.
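To make the contrast between a progressive deployment and a one-shot global deployment concrete, here is a minimal sketch of a staged rollout gate. It is not Cloudflare's deployment system: the stage names and the `deploy`, `healthy`, and `rollback` callbacks are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, Iterable, List

# Hypothetical stage names; a real network would use its own topology.
STAGES: List[str] = ["canary", "one-region", "global"]


def staged_rollout(
    stages: Iterable[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy one stage at a time and stop at the first unhealthy stage.

    Returns True if every stage was deployed and passed its health check.
    On failure, rolls back the failing stage and all earlier stages
    (newest first) and returns False without touching later stages.
    """
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        if not healthy(stage):
            for done in [stage] + completed[::-1]:
                rollback(done)
            return False
        completed.append(stage)
    return True
```

With a gate like this, a rule that overloads the canary stage would never reach the global stage, because the loop returns before deploying it.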

https://twitter.com/mjos_crypto/status/1146168236393807872

These new rules were intended to improve the blocking of inline JavaScript that is used in attacks. “Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%”, Graham-Cumming writes.
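How a single regular expression can saturate a CPU is easiest to see with a small experiment. The sketch below uses a textbook pathological pattern, `(a+)+$`, which is an assumption made for illustration and not the rule Cloudflare deployed. In a backtracking regex engine such as Python's `re`, the time to reject a non-matching input grows roughly exponentially with the input length.

```python
import re
import time
from typing import Tuple

# Textbook pathological pattern, NOT the rule Cloudflare deployed.
# The nested quantifier lets the engine split a run of "a" characters
# between the inner and outer "+" in exponentially many ways.
PATHOLOGICAL = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> Tuple[bool, float]:
    """Try to match 'a' * n + 'b' and return (matched, elapsed seconds).

    The trailing 'b' guarantees the match fails, so the engine has to
    explore every way of splitting the run of 'a' before giving up.
    """
    text = "a" * n + "b"
    start = time.perf_counter()
    matched = PATHOLOGICAL.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the work.
    for n in (8, 12, 16, 20):
        matched, seconds = time_failed_match(n)
        print(f"n={n:2d} matched={matched} seconds={seconds:.4f}")
```

Running the script should show the elapsed time climbing steeply as `n` grows, even though the input is only a few dozen bytes long. A firewall applies its rules to every request, so a pattern with this behavior can consume all available CPU.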

After identifying the cause, Cloudflare issued a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU usage back to normal and restored traffic at 1409 UTC. After verifying that the problem was fixed correctly, the company re-enabled the WAF Managed Rulesets at 1452 UTC.
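The idea behind a ‘global kill’ can be sketched as a single flag that bypasses rule evaluation entirely. The `RuleEngine` class below is a hypothetical illustration, not Cloudflare's implementation. It shows why flipping the switch frees the CPU at once: while disabled, no rule runs at all, so a pathological rule costs nothing.

```python
from typing import Callable, List

Rule = Callable[[str], bool]  # returns True if the request should be blocked


class RuleEngine:
    """Hypothetical rule engine with a kill switch for the whole ruleset."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.enabled = True

    def kill(self) -> None:
        """Disable every rule at once; requests pass through unchecked."""
        self.enabled = False

    def enable(self) -> None:
        """Turn the ruleset back on after the faulty rule is fixed."""
        self.enabled = True

    def should_block(self, request: str) -> bool:
        if not self.enabled:
            # Kill switch active: skip evaluation, so no rule uses any CPU.
            return False
        return any(rule(request) for rule in self.rules)
```

The trade-off is that requests go unfiltered while the switch is off, which is presumably why the rulesets were re-enabled once the fix had been verified.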

https://twitter.com/SwiftOnSecurity/status/1146260831899914247

“Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future”, the Cloudflare blog states.

One user said Cloudflare should have been more careful about rolling out the change globally when it was meant to go through a staged rollout.

https://twitter.com/copyconstruct/status/1146199044965797888

Cloudflare confirms the outage was ‘a mistake’ and not an attack


Cloudflare also faced speculation that the outage was caused by a DDoS attack from China, Iran, North Korea, or elsewhere. Graham-Cumming tweeted that this was untrue: “It was not an attack by anyone from anywhere”.

Cloudflare’s CEO, Matthew Prince, also confirmed that the outage was not the result of an attack but a “mistake on our part.”

https://twitter.com/jgrahamc/status/1146078278278635520

Many users have applauded Cloudflare for acknowledging that this was an organizational and engineering management issue rather than an individual’s fault.

https://twitter.com/GossiTheDog/status/1146188220268470277

Prince told Inc., “I'm not an alarmist or a conspiracy theorist, but you don't have to be either to recognize that it is ultimately your responsibility to have a plan. If all it takes for half the internet to go dark for 20 minutes is some poorly deployed software code, imagine what happens when the next time it's intentional.”

For more details, read Cloudflare’s official blog post.
