Cloudflare behind the latest outage to break the internet

  • Services across the internet were down Tuesday morning thanks to a Cloudflare outage
  • Cloudflare has fixed the problem, but the incident raises questions about why this keeps happening
  • Experts told Fierce more infrastructure, regulation and better enterprise preparedness could help mitigate future outages

Yet another cloud outage has taken down huge chunks of the internet this week. This time, though, the culprit isn’t AWS, Microsoft or Google but Cloudflare, a company whose cloud services are meant to improve the security and performance of the websites that use them.

The big question now is: Why does this keep happening?

Cloudflare serves some of the biggest names on the planet, which is why the outage was felt across Google, Microsoft, OpenAI, Spotify, PayPal, Canva, X, League of Legends, and a range of other services Tuesday morning.

A Cloudflare representative told multiple news outlets that the outage was caused by “a configuration file that is automatically generated to manage threat traffic. The file grew beyond an expected size of entries and triggered a crash in the software system that handles traffic for a number of Cloudflare’s services.” The rep added the issue did not appear to be caused by malicious activity.
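The company didn’t share technical details beyond that statement, but the failure mode it describes – a fixed-capacity consumer fed an auto-generated file that outgrows it – is a well-known trap. Here’s a minimal Python sketch of the pattern; the file format, limit and function names are all hypothetical, not Cloudflare’s actual implementation:

```python
# Hypothetical sketch of the failure mode Cloudflare described -- not its
# actual code. A traffic-handling process periodically reloads an
# auto-generated threat-management file; the process was built around a
# fixed capacity, and a file that outgrows it brings the handler down.

MAX_ENTRIES = 200  # capacity the traffic handler was designed for

def load_threat_config(path: str) -> list[str]:
    """Brittle version: a hard limit turns an oversized file into a crash."""
    with open(path) as f:
        entries = [line.strip() for line in f if line.strip()]
    if len(entries) > MAX_ENTRIES:
        raise RuntimeError(f"config has {len(entries)} entries, max {MAX_ENTRIES}")
    return entries

def load_threat_config_defensive(path: str) -> list[str]:
    """Defensive version: truncate, warn and keep serving traffic."""
    with open(path) as f:
        entries = [line.strip() for line in f if line.strip()]
    if len(entries) > MAX_ENTRIES:
        print(f"warning: truncating {len(entries)} entries to {MAX_ENTRIES}")
        entries = entries[:MAX_ENTRIES]
    return entries
```

The gap between those two functions is, roughly, the gap between a bad data file and a global outage.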

By 9:42 am ET, Cloudflare had implemented a fix and indicated that global recovery of its services was underway.

The incident is the latest in a string of outages this year, following those suffered by Google in June, IBM in May, June and August, and AWS and Microsoft in October. Cloudflare itself also had an outage in June.

The series once again raises questions about why this keeps happening and – more importantly – why it is allowed to keep happening with infrastructure that is now critical to the global economy.

What the heck is going on?

In terms of the why, Cisco ThousandEyes began looking at this back in August. After analyzing all the outages that had occurred in the first half of the year, it found three distinct patterns common in the system failures: unintentional failure vectors, hidden functional failures and configuration cascade effects. 

You can check out the deep dive into each of these patterns here. But as to why these patterns have emerged, ThousandEyes found it to be a combination of architectural evolution; the increased specialization of services, which makes it hard for secondary functions to compensate for primary failures; and a disconnect between a problem’s root cause and its manifestation, which makes outages harder to predict with typical healthy-system metrics.

Plus, due to the critical nature of cloud systems these days, outages are just more visible.

Kirk Offel, CEO of Overwatch Mission Critical, explained it to Fierce this way: “When a core service like Cloudflare has a problem, it can create a cascading failure across a large portion of the internet.” Couple that with technology that “can iterate and evolve” much faster than humans can keep up? Well, you end up with a situation in which “we will continue to discover additional issues regarding digital infrastructure resilience and redundancy.”

What happens next?

So, how do we avoid such outages?

Offel said more – and more localized – cloud infrastructure could help. “The only way through this is with stronger designs and more available infrastructure,” he said. The problem? “We simply don’t have a strong enough labor force to build solutions fast enough to keep up with demand.”

And what about regulation?

While there’s no real regulation of cloud providers in the U.S., IEEE Senior Member Kayne McGladrey noted that the European Union does have rules on the books under the Digital Operational Resilience Act (DORA), which regulates financial entities in the EU along with the ICT providers – including cloud providers – that serve them.

“It's quite likely we'll see DORA enforcement actions for these recent outages, as under DORA, financial entities and their ICT providers must be demonstrably resilient against these outages,” he said. “For example, firms that did not use a secondary video conferencing solution faced regulatory scrutiny under the Zoom outage earlier this year (which was caused by a DNS issue), while companies that did use a backup platform faced no such scrutiny.”

Fellow IEEE Senior Member Shaila Rana added that the outages could help make the case for broader cloud regulation akin to what telecom providers face. But implementing such rules could be tricky.

“The challenge is that it isn’t like telecom networks with clear geographic jurisdictions. Cloud services operate globally with complex interdependencies,” she said. “However, things like minimum uptime standards and independent audits should absolutely be on the table.”

Both Rana and McGladrey pointed to multi-cloud adoption as one potential mitigation strategy for enterprises looking to minimize the impact of outages. But McGladrey noted that the approach also introduces new risks, along with new staff training and education requirements.
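What might that look like in practice? A simple version is client-side failover between two independent providers. The sketch below is illustrative only – the endpoint URLs are placeholders, and real multi-cloud setups usually handle this at the DNS or load-balancer layer rather than in application code:

```python
# Illustrative failover between two independent providers; the endpoint
# URLs are placeholders, not real services.
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://api.primary-provider.example/v1/data",    # normal path
    "https://api.secondary-provider.example/v1/data",  # independent backup
]

def fetch_with_failover(endpoints: list[str], timeout: float = 3.0) -> bytes:
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as err:
            last_error = err  # provider unreachable; try the next one
    raise RuntimeError("all providers failed") from last_error
```

The catch, per McGladrey, is that the backup path is itself code and configuration that has to be tested, monitored and staffed – exactly the added risk and training burden he flagged.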

In terms of other potential solutions on the enterprise side, Rana argued for mapping both direct and indirect dependencies to better understand “just how interdependent our tech stack is.” Investments in observability tools, “chaos engineering practices” – that is, deliberately breaking things to test resilience – and “degraded mode operational plans” should also be standard, she said.
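“Degraded mode” is easier to picture with an example. One common building block is a circuit breaker: after repeated failures, an application stops hammering a dead dependency and serves a reduced-but-working fallback instead. A toy sketch follows, with the recommendation service and cached fallback made up purely for illustration:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive errors, stop calling the dependency
    and serve the fallback until reset_after seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:  # breaker is open: degraded mode
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # cool-down over; probe the dependency again
            self.failures = 0
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

def fetch_live_recommendations():
    raise TimeoutError("upstream is down")  # simulate a provider outage

breaker = CircuitBreaker()
for _ in range(5):
    # After three failed calls the breaker opens, and the cached default
    # is served without touching the dead dependency at all.
    print(breaker.call(fetch_live_recommendations,
                       lambda: ["cached-default-list"]))
```

Chaos engineering, in turn, is the practice of tripping paths like this one on purpose – killing the dependency in a test environment to confirm the fallback actually works before a real outage does it for you.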