Cloudflare outage in 19 data centers worldwide due to own error

Wed, 22nd Jun 2022

Content delivery network Cloudflare says yesterday's outage at 19 of its data centers was caused by a change made as part of a long-running project to increase resilience in its busiest locations.

Many sites went down as a result, including Discord and Shopify. The company says that although the affected locations, which include Mumbai, Osaka, Singapore, Sydney, and Tokyo, make up only 4% of its total network, the outage impacted 50% of total requests.

Cloudflare says the first outage started at 06:27 UTC on June 21. At 06:58 UTC, the company brought the first data center back online, and by 07:42 UTC, all data centers were online and working correctly.
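For readers who want the durations implied by those timestamps, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper `utc` and the variable names are hypothetical, and the three times are the ones Cloudflare reported.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Hypothetical helper: parse an HH:MM time on 21 June 2022 as UTC."""
    parsed = datetime.strptime(f"2022-06-21 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


# Timestamps as reported by Cloudflare.
outage_start = utc("06:27")
first_restored = utc("06:58")
all_restored = utc("07:42")

# Whole minutes from the start of the outage to each recovery milestone.
minutes_to_first = int((first_restored - outage_start).total_seconds() // 60)
minutes_total = int((all_restored - outage_start).total_seconds() // 60)

print(f"First data center restored after {minutes_to_first} minutes")
print(f"All data centers restored after {minutes_total} minutes")
```

On those figures, the first data center came back after 31 minutes and the full recovery took 75 minutes.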

In a statement on its blog, Cloudflare says, "We are very sorry for this outage. This was our error and not the result of an attack or malicious activity."

The company says over the last 18 months, it has been working on converting all of its busiest locations to a more flexible and resilient architecture. During this time, Cloudflare converted 19 of its data centers to this architecture, internally called Multi-Colo PoP (MCP).

Cloudflare says this new architecture provides significant reliability improvements, allowing it to run maintenance in those locations without disrupting customer traffic. But those locations also carry a substantial proportion of Cloudflare's traffic, so any problem there can have a wide impact.

In order to be reachable on the internet, networks like Cloudflare make use of a protocol called BGP. As part of this protocol, operators define policies that decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to) or accepted from peers.

Cloudflare says these policies have individual components, which are evaluated sequentially. The result is that any given prefix will either be advertised or not advertised. A policy change can mean a previously advertised prefix is no longer advertised, known as being "withdrawn", and those IP addresses will no longer be reachable on the internet.

The company says while deploying a change to its prefix advertisement policies, a re-ordering of terms caused it to withdraw a critical subset of prefixes.
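To illustrate why the order of terms matters, here is a minimal Python sketch of sequential policy evaluation. It is not Cloudflare's configuration or any router's real policy language. The term names, the `Term` class, the `advertised` function and the prefixes (taken from reserved documentation ranges) are all hypothetical, and the sketch assumes that the first matching term decides the outcome and that a prefix matching no term is not advertised.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_network
from typing import Callable, List


@dataclass
class Term:
    """One component of a hypothetical advertisement policy."""
    name: str
    matches: Callable[[IPv4Network], bool]
    action: str  # "accept" (advertise) or "reject" (do not advertise)


def advertised(policy: List[Term], prefixes: List[IPv4Network]) -> List[IPv4Network]:
    """Evaluate terms in order; the first matching term decides each prefix."""
    result = []
    for prefix in prefixes:
        for term in policy:
            if term.matches(prefix):
                if term.action == "accept":
                    result.append(prefix)
                break  # later terms are never consulted for this prefix
        # Assumption: a prefix that matches no term is not advertised.
    return result


# Hypothetical prefixes from documentation address ranges.
critical = ip_network("192.0.2.0/24")
other = ip_network("198.51.100.0/24")
prefixes = [critical, other]

# Hypothetical terms: one accepts the critical range, one rejects everything.
accept_critical = Term("accept-critical", lambda p: p.subnet_of(critical), "accept")
reject_rest = Term("reject-the-rest", lambda p: True, "reject")

before = advertised([accept_critical, reject_rest], prefixes)  # original order
after = advertised([reject_rest, accept_critical], prefixes)   # re-ordered

# Prefixes advertised before the change but not after it are "withdrawn".
withdrawn = [p for p in before if p not in after]
print("Withdrawn:", withdrawn)
```

In this sketch the two policies contain exactly the same terms. Moving the catch-all reject term ahead of the accept term is enough to withdraw the critical prefix, because the accept term is never reached.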

Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. However, the company says it has backup procedures for handling such an event and used them to take control of the affected locations.

But F5 Labs threat research lead David Warburton says this should be a reminder of the dangers of the centralisation of large cloud solution providers.

"In a traditional internet app deployment model, an outage of a server or misconfigured application might take out a single website," he says.

"But similar problems with a cloud solution provider can end up taking out all of their customers, resulting in not one website being taken offline, but hundreds or thousands. The impact can potentially affect organisations' digital experiences, revenues and reputations."

Warburton says cloud solution providers provide immeasurable benefits to their users, but the re-centralisation of the internet through these cloud solutions is now causing the very problems the original design of the internet was intended to avoid through redundancy.

ServerChoice commercial director Adam Bradshaw says outages can cause businesses serious reputational and financial damage.

"For a smaller or medium-sized business, the risk associated with such failures in service can be critical. Even short outages in the digital economy could cause damage," he says.

"Diversifying an IT environment reduces the likelihood of third-party outages negatively impacting a business. Owning the hardware gives an organisation more control, with IT components such as colocation services providing an organisation with a backup when cloud services fail."
