Fastly System Error Causes Global Content Blackout

A configuration error in the systems of a content delivery provider knocked out numerous websites and apps around the globe Tuesday.

The provider, Fastly, which supports brands like CNN, The Guardian, the New York Times, Hulu, Reddit, HBO Max and Spotify, experienced the outage at about 5:49 a.m. Eastern time in the U.S. and began to recover at 6:39 a.m.

According to National Public Radio, visitors trying to access CNN.com during the outage received the message “Fastly error: unknown domain: cnn.com.” On the websites of the New York Times and the UK government, an “Error 503 Service Unavailable” notice appeared, along with the line “Varnish cache server.” Varnish is a technology used by Fastly.

When reached by TechNewsWorld about the outage, a Fastly spokesperson responded with the following statement: “All Fastly cache nodes have now been restored across our global network. We identified a service configuration that triggered disruptions across our points of presence globally and have disabled that configuration.”

Content Delivery Networks

Fastly is what’s known as a content delivery network. CDNs have been around for more than 20 years, although they’ve evolved and expanded over that time.

“Most content on the internet that users interact with is getting served to them by content delivery networks,” observed Doug Madory, director of internet analysis at Kentik, a network observability company in San Francisco.

“There’s been some consolidation in the industry; so when there’s an outage, it can take out a lot of stuff,” he told TechNewsWorld.

Andy Champagne, senior vice president in the office of the CTO at Akamai, a content delivery and cloud security provider in Cambridge, Mass., explained that pumping out content from one location won’t physically work for content providers.

“You can’t build a location big enough, connected enough, and close enough to everything,” he told TechNewsWorld. “That’s why we have around 300,000 servers around the world to distribute content.”

“Anybody that’s a big brand today and even smaller brands are using content delivery networks to distribute their content,” he continued.

“One of the challenges of the internet is that scale can catch you off guard,” he said. “All of a sudden something can become extremely popular. People all of a sudden may want to download it, listen to it, play it, watch it, buy it. That’s where CDNs can really help. They can scale up instantly.”

Lowering Latency

Jonathan Tanner, a senior security researcher at Barracuda Networks, a security and storage solutions provider based in Campbell, Calif., explained that content delivery networks typically host frequently loaded content, such as images for other websites or even entire websites, in a distributed manner to enable faster load times.

“Essentially, they will host the same content in multiple data centers across the world, and when a user goes to a website that loads content from the CDN, they will load that content from the closest data center to that user,” he told TechNewsWorld.

“That takes the bandwidth load off of their customer by not having larger files loading from the CDN customer’s own servers, and also enables lower latency for the users by serving content from a geographically closer location to that user than where the website of the CDN customer is being hosted,” he said.

“The CDN customer could host copies of their entire site in multiple data centers to achieve the same effect,” he added, “but this would require a lot more overhead than simply hiring a company like Fastly that does this at scale.”
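As a rough illustration of the "closest data center" idea Tanner describes, the sketch below ranks a handful of data centers by great-circle distance from a user. The data center names and coordinates are hypothetical, and geographic distance is an assumption made for simplicity; production CDNs generally steer traffic using network-level signals rather than a lookup like this one.

```python
import math

# Hypothetical data centers (name -> (latitude, longitude)); illustrative only.
DATA_CENTERS = {
    "new-york": (40.71, -74.01),
    "london": (51.51, -0.13),
    "tokyo": (35.68, 139.69),
    "sydney": (-33.87, 151.21),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def rank_data_centers(user_location, data_centers=DATA_CENTERS):
    """Return data center names ordered from closest to farthest."""
    return sorted(data_centers, key=lambda name: haversine_km(user_location, data_centers[name]))


def nearest_data_center(user_location, data_centers=DATA_CENTERS):
    """Return the name of the data center closest to the user."""
    return rank_data_centers(user_location, data_centers)[0]
```

In this sketch, a user in Paris (roughly 48.86, 2.35) would be served from the hypothetical "london" entry, while a user in Osaka would be served from "tokyo".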

Multiplying Disaster

Although details about the service configuration that caused the outage at Fastly haven’t been made public yet, CDNs can have a lot of moving parts, and the systems are constantly being updated.

“A provider usually tests the updates in stages to make sure an update isn’t going to cause a problem,” Madory explained. “Sometimes, for the sake of expediency, they make changes on the fly that don’t go through the same rigorous testing.”

A bad configuration can cause the software to crash entirely, or it might block necessary resources for the software to function properly — either of which would cause an outage, noted Tanner.

“By the very nature of how CDNs work, the same code and content is being hosted in many different data centers across the world,” he said. “So, if a bad configuration goes out it will possibly be distributed to all of those data centers and cause an outage.”

He explained that CDNs can be more resilient to outages than other kinds of systems because if one data center goes down, users will be directed to the next-closest data center for content.

“However,” he added, “a problem with the core software across all data centers will undoubtedly cause the entire service to go down.”
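The contrast Tanner draws can be sketched in a few lines. The function below walks a closest-first list of data centers and returns the first healthy one. The names and health flags are hypothetical, and this is a simplified model rather than a description of how Fastly routes traffic.

```python
def pick_data_center(ranked, healthy):
    """Return the first healthy data center in closest-first order, or None."""
    for name in ranked:
        if healthy.get(name, False):
            return name
    return None


# Hypothetical closest-first ordering for one user.
ranked = ["london", "new-york", "tokyo"]

# One data center fails: traffic moves to the next-closest one.
local_failure = {"london": False, "new-york": True, "tokyo": True}

# A bad configuration reaches every data center: nothing is left to fail over to.
global_failure = {name: False for name in ranked}
```

With `local_failure`, the function falls back to "new-york". With `global_failure`, it returns `None`, which is the situation a fleet-wide configuration problem creates.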

Upgrade Slowly

If there’s anything to be learned from the Fastly outage, it’s how critical a role distributed networks play in the internet today, and how important it is to make sure that the software in those distributed systems is running properly.

“It also hopefully illustrated an important point about how to better handle updates in the future,” Tanner said. “That is, to not target every data center at once but rather slowly roll out software and verify it is working properly prior to pushing a major change.”

“For CDNs or any other distributed architectures, ensuring that updates to software and configurations are done in a phased manner, rather than to all data centers at once, will certainly help avert these sorts of outages in the future,” he observed.
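A phased rollout of the kind Tanner recommends might look like the following minimal sketch. The stage names, the `apply`, `verify` and `rollback` callbacks, and the simulated fleet are all assumptions for illustration; they do not reflect any particular provider's deployment tooling.

```python
def phased_rollout(stages, apply, verify, rollback):
    """Apply a change one stage at a time.

    After each stage, every data center in that stage is verified. On the
    first failed check, all data centers updated so far are rolled back and
    the rollout stops. Returns (succeeded, data_centers_that_were_updated).
    """
    updated = []
    for stage in stages:
        for dc in stage:
            apply(dc)
            updated.append(dc)
        if not all(verify(dc) for dc in stage):
            for dc in reversed(updated):
                rollback(dc)
            return False, updated
    return True, updated


# Hypothetical stages: a single canary first, then progressively larger groups.
STAGES = [["canary-1"], ["eu-1", "eu-2"], ["us-1", "us-2", "ap-1"]]


def simulate(config_is_good):
    """Run a rollout against an in-memory fleet and report what happened."""
    active = {dc: "old" for stage in STAGES for dc in stage}
    touched = []

    def apply(dc):
        active[dc] = "new"
        touched.append(dc)

    def verify(dc):
        # Stand-in for a real health check against the data center.
        return config_is_good

    def rollback(dc):
        active[dc] = "old"

    ok, _ = phased_rollout(STAGES, apply, verify, rollback)
    return ok, touched, active
```

In the simulation, a bad configuration only ever reaches the canary stage before being rolled back, while a good one proceeds through all six hypothetical data centers.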

“For those utilizing CDNs, having an action plan in the event of such an outage would also be helpful so as to reduce downtime,” he added.

Fastly isn’t alone in experiencing a headline-grabbing outage.

In October 2019, a cyberattack on Amazon Web Services left its customers without access to critical information for more than 10 hours. Last year, IBM Cloud customers suffered a service disruption in June; Cloudflare customers complained in July that visitors were having problems accessing their websites and services; and in November, another AWS snafu disrupted service for its U.S. East Coast customers.

John P. Mello Jr.

John P. Mello Jr. has been an ECT News Network reporter since 2003. His areas of focus include cybersecurity, IT issues, privacy, e-commerce, social media, artificial intelligence, big data and consumer electronics. He has written and edited for numerous publications, including the Boston Business Journal, the Boston Phoenix, Megapixel.Net and Government Security News.
