Did a bad data centre software update bring down CNN, the New York Times, the BBC, Spotify, Amazon, Google and others?
Fake news is a problem of our new, digital world. We live in times where the truth can sometimes be overshadowed by incorrect facts and viral social media posts.
Updated June 13, 2021 / Original June 12, 2021
Co-founder and Editor, The Tech Capital
June 12, 2021 | 11:37 PM BST
At about 11am BST on June 8, 2021, a large number of websites and apps across the world were reported to have gone dark. CNN, the Guardian, the New York Times, the BBC and the Verge were some of the many news organisations impacted.
Twitch, Pinterest, HBO Max, Hulu, Reddit and Spotify were also affected and are just part of a wider pool of content platforms that experienced trouble on June 8.
At this point, with some of the world’s largest media houses and media players affected, the story did not seem like it could get any worse. But it did.
Major internet platforms and government websites also got hit, including eBay, Google, Amazon, Target, and Gov.uk.
This was what can be called a global outage, and it lasted close to an hour in some cases. Many users were faced with a 503 error page, meaning the server answering their request was unable to deliver the page they wanted – essentially, those sites were cut off due to server downtime.
Internet outages are not new and in a way are becoming more common, but one could counter-argue that internet usage has also exploded, creating a wider opportunity for things to go wrong. That should not be used as an excuse, especially when livelihoods depend ever more on businesses being online.
What is not that common is an outage of this magnitude, affecting such large websites and brands, from media to governments and even some of the world’s most valuable brands like Amazon and Google.
The fault was traced back to one of the world’s largest hosting businesses and one of the key data centre providers to all these brands: US-based cloud computing provider Fastly (NYSE:FSLY).
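For readers unfamiliar with HTTP status codes, a 503 is an answer that comes back from a server rather than a sign that no connection could be made. The short Python sketch below uses only the standard library to look up what a code means; the helper name `explain_status` is made up for illustration and is not part of any Fastly tooling.

```python
from http import HTTPStatus


def explain_status(code: int) -> str:
    """Describe an HTTP status code and which side the problem sits on.

    Illustrative helper only; raises ValueError for codes unknown to
    Python's standard library.
    """
    status = HTTPStatus(code)
    if 500 <= code <= 599:
        side = "server-side error"
    elif 400 <= code <= 499:
        side = "client-side error"
    else:
        side = "not an error"
    return f"{code} {status.phrase} ({side})"


print(explain_status(503))  # 503 Service Unavailable (server-side error)
```

The 5xx range signals that the request reached a server which could not fulfil it, which fits an outage in the delivery layer rather than a problem with users' own connections.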
Who is Fastly? Fastly calls itself an edge cloud platform that “enables the best of the web to thrive, and helps you deliver better online experiences”.
“Our powerful edge cloud platform empowers developers to run, secure, and deliver websites and applications – as close to the users as possible, to create unforgettable experiences at global scale,” it reads on the company’s website.
Fastly reports on its website that it receives 800bn+ code requests every day with 15tr+ log lines delivered per month.
So what went wrong? The company reacted to the incident fairly quickly after it began – in fact, in less than a minute, engineers were hard at work on a fix. The company said that a bad “service configuration” caused disruptions across its global network of content delivery endpoints.
Nick Rockwell, SVP of engineering and infrastructure, wrote on the company’s blog: “We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change. We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.”
The incident affected users across all time zones including Europe (Amsterdam (AMS), Dublin (DUB), Frankfurt (FRA), Frankfurt (HHN), London (LCY)), North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD), Ashburn (WDC), Atlanta (FTY), Atlanta (PDK), Boston (BOS), Chicago (ORD), Dallas (DAL), Los Angeles (LAX)), and Asia/Pacific (Hong Kong (HKG), Tokyo (HND), Tokyo (TYO), Singapore (QPG)).
At the end of Q1 2021, Fastly operated in 26 countries and 58 markets across Europe, Africa, the Middle East, Asia, Oceania, North America and Latin America.
Rockwell explained that on May 12, Fastly had begun a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
Early on June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of Fastly’s network to return errors.
Here’s a timeline of the day’s activity (all times are in UTC), according to Rockwell:
- 09:47 Initial onset of global disruption
- 09:48 Global disruption identified by Fastly monitoring
- 09:58 Status post is published
- 10:27 Fastly Engineering identified the customer configuration
- 10:36 Impacted services began to recover
- 11:00 Majority of services recovered
- 12:35 Incident mitigated
- 12:44 Status post resolved
- 17:25 Bug fix deployment began
“Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25,” Rockwell said.
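The timeline above can be converted into minutes elapsed since the onset of the disruption. The following is a minimal Python sketch using the times exactly as listed by Rockwell; the function and variable names are illustrative, and the calculation assumes every entry falls on the same day (June 8, 2021, UTC).

```python
from datetime import datetime

# Timeline entries as listed above (all times UTC, June 8, 2021).
TIMELINE = [
    ("09:47", "Initial onset of global disruption"),
    ("09:48", "Global disruption identified by Fastly monitoring"),
    ("09:58", "Status post is published"),
    ("10:27", "Fastly Engineering identified the customer configuration"),
    ("10:36", "Impacted services began to recover"),
    ("11:00", "Majority of services recovered"),
    ("12:35", "Incident mitigated"),
    ("12:44", "Status post resolved"),
    ("17:25", "Bug fix deployment began"),
]


def minutes_since_onset(timeline):
    """Return (minutes after the first entry, label) for every entry."""
    parsed = [(datetime.strptime(t, "%H:%M"), label) for t, label in timeline]
    onset = parsed[0][0]
    return [
        (int((when - onset).total_seconds() // 60), label)
        for when, label in parsed
    ]


ELAPSED = minutes_since_onset(TIMELINE)
for minutes, label in ELAPSED:
    print(f"+{minutes:>3} min  {label}")
```

On these figures, detection came one minute after onset and services began to recover at the 49-minute mark, which appears consistent with the 49 minutes Rockwell cites for 95% of the network operating as normal. The permanent fix began rolling out 458 minutes, or 7 hours 38 minutes, after the disruption started.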
This story is true. All those brands from the media to CDNs, OTTs, governments and more were affected by the June 8 outage.
Although the wide geographic reach of the disruption and the size of the brands involved make the story sound exaggerated, the reports all check out, with Fastly confirming that a software bug caused the whole problem.
Astonishingly, the bad code that came to knock down world-leading websites had been lingering on the provider’s systems for nearly a month until a customer hit a ‘coding jackpot’ that no one of goodwill would want to win.
The shockwaves of the outage sent Fastly’s shares plummeting to almost 60% below their 52-week high. The stock has since regained some ground and closed the week in the green. In fact, by Friday’s close, Fastly stock was up 4.62% compared to the same day a year ago.
Nevertheless, outages are costly not only for the clients affected but also for providers. The June 8 blackout is likely to cause some damage to Fastly’s quarterly results, as the company’s service level agreements (SLA) entitle customers with enterprise-level support deals to refunds.
The financial aftershock will only be fully visible once Fastly releases its Q2 results which, going by the timing of the Q2 2019 and Q2 2020 releases, could fall on August 4 or 5.
In Q1 2021, the company reported top-line growth of 35% year-over-year, with revenue of nearly US$85 million and an average enterprise customer spend of $800,000.
However, outages such as this have repercussions way beyond a company’s books. A brand’s reputation can be severely damaged, which could lead to a slowdown in new customer sign-ups and renewals, or even to contracts being cancelled.
What has Fastly learnt from the error? Rockwell explained that in the short term:
- The company is deploying the bug fix across its network as quickly and safely as possible.
- Fastly is conducting a complete post mortem of the processes and practices it followed during this incident.
- The business will figure out why it did not detect the bug during its software quality assurance and testing processes.
- It will evaluate ways to improve its remediation time.
“We have been – and will continue to – innovate and invest in fundamental changes to the safety of our underlying platforms,” Rockwell said. “Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up. We’ll continue to update our community as we make progress toward this goal.
“This outage was broad and severe, and we are truly sorry for the impact to our customers and everyone who relies on them.
“Even though there were specific conditions that triggered this outage, we should have anticipated it. We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologise to our customers and those who rely on them for the outage and sincerely thank the community for its support.”
The Tech Capital’s Fact Checker is a project that seeks to shine a light on stories that are caus...