In this week’s bulletin, Charlie looks at how the AWS outage shows that “it does actually happen” despite all our layers of defence and resilience – a stark reminder of the close coupling of systems, where hidden dependencies can align to trigger failure on a global scale.

This week, while preparing for an exercise, I saw the BBC News coverage of the Amazon outage. Given that my exercise scenario focused on the loss of a critical payment system at a financial institution, the timing proved particularly fitting. It reminded all the participants of the fragility of our internet infrastructure, and that the scenario we dismiss with “it wouldn’t happen, we have this in place…” does actually happen despite all the layers of defence and resilience. It also serves as an example of the close coupling of systems I have written about in previous bulletins. Close coupling refers to building new systems on top of existing ones, often without anyone holding a complete overview of the resulting structure. This creates inherent vulnerabilities that only become apparent when specific conditions align to cause a failure.

The AWS outage on 20 October exposes a hard truth: your disaster recovery strategy likely can’t handle what just happened to over 1,000 organisations worldwide. Here’s what went wrong and what I think you should consider this coming week.

The DNS single point of failure 

Amazon’s US-EAST-1 region in Virginia, its largest cluster of data centres, experienced what AWS called a “latent race condition” in its DNS system. Critical processes that store and manage Domain Name System records fell out of sync, triggering automated failures across the network.[1]
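
To see why this kind of fault can lurk unnoticed for years, the toy sketch below shows the general shape of a race condition in record management: two uncoordinated processes update the same record, a delayed update re-applies a stale value, and a cleanup step then deletes what it believes is old data – leaving an empty record that nothing can resolve. This is an illustration of the concept only, written in Python with invented names; it is not Amazon’s implementation.

```python
# Toy illustration of a race condition in record management - NOT Amazon's code.
# A stale, delayed update overwrites a newer value, and a cleanup job then
# removes what it believes is old data, leaving the record empty.
import threading
import time

record = {}  # stands in for a DNS record store

def apply_plan(plan, delay):
    """Apply an update 'plan' after a delay; the write is unsynchronised."""
    time.sleep(delay)
    record["endpoint"] = plan

def cleanup(stale_plan, delay):
    """Delete the record if it still holds what we *believe* is a stale plan."""
    time.sleep(delay)
    if record.get("endpoint") == stale_plan:   # check-then-act: the race window
        record.pop("endpoint", None)           # but this is actually the live value

apply_plan("plan-1", 0)                                          # plan-1 goes live
workers = [
    threading.Thread(target=apply_plan, args=("plan-2", 0.1)),   # newer plan applied
    threading.Thread(target=apply_plan, args=("plan-1", 0.3)),   # delayed, stale re-apply
    threading.Thread(target=cleanup, args=("plan-1", 0.5)),      # removes the "old" plan-1
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(record if record else "record empty - lookups would now fail")
```

Each step is harmless on its own; it takes this particular ordering, under real-world load and latency, for the fault to surface – which is why “latent” is the operative word.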

“It’s always DNS!” is a running joke among tech professionals, because this mundane class of problem causes disproportionate havoc.[2] The outage wasn’t a cyber attack or a dramatic failure – it was routine infrastructure falling out of sync through an unlikely sequence of events.

Action: Map your DNS dependencies now. Run a tabletop exercise asking: what happens when your DNS provider fails? Most organisations can’t answer this. 
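
As a starting point for that mapping, here is a minimal sketch – assuming Python with the dnspython package installed, and with an illustrative domain list you would replace with your own – that reports which authoritative nameservers, and therefore which DNS providers, each critical domain depends on.

```python
# Minimal sketch: list the authoritative nameservers behind critical domains,
# to show which DNS providers you actually depend on.
# Assumes the dnspython package (pip install dnspython); the domain list is illustrative.
import dns.resolver

CRITICAL_DOMAINS = ["example.com", "payments.example.com"]  # replace with your own

for domain in CRITICAL_DOMAINS:
    try:
        answers = dns.resolver.resolve(domain, "NS")
        nameservers = sorted(rr.target.to_text() for rr in answers)
        print(f"{domain}: {', '.join(nameservers)}")
    except dns.resolver.NoAnswer:
        # Subdomains often carry no NS records of their own; check the parent zone.
        print(f"{domain}: no NS records at this level (inherits the parent zone)")
    except Exception as exc:
        print(f"{domain}: lookup failed ({exc})")
```

If two supposedly independent services come back pointing at the same provider, you have found exactly the kind of hidden coupling this outage exposed – and a useful injection for the tabletop exercise.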

The concentration risk regulators warned about 

The UK government holds 189 AWS contracts worth £1.7bn, with 35 public sector authorities dependent on AWS across 41 active contracts.[3] The irony? As technology partner Tim Wright noted: “The FCA and PRA have repeatedly highlighted the dangers of concentration risk in cloud service provision for regulated entities for a number of years.”[3] 

The Treasury committee has now written to ask why Amazon hasn’t been designated a “critical third party” to UK financial services – a designation that would bring it under direct regulatory oversight.[3] 

Over 2,000 companies were affected globally.[3] Lloyds Bank customers couldn’t access services until mid-afternoon. HMRC was disrupted.[1] Airlines faced check-in delays.[4] Even smart beds overheated and got stuck in inclined positions when Eight Sleep’s internet-connected mattresses lost connectivity.[1,5] 

Professor Brent Ellis called it “nested dependency” – where popular platforms rely on technical underpinnings controlled by just a few providers. “Even small service outages can ripple through the global economy.”[6] 

The migration problem: companies face “prohibitively high” costs to move data away from AWS once it is embedded there. Stephen Kelly of Cirata noted that the explosion of enterprise data held with single providers makes switching vendors financially unrealistic.[6] 

Action: Document AWS dependencies in your supply chain risk register. Which critical third parties in your ecosystem run on AWS? Add this to your due diligence checks – you may find that both you and your supplier share the same Amazon region as a single point of failure.
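
One way to get a first-pass answer is to resolve the endpoints you and your suppliers rely on and check them against Amazon’s published IP ranges. The sketch below is illustrative only – standard-library Python, with placeholder supplier hostnames – and IP-based checks will miss services fronted by CDNs or other intermediaries, so treat it as a prompt for due-diligence questions rather than proof.

```python
# Illustrative sketch: flag supplier endpoints that resolve into AWS address space,
# using Amazon's published ip-ranges.json. Hostnames below are placeholders.
import ipaddress
import json
import socket
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def load_aws_ranges():
    """Download Amazon's published IPv4 prefixes, paired with their region."""
    with urllib.request.urlopen(AWS_RANGES_URL) as resp:
        data = json.load(resp)
    return [(ipaddress.ip_network(p["ip_prefix"]), p["region"]) for p in data["prefixes"]]

def aws_region_for(hostname, ranges):
    """Return the AWS region a hostname resolves into, or None if outside AWS space."""
    ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    for network, region in ranges:
        if ip in network:
            return region
    return None

if __name__ == "__main__":
    ranges = load_aws_ranges()
    # Placeholder endpoints - substitute the critical services in your own supply chain.
    for host in ["api.supplier-one.example", "portal.supplier-two.example"]:
        try:
            region = aws_region_for(host, ranges)
        except socket.gaierror:
            print(f"{host}: does not resolve (placeholder hostname?)")
            continue
        print(f"{host}: {'AWS ' + region if region else 'not in published AWS ranges'}")
```

If several of your critical suppliers land in the same region, that concentration belongs in the risk register.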

The recovery reality check 

The outage began at 8am BST. Some services recovered within hours. Others, including Lloyds, Venmo, and Reddit, experienced problems until mid-afternoon.[1] Full restoration took approximately 15 hours.[7] 

Professor Mike Chapple from Notre Dame identified “cascading failures” during recovery: “It’s like a large-scale power outage. The power might flicker a few times.” He suggested Amazon initially “only addressed the symptoms” rather than the root cause.[7] 

This matters because your Recovery Time Objectives almost certainly assume faster vendor resolution. Delta Air Lines is still pursuing more than $500m in losses from CrowdStrike over a year after that incident, partly because it had to manually reset 40,000 servers even after the vendor fixed the problem.[7] 

Action: Review your RTOs. What’s your actual recovery capability when your cloud provider says “investigating”? When they say “resolved” but cascading failures continue? Are your RTOs realistic?  
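
As a rough way to pressure-test the answer, the back-of-the-envelope sketch below compares a planned RTO against a plausible end-to-end recovery timeline. Every figure except the roughly 15-hour restoration time is an assumption you would replace with your own estimates.

```python
# Back-of-the-envelope check: does a realistic recovery timeline fit inside the RTO?
# All figures are illustrative assumptions apart from the ~15-hour AWS restoration.
rto_hours = 4.0                  # what the continuity plan promises the business
vendor_restoration_hours = 15.0  # roughly how long full AWS restoration took on 20 October
internal_recovery_hours = 3.0    # assumed time to restart, reconcile and validate your own systems
cascade_allowance_hours = 2.0    # assumed buffer for "resolved" services that flicker back down

plausible_recovery = vendor_restoration_hours + internal_recovery_hours + cascade_allowance_hours
shortfall = max(0.0, plausible_recovery - rto_hours)

print(f"Planned RTO: {rto_hours:.0f}h | plausible recovery: {plausible_recovery:.0f}h | "
      f"breach by: {shortfall:.0f}h")
```

If the shortfall is measured in hours rather than minutes, the honest answer to “are your RTOs realistic?” is probably no – at least for a vendor-side failure of this scale.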

What companies should have done differently 

Professor Ken Birman from Cornell was blunt: “Companies using Amazon haven’t been taking enough adequate care to build protection systems into their applications.”[7] 

The appeal of hyperscalers is clear – no hefty server costs, fluctuating traffic handled seamlessly, enhanced cyber-security.[6] But as Professor Madeline Carr noted: “Assuming they are too big to fail or inherently resilient is a mistake, with the evidence being the current outage and past ones.”[6] 

Three things to do next week: 

  1. Review your exposure to DNS failure – not just multi-zone within the same region. US-EAST-1 took down “distributed” architectures because they shared DNS infrastructure.
  2. Run a supplier dependency audit – map which critical services (yours and your third parties’) depend on AWS, Azure, or Google Cloud.
  3. Challenge your RTO assumptions and manual workaround abilities – ask your incident management team: is a data centre failure liable to cause us to breach our RTOs or Impact Tolerances? What manual workarounds exist, and are they documented?

As Dr Corinne Cath-Speth from Article 19 stated: “The infrastructure underpinning democratic discourse, independent journalism and secure communications cannot be dependent on a handful of companies.”[8] 

Your next exercise should test what happens when that handful fails. 

References: 

[1] BBC News (2025). “Amazon explains cause of major AWS outage”. Available at: https://www.bbc.co.uk/news/articles/cvgvnp77dy9o

[2] BBC News (2025). “Amazon Web Services had ‘a bad day’ – what went wrong?”. Available at: https://www.bbc.co.uk/news/articles/cev1en9077ro 

[3] The Guardian (21 October 2025). “‘Significant exposure’: Amazon Web Services outage exposed UK state’s £1.7bn reliance on tech giant”. Available at: https://www.theguardian.com/technology/2025/oct/21/significant-exposure-amazon-web-services-outage-exposed-uk-states-17bn-reliance-on-tech-giant

[4] GovTech (2025). “What Caused the Major Amazon Outage on Monday?”. Available at: https://www.govtech.com/question-of-the-day/what-caused-the-major-amazon-outage-on-monday 

[5] Franceschetti, M. [@m_franceschetti] (2025). “The AWS outage has impacted some of our users since last night”. X (formerly Twitter). Available at: https://x.com/m_franceschetti/status/1980419272766583262 

[6] BBC News (2025). “What the Amazon AWS outage tells us about the state of the internet”. Available at: https://www.bbc.co.uk/news/articles/c0jdgp6n45po 

[7] BBC News (2025). “Amazon says all services restored after major outage”. Available at: https://www.bbc.co.uk/news/articles/c20pgp3nx07o 

[8] The Guardian (20 October 2025). “Amazon Web Services AWS outage hits dozens of websites and apps”. Available at: https://www.theguardian.com/technology/2025/oct/20/amazon-web-services-aws-outage-hits-dozens-websites-apps 
