How a Small Bug Caused AWS’s Massive Outage

A small glitch in Amazon Web Services (AWS) caused a major internet outage on Monday. It took down some of the world’s biggest apps and services.

AWS said the problem started when two automated systems tried to update the same data at the same time. That tiny bug turned into a system-wide failure. Engineers worked nonstop to fix it.

Millions of users around the world were affected. People couldn’t order food, use banking apps, or connect to hospital and smart home systems. Big brands like Netflix, Starbucks, and United Airlines also went offline, leaving customers unable to access their services.

“We apologize for the impact this event caused our customers,” Amazon said on its AWS site. “We know it affected many people in significant ways, and we’re committed to learning from it and improving reliability.”

What Went Wrong

At the heart of the issue was a conflict between two AWS programs that tried to write to the same DNS entry, the digital equivalent of an entry in the internet’s phonebook. When both systems tried to update the record simultaneously, they accidentally erased it, creating an “empty entry.”

Without that key entry, other AWS services couldn’t locate each other, similar to losing contact numbers in a phonebook.

“The phonebook analogy fits well,” said Angelique Medina, head of Cisco’s ThousandEyes Internet Intelligence service. “The people are still there, but if you can’t find their number, communication breaks down. That’s exactly what happened here.”
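
For readers who want to see how two well-meaning updaters can leave a record empty, here is a minimal Python sketch. It is an illustration only, not AWS's code: the table, the function names, the hostname, and the version numbers are all made up, and the order of events is an assumption based on the description above.

```python
# Hypothetical in-memory stand-in for a DNS table: name -> (plan_version, addresses).
dns_table = {}

def apply_plan(name, version, addresses):
    """Write a plan's addresses without checking whether a newer plan is already live."""
    dns_table[name] = (version, addresses)

def clean_up_older_than(name, newest_version):
    """Delete the entry if it belongs to a plan older than newest_version."""
    entry = dns_table.get(name)
    if entry is not None and entry[0] < newest_version:
        del dns_table[name]

NAME = "db.example.internal"  # made-up hostname

# The fast updater applies the newest plan (version 2).
apply_plan(NAME, 2, ["10.0.0.2"])
# The slow updater, running late, applies its stale plan (version 1) on top.
apply_plan(NAME, 1, ["10.0.0.1"])
# The fast updater then cleans up anything older than version 2.
clean_up_older_than(NAME, 2)

print(dns_table.get(NAME))  # None: the entry is gone
```

Each step is reasonable on its own. The record disappears only because the stale write lands between the fresh write and the cleanup, which is what makes this kind of bug hard to spot in testing.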

A Chain Reaction Across AWS

The empty DNS entry disrupted DynamoDB, AWS’s main database service, which then caused a chain reaction affecting other core systems—including EC2, which powers virtual servers, and Network Load Balancer, which helps distribute online traffic.

Once DynamoDB came back online, EC2 attempted to restart all servers at once—overwhelming the system and delaying recovery.
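
A common way to avoid this kind of restart stampede is to bring servers back in small batches with a little random delay. The Python sketch below illustrates that general idea. It does not describe how EC2 actually works, and the function name, batch size, and timings are assumptions chosen for the example.

```python
import random

def staggered_schedule(server_ids, batch_size, batch_interval_s, jitter_s, seed=None):
    """Return {server_id: start_delay_seconds}, releasing servers in batches with jitter."""
    rng = random.Random(seed)
    schedule = {}
    for index, server_id in enumerate(server_ids):
        batch = index // batch_size  # which wave this server belongs to
        # Each wave starts batch_interval_s after the previous one; jitter spreads
        # the servers inside a wave so they do not all hit the backend at once.
        schedule[server_id] = batch * batch_interval_s + rng.uniform(0, jitter_s)
    return schedule

servers = [f"server-{i}" for i in range(10)]
plan = staggered_schedule(servers, batch_size=3, batch_interval_s=30, jitter_s=5, seed=1)
for server_id, delay in sorted(plan.items(), key=lambda item: item[1]):
    print(f"{server_id}: start after {delay:.1f}s")
```

The trade-off is that the last servers come back later than they would in an all-at-once restart, but the systems they depend on never see the full load in a single burst.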

To explain it simply, Professor Indranil Gupta from the University of Illinois compared the situation to two students working in the same notebook.

“The slower student writes something, but the faster student deletes it thinking it’s outdated,” he explained. “In the end, you get an empty page, just like what happened with AWS.”

Amazon’s Response and Fixes

After restoring service, Amazon engineers began implementing fixes to prevent similar issues in the future. The company is:

Fixing the “race condition” that caused the overwrite conflict.
Adding more testing and monitoring systems for critical services like EC2.
Improving recovery speed so that dependent systems don’t fail together.
Enhancing real-time communication to keep customers better informed during outages.
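
Amazon has not said exactly how it will fix the race condition. One widely used guard against this class of bug is to attach a version number to every update and reject anything older than what is already stored. The Python sketch below shows that idea. The class and its fields are hypothetical, and it is not a description of Amazon's fix.

```python
import threading

class VersionedRecord:
    """Hypothetical record that only accepts updates newer than the one it holds."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.value = None

    def update(self, version, value):
        """Apply the update only if it is newer; return True if it was applied."""
        # The lock makes the check and the write a single step, so two
        # updaters cannot interleave between them.
        with self._lock:
            if version <= self.version:
                return False  # stale write rejected
            self.version = version
            self.value = value
            return True

record = VersionedRecord()
print(record.update(2, ["10.0.0.2"]))  # True: newest plan accepted
print(record.update(1, ["10.0.0.1"]))  # False: late, older plan rejected
print(record.value)                    # ['10.0.0.2']
```

With a check like this, a delayed updater cannot overwrite newer data, so a later cleanup has nothing stale to delete.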

Experts say that while large-scale cloud outages are rare, they’re almost impossible to avoid completely.

“These things happen, like getting sick,” Gupta said. “What matters is how fast companies react and how clearly they communicate with users.”

FAQs

1. What caused the AWS outage?

Two automated systems attempted to update the same data simultaneously, resulting in a conflict that spread across multiple AWS services.

2. Which companies were affected by the AWS outage?

Major brands, including Netflix, Starbucks, and United Airlines, experienced temporary downtime.

3. How long did the AWS outage last?

The disruption lasted several hours before most services were restored.

4. What is Amazon doing to prevent this again?

AWS is fixing the bug, improving its testing systems, and adding more safeguards to avoid race condition errors.

5. Are internet-wide outages common?

They’re rare but possible. With so many services running on the cloud, even one small bug can have global effects.
