"What makes PlanetScale Metal performance so much better? With your storage and compute on the same server, you avoid the I/O network hops that traditional cloud databases require […] Every PlanetScale database requires at least 2 replicas in addition to the primary. Semi-synchronous replication is always enabled. This ensures every write has reached stable storage in two availability zones before it’s acknowledged to the client."
Isn't there a contradiction between these two statements?
My personal experience with EBS analogs in China (Aliyun, Tencent, Huawei clouds) is that every disk will experience a fatal failure or a disconnection at least once a month, at any provisioned IOPS. I don't know what makes them so bad, but I gave up running any kinds of DB workloads on them, using node local storage instead. If there are durability constrains, I would spin up Longhorn or Rook on top of local storage. I can see replicas degrade from time to time, but overall systems work (nothing too large, maybe ~50K QPS).
I wonder if you could work around this problem by having two EBS volumes on each host, and write to them both. You'd have the OS report the write was successful as soon as either drive reported success. With reads you could alternate between drives for double the read performance during happy times, but quickly detect when one drive is slow and reroute those reads to the other drive.
We could call this RAID -1.
You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs of the issue it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.
Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.
> When attached to an EBS–optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year. This means a volume is expected to experience under 90% of its provisioned performance 1% of the time. That’s 14 minutes of every day or 86 hours out of the year of potential impact. This rate of degradation far exceeds that of a single disk drive or SSD.
> This is not a secret, it's from the documentation. AWS doesn’t describe how failure is distributed for gp3 volumes, but in our experience it tends to last 1-10 minutes at a time. This is likely the time needed for a failover in a network or compute component. Let's assume the following: Each degradation event is random, meaning the level of reduced performance is somewhere between 1% and 89% of provisioned, and your application is designed to withstand losing 50% of its expected throughput before erroring. If each individual failure event lasts 10 minutes, every volume would experience about 43 events per month, with at least 21 of them causing downtime!
These are some seriously heavy-handed assumptions being made, completely disregarding the data they collect. First, the author assumes that these failure events are distributed randomly and expected to happen on a daily basis, ignoring Amazon's failure rate statement throughout a year ("99% of the time annually"). Second, they argue that in practice, they see failures lasting between 1 and 10 minutes. However, they assert that we should assume each failure will last 10 minutes, completely ignoring the severity range they introduced.
Imagine your favorite pizza company claiming to deliver on time "99% of the time throughout a year." The author's logic is like saying, "The delivery driver knocks precisely 14 minutes late every day — and each delay is 10 minutes exactly, no exceptions!". It completely ignores reality: sometimes your pizza is delivered a minute late, sometimes 10 minutes late, sometimes exactly on time for four months.
As a company with useful real-world data, I expect them not to make arguments based on exaggerations but rather show cold, hard data to back up their claims. For transparency, my organization has seen 51 degraded EBS volume events in the past 3 years across ~10,000 EBS volumes. Of those events, 41 had a duration of less than one minute, nine had a duration of two minutes, and one had a duration of three minutes.
8 Comments
samlambert
we have a lot more content like this on the way. if anyone has feedback or questions let us know.
reedf1
If you can detect EBS failure better than Amazon – I'd be selling this to them tomorrow.
waynesonfire
[flagged]
mstaoru
"What makes PlanetScale Metal performance so much better? With your storage and compute on the same server, you avoid the I/O network hops that traditional cloud databases require […] Every PlanetScale database requires at least 2 replicas in addition to the primary. Semi-synchronous replication is always enabled. This ensures every write has reached stable storage in two availability zones before it’s acknowledged to the client."
Isn't there a contradiction between these two statements?
My personal experience with EBS analogs in China (Aliyun, Tencent, Huawei clouds) is that every disk will experience a fatal failure or a disconnection at least once a month, at any provisioned IOPS. I don't know what makes them so bad, but I gave up running any kind of DB workload on them and use node-local storage instead. If there are durability constraints, I would spin up Longhorn or Rook on top of local storage. I see replicas degrade from time to time, but overall the systems work (nothing too large, maybe ~50K QPS).
jewel
I wonder if you could work around this problem by having two EBS volumes on each host and writing to both. You'd have the OS report the write as successful as soon as either drive reported success. With reads, you could alternate between drives for double the read performance during happy times, but quickly detect when one drive is slow and reroute those reads to the other drive.
We could call this RAID -1.
You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs in the post it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.
Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.
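A minimal sketch of the idea in Python, under loose assumptions (the `Volume` class is a hypothetical stand-in for a block device, not a real EBS or OS API): writes go to both copies and are acknowledged on the first success, while reads prefer whichever copy is currently faster and skip any copy that is still behind on that block.

```python
import threading
import time


class Volume:
    """Stand-in for a single attached block volume (hypothetical interface)."""

    def __init__(self, name, latency=0.001):
        self.name = name
        self.latency = latency  # simulated per-operation latency in seconds
        self.blocks = {}

    def write(self, block, payload):
        time.sleep(self.latency)
        self.blocks[block] = payload

    def read(self, block):
        time.sleep(self.latency)
        return self.blocks[block]


class MirroredVolume:
    """Write to both volumes, ack on the first success, track what the laggard owes."""

    def __init__(self, a, b):
        self.vols = (a, b)
        self.pending = {a: set(), b: set()}  # blocks each copy has not yet applied
        self.lock = threading.Lock()

    def write(self, block, payload):
        first_ack = threading.Event()

        def submit(vol):
            vol.write(block, payload)
            with self.lock:
                self.pending[vol].discard(block)  # this copy is now caught up
            first_ack.set()

        for vol in self.vols:
            with self.lock:
                self.pending[vol].add(block)
            threading.Thread(target=submit, args=(vol,), daemon=True).start()

        first_ack.wait()  # the client sees success as soon as either copy is durable

    def read(self, block):
        with self.lock:
            # Only consider copies that are caught up on this block, prefer the faster one.
            candidates = [v for v in self.vols if block not in self.pending[v]]
        return min(candidates, key=lambda v: v.latency).read(block)


if __name__ == "__main__":
    # "ebs-b" simulates a degraded volume; the read is served by the healthy copy.
    mirror = MirroredVolume(Volume("ebs-a", 0.001), Volume("ebs-b", 0.050))
    mirror.write("blk-1", b"hello")
    print(mirror.read("blk-1"))
```

The pending-set bookkeeping is the "accounting" mentioned above; after a crash it would have to be persisted or rebuilt, since an ack taken before the slower copy catches up leaves the mirrors divergent, much like the write hole in software RAID-1.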
QuinnyPig
I'm a sucker for deep dive cloud nerd content like this.
semi-extrinsic
Funny to see the plots with "No unit" as the y-axis label and then the actual units in parentheses in the title.
c4wrd
> When attached to an EBS–optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year. This means a volume is expected to experience under 90% of its provisioned performance 1% of the time. That’s 14 minutes of every day or 86 hours out of the year of potential impact. This rate of degradation far exceeds that of a single disk drive or SSD.
> This is not a secret, it's from the documentation. AWS doesn’t describe how failure is distributed for gp3 volumes, but in our experience it tends to last 1-10 minutes at a time. This is likely the time needed for a failover in a network or compute component. Let's assume the following: Each degradation event is random, meaning the level of reduced performance is somewhere between 1% and 89% of provisioned, and your application is designed to withstand losing 50% of its expected throughput before erroring. If each individual failure event lasts 10 minutes, every volume would experience about 43 events per month, with at least 21 of them causing downtime!
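For reference, the quoted figures do fall out of those stated assumptions mechanically. A back-of-the-envelope sketch (the 30-day month is my assumption, not something stated by AWS or the post):

```python
# Rough reconstruction of the quoted arithmetic.
budget = 0.01                                    # "1% of the time in a given year"

degraded_min_per_day = budget * 24 * 60          # ~14.4 min/day ("14 minutes of every day")
degraded_hours_per_year = budget * 24 * 365      # ~87.6 h/year (quoted as "86 hours")

event_len_min = 10                               # assumed length of every degradation event
events_per_month = 30 * degraded_min_per_day / event_len_min   # ~43 events/month

# With severity uniform between 1% and 89% of provisioned and an app that errors below
# 50%, roughly half or more of those events cross the threshold; half of ~43 gives the
# quoted "at least 21".
downtime_events = events_per_month / 2           # ~21.6

print(degraded_min_per_day, degraded_hours_per_year, events_per_month, downtime_events)
```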
These are some seriously heavy-handed assumptions that completely disregard the data they collect. First, the author assumes these failure events occur randomly and on a daily basis, ignoring that Amazon's statement is framed over an entire year ("99% of the time in a given year"). Second, they say that in practice failures last between 1 and 10 minutes, yet they then assume every failure lasts the full 10 minutes, ignoring the duration range they just introduced.
Imagine your favorite pizza company claiming to deliver on time "99% of the time throughout a year." The author's logic is like saying, "The delivery driver is precisely 14 minutes late every single day, and each delay lasts exactly 10 minutes, no exceptions!" It completely ignores reality: sometimes your pizza is a minute late, sometimes 10 minutes late, and sometimes it arrives exactly on time for four months straight.
From a company with useful real-world data, I expect arguments backed by cold, hard data rather than exaggerations. For transparency, my organization has seen 51 degraded EBS volume events over the past 3 years across ~10,000 EBS volumes. Of those events, 41 lasted less than one minute, nine lasted two minutes, and one lasted three minutes.
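Taking those reported numbers at face value, and rounding every sub-minute event up to a full minute, the total degraded time is a vanishingly small fraction of what the "99% of the time" wording would even permit across a fleet that size. A rough sketch; treating the per-volume figure as a fleet-wide budget is my simplification:

```python
# Compare the reported fleet experience against the "90% of provisioned IOPS,
# 99% of the time" wording, using the figures from the comment above.
volumes, years = 10_000, 3
observed_volume_minutes = 41 * 1 + 9 * 2 + 1 * 3    # 62 volume-minutes (upper bound)

# Degraded time the 1%-of-the-time wording would tolerate across the same fleet.
allowed_volume_minutes = volumes * years * 0.01 * 365 * 24 * 60   # ~157.7 million

print(observed_volume_minutes / allowed_volume_minutes)  # roughly 4e-7 of that budget
```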