
"2.5 million hours MTBF"

How do they calculate that? That works out as 285 years!

Half of these drives will last 285 years of constant operation. I find that hard to believe.




It’s the predicted statistical failure rate in aggregate for the drive model. Two drives running for 24 hours count as 48 hours for MTBF purposes, so divide the MTBF by the number of drives in service. 1,000 drives would expect one failure about every 104 days. 10,000 drives would expect one failure about every week and a half. It’s not “half these drives will run 2.5 million hours.” I agree the “mean” is confusing until you know that. It also doesn’t predict low quantities well, including 1, for obvious reasons. The larger the quantity gets, the better the prediction tends to get IME. Backblaze has published work on this.
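
A quick sketch of that scaling, for the curious (Python; the 2.5M-hour MTBF is the headline figure, the fleet sizes are the ones above):

    # Expected time between failures across a fleet scales as MTBF / fleet size.
    MTBF_HOURS = 2_500_000  # headline figure for this drive model

    for fleet in (1, 1_000, 10_000):
        days = MTBF_HOURS / fleet / 24
        print(f"{fleet:>6} drives: one failure every {days:,.1f} days on average")

    # 1 drive:  one failure every 104,166.7 days (~285 years, the absurd-looking n=1 case)
    # 1,000:    one failure every ~104 days
    # 10,000:   one failure every ~10.4 days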

Failure rate is also independent of lifetime. Drives have other measurements for lifetime.


So, in layman's terms, if I buy two of these and mirror them, is it a pretty safe bet they'll last my whole life?


No, because design lifetime is distinct from failure rate. Failure rate is just that: a predicted rate of failure (not lifetime) in aggregate for a model within design lifetime. Beyond lifetime, all bets are off. Think of this MTBF as saying “your two drives probably won’t fail within lifetime. A significant number of your 10,000 will.”

Regarding longevity, the predicted lifetime of a drive is often close to its warranty. You will sometimes experience no issues exceeding design lifetime, and sometimes drives immediately explode. I’ve seen both, from four-year-lifetime drives entering year 13 of continuous service to drives buying the farm one day past lifetime, right after the SMART wear indicator fired.

As drives age, mechanical disruption becomes a much bigger deal. That rack of 13-year drives is probably one earthquake or heavy walker away from completely dead in every U. Even power loss, including from a regular shutdown, will probably permanently end a drive that’s far beyond lifetime. That’s the danger in a 24x7 server setting if you’re not monitoring SMART wear indicators (even if you are, really): power cycling your rack can, and does, trigger multiple hardware failures. All the time. If all the drives in it were from the same batch, installed at the same time, and equally far past lifetime, it’s very possible for the whole rack to fail when cycled. I have actually heard of this happening, once.

MTBF is unexpected failure. Design lifetime is expected failure.
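
To put rough numbers on “your two drives probably won’t fail within lifetime, a significant number of 10,000 will”: here’s a minimal sketch that models failures as a constant-rate Poisson process at the MTBF rate (a simplification that ignores wear-out), assuming a five-year design lifetime:

    import math

    MTBF_HOURS = 2_500_000
    LIFETIME_HOURS = 5 * 8_766  # assumed five-year design lifetime

    for fleet in (2, 10_000):
        expected = fleet * LIFETIME_HOURS / MTBF_HOURS
        p_any = 1 - math.exp(-expected)
        print(f"{fleet:>6} drives: expect {expected:.2f} failures, "
              f"P(at least one) = {p_any:.1%}")

    # 2 drives:      expect 0.04 failures, P(at least one) ~ 3.4%
    # 10,000 drives: expect ~175 failures, P(at least one) ~ 100%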


I've had a nice rack of discs (small by today's standards, and AFAIK well within any MTBF) wrecked when some junior decided to see what the phased-shutdown button in the machine room did, in the middle of the bank's trading day. It cut off half the power to my cabinet at one point, which SCSI doesn't protect against.

There's always events, dear boy, events!


Nooo. MTBF is an engineering term of art.

MTBF lets you compare between two drives intended for a similar purpose.

For something that you own at home, the only medium that will last your lifetime in loosely controlled environmental conditions is paper.


High-quality paper. A lot of places run into the problem that paper records they're required to keep for 10 years become unreadable in less than 10 years.

Archival CDs might also last a lifetime (regular CDs certainly won't).

But I agree with your point that the MTBF of a single unit is not a way to predict its lifetime.


That's a great point.

It dawned on me when some older relatives died a few years ago that their papers, some very old, survive with a modicum of careful handling. My grandfather's immigration papers, my great-grandmother's portrait on her wedding day, etc.

Today, many of us are one expired credit card away from losing all of that in digital form.


What's an acceptable low-number aggregate? 10? 100?


When the calculated result starts exceeding what’s reasonable or stops making sense. It varies, and it’s a gut feeling. The more you have, the better.

Having 100 of this model would predict a failure about every three years (but that doesn’t mean it’d take 300 years to fail all of them). I’d be suspicious of a three-year calculation, but it very well might turn out to be accurate. Remember that lifetime and failure rate are independent metrics. They’re warranted for five years, which is probably close to their predicted lifetime, and one drive of 100 failing in that time is certainly plausible.

Having 10 basically implies none of them will fail within a five-year lifetime. Again, plausible, but I think less likely.
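
The same back-of-the-envelope Poisson arithmetic, applied to those two fleet sizes over a five-year window (again assuming the headline MTBF and a constant failure rate):

    import math

    MTBF_HOURS = 2_500_000
    FIVE_YEARS_HOURS = 5 * 8_766

    for fleet in (10, 100):
        interval_years = MTBF_HOURS / fleet / 8_766
        expected = fleet * FIVE_YEARS_HOURS / MTBF_HOURS
        print(f"{fleet:>3} drives: one failure every {interval_years:.2f} years, "
              f"P(no failures in 5 years) = {math.exp(-expected):.0%}")

    # 10 drives:  one failure every 28.52 years, P(none in 5y) ~ 84%
    # 100 drives: one failure every 2.85 years,  P(none in 5y) ~ 17%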


I assume you take the first batch of N drives, run them for K weeks, see how many fail, and extrapolate.

The methodology ought to be published, though.


It is. There’s a JEDEC standard for SSDs and probably one for spinners. Different companies use different write loads to find it. You’re pretty bang on, actually, but some vendors test for years.
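
At its crudest, the extrapolation is just total device-hours divided by observed failures. A sketch with made-up test parameters (real JEDEC-style qualification is far more involved than this):

    # Crude MTBF point estimate from a burn-in run; all numbers hypothetical.
    drives_under_test = 1_000
    test_weeks = 12
    failures_observed = 2

    device_hours = drives_under_test * test_weeks * 7 * 24
    mtbf_estimate = device_hours / failures_observed
    print(f"{device_hours:,} device-hours / {failures_observed} failures "
          f"= {mtbf_estimate:,.0f} hours MTBF")
    # 2,016,000 device-hours / 2 failures = 1,008,000 hours MTBF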



