Disk failure rates - help me understand this

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
I am trying to look into my crystal ball to see when my shucked HGST HE8 may fail.

What I know is:
- WD Red and other "prosumer" drives are rated at 10^-14
- HGST HE8 is rated at 10^-15
- "Anecdotal" evidence from forum users has WD Red fail at 40k hours, give or take
- I don't write a whole bunch, just two to four daily backups in the xG to xxG range each, plus whatever Plex does, which is probably, actually, more than that. That thing likes to write.

So ... 40k hours is what, 4.5 years? Am I expecting my HGST to last an order of magnitude longer? Are these hard drives likely to outlast my fleshy vessel?

Thanks! Just idle crystal ball gazing on a Thursday morning :).
 
Joined
Jan 4, 2014
Messages
1,644
45 years?! That would indeed be impressive. Somewhat suspicious though. I had a look at the datasheets and the figures you're quoting are for non-recoverable read error rates.

The MTBF figures for the HGST HE8 are 2.5 times that of the WD Reds, so 13.5 years for a drive failure is probably more realistic. Interesting that this isn't reflected in warranty terms - 3 years for WD Red and just 5 years for HGST HE8.

I'm not sure I understand the MTBF figures though. For the WD Reds it's quoted at 1M hours = 1000k hours. The reality is the 40k hours you've quoted. How do you reconcile this?

 
Joined
Jan 4, 2014
Messages
1,644
It seems it's not possible to directly reconcile the 1000k hours MTBF with the observed 40k hours to failure for WD Reds. This extract from a Wiki article on failure rates explains why:

In practice, the mean time between failures (MTBF, 1/λ) is often reported instead of the failure rate. This is valid and useful if the failure rate may be assumed constant – often used for complex units / systems, electronics – and is a general agreement in some reliability standards (Military and Aerospace). It does in this case only relate to the flat region of the bathtub curve, which is also called the "useful life period". Because of this, it is incorrect to extrapolate MTBF to give an estimate of the service lifetime of a component, which will typically be much less than suggested by the MTBF due to the much higher failure rates in the "end-of-life wearout" part of the "bathtub curve".
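
To put a number on that: if you take the constant failure rate assumption at face value, the probability that a drive is still running after t hours is exp(-t / MTBF). Below is a minimal Python sketch of that model. The function name is my own, the 40k hours is just the anecdotal figure from the first post, and the model deliberately ignores the wear-out end of the bathtub curve.

```python
import math


def survival_probability(hours: float, mtbf_hours: float) -> float:
    """Chance a drive is still working after `hours`, assuming a constant
    failure rate of 1/MTBF (the flat part of the bathtub curve only)."""
    return math.exp(-hours / mtbf_hours)


if __name__ == "__main__":
    # Datasheet MTBF figures quoted in this thread, in hours.
    for label, mtbf in (("1M hour MTBF", 1_000_000), ("2.5M hour MTBF", 2_500_000)):
        p = survival_probability(40_000, mtbf)
        print(f"{label}: {p:.1%} of drives expected to reach 40k hours")
```

Under this model a 1M hour MTBF means roughly 96% of drives would still be running at 40k hours. The MTBF describes a failure rate during useful life, not how long a drive lasts.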
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Thanks, this is starting to make more sense! 12-14 years for drive failure would still be quite amazing, and I won't be banking on it. They're coming up on a mere 2 years this fall.
 
Joined
Jan 4, 2014
Messages
1,644
Yorick said: "Thanks, this is starting to make more sense! 12-14 years for drive failure would still be quite amazing, and I won't be banking on it. They're coming up on a mere 2 years this fall."
Let us know in 10 years time how the drives are faring :wink:
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, the first thing to keep in mind is that the figures that HDD manufacturers provide are not realistic. With that out of the way, the 10^-14 rate generally refers to sector error rate, independently of other issues. I'm not sure what typically goes into that, but I'd imagine it's a purely statistical argument based on a standard noise distribution for magnetic media, the signal level and error correction in use.
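
For a feel of what the spec sheet number would mean if you took it at face value, here is a rough Python sketch. It assumes the rate is errors per bit read, that errors are independent (hence the Poisson approximation), and an 8 TB drive; all three are my assumptions for illustration, not anything the datasheet promises.

```python
import math


def expected_read_errors(bytes_read: float, errors_per_bit: float) -> float:
    """Expected number of unrecoverable read errors for a given amount read."""
    return bytes_read * 8 * errors_per_bit


def prob_at_least_one_error(bytes_read: float, errors_per_bit: float) -> float:
    """Poisson approximation: treats every bit error as an independent event."""
    return 1.0 - math.exp(-expected_read_errors(bytes_read, errors_per_bit))


DRIVE_BYTES = 8e12  # assumed capacity: 8 TB, read once end to end

if __name__ == "__main__":
    for rate in (1e-14, 1e-15):
        expected = expected_read_errors(DRIVE_BYTES, rate)
        chance = prob_at_least_one_error(DRIVE_BYTES, rate)
        print(f"rate {rate:g}: {expected:.3f} expected errors, "
              f"{chance:.1%} chance of at least one per full read")
```

Either way, this is a figure about read errors per amount of data read, not about when the drive as a whole dies.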
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
If I am not mistaken, the "non-recoverable read error rate", or its more common equivalent "Bit Error Rate", corresponds only to the electrical interface, such as SATA, and is statistically derived.
MTBF is also a statistically derived number, one that involves the electronic and mechanical parts.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If they're interface error rates, that's an even more useless figure. But that doesn't sound right, since some disks claim to be better.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
A better way to analyze disk reliability is the Annualized Failure Rate (AFR), the percentage of drives that fail in each year of operation.

Drive AFRs have been improving over the years as quality has improved. If you look at the stats from Backblaze, AFRs have dropped well below 2% and even below 1%. https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2020/

However, a 1% AFR doesn't imply a 100-year life. Drives wear out, so most vendors only measure or estimate AFR over the first 5 years of a drive's life. It's hard to find stats on how long drives stay reliable after 5 years, and extremely hard to find them for drives you have only just acquired, for obvious reasons.
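
As a rough illustration of why a 1% AFR doesn't stretch to a 100-year life, here is a small Python sketch that compounds a per-year AFR into the fraction of a fleet still alive. The flat 1% comes from the paragraph above; the rising numbers after year 5 are entirely made up to show the shape of the effect, not measured data for any drive.

```python
def cumulative_survival(afr_by_year):
    """Fraction of the original fleet still alive at the end of each year,
    given the annualized failure rate that applies in each year."""
    alive = 1.0
    fractions = []
    for afr in afr_by_year:
        alive *= 1.0 - afr
        fractions.append(alive)
    return fractions


# Flat 1% AFR forever (what naive extrapolation assumes).
flat = [0.01] * 10
# Hypothetical wear-out: 1% for five years, then invented rising rates.
wearout = [0.01] * 5 + [0.03, 0.06, 0.10, 0.15, 0.20]

if __name__ == "__main__":
    rows = zip(cumulative_survival(flat), cumulative_survival(wearout))
    for year, (f, w) in enumerate(rows, start=1):
        print(f"year {year:2d}: flat {f:.1%}  with wear-out {w:.1%}")
```

The two curves are identical through year 5, at about 95% surviving, and only diverge in the years the vendor figures don't cover.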

If anyone has stories about large sets of drives and how long they really lasted, it would be fun to hear them. I suspect they are like cars: some of them will do a lot of miles if treated well.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Ericloewe said: "If they're interface error rates, that's an even more useless figure. But that doesn't sound right, since some disks claim to be better."
Not necessarily.
It says "unrecoverable", so I would think the protocol could detect the error and request the data again and all will be well.

 

kherr

Explorer
Joined
May 19, 2020
Messages
67
........ not to mention that the MTBF hrs are true BS ..... how can they claim that even on a 1M hr drive the MEAN time is 114 YEARS ....... I'd like them to defend that in court. When you realize that a MEAN # is found by listing ALL the failure times in a column, highest to lowest, and the figure that's in the MIDDLE of the list, is the MEAN .... pure science fiction .....
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
kherr said: "........ not to mention that the MTBF hrs are true BS ..... how can they claim that even on a 1M hr drive the MEAN time is 114 YEARS ....... I'd like them to defend that in court. When you realize that a MEAN # is found by listing ALL the failure times in a column, highest to lowest, and the figure that's in the MIDDLE of the list, is the MEAN .... pure science fiction ....."
That's the median. The mean is sum(elements) / number of elements.
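
A quick Python illustration of the difference, using the standard `statistics` module. The failure times are invented for the example; the point is only that one long-lived outlier drags the mean far away from the middle value.

```python
import statistics

# Made-up failure times in hours for five drives; one lasts far longer.
failure_hours = [10_000, 20_000, 30_000, 40_000, 400_000]

# Mean: sum of the elements divided by the number of elements.
mean_hours = statistics.mean(failure_hours)

# Median: the middle value once the list is sorted.
median_hours = statistics.median(failure_hours)

print(f"mean   = {mean_hours} hours")
print(f"median = {median_hours} hours")
```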
 
Joined
Jan 4, 2014
Messages
1,644
Ericloewe said: "The mean is sum(elements) / number of elements."
I'm surprised! That's the first time I've heard you say mean things to anyone. :wink:
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
The drive industry internally uses FIT rates: Failures In Time, counted per billion operating hours. In practice, 1B hours approximates 20,000 devices operating for 50,000 hours (about 5 years).

After working out a FIT rate, the MTBF is then calculated as 1 billion hours / FIT. This gives a nice comforting marketing number which is honest, but misleading because of the "bathtub" failure effect. Failure rates increase after 5 years, but that increase is not covered by the MTBF specification.

So, a 100-year MTBF is best thought of as typically 1 failure in 20 drives over 5 years.

Or Annualized Failure Rate = 1%
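
The conversions above can be sketched in a few lines of Python. The function names are mine, and the AFR uses the simple linear approximation (hours per year divided by MTBF), which is only reasonable while the rate is small and the drive is in its flat useful-life period.

```python
HOURS_PER_YEAR = 8760      # 24 * 365
BILLION_HOURS = 1e9


def mtbf_from_fit(fit: float) -> float:
    """MTBF in hours from a FIT rate (failures per billion operating hours)."""
    return BILLION_HOURS / fit


def fit_from_mtbf(mtbf_hours: float) -> float:
    """FIT rate from an MTBF in hours."""
    return BILLION_HOURS / mtbf_hours


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Approximate annualized failure rate as a fraction (0.01 means 1%)."""
    return HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    for mtbf in (100 * HOURS_PER_YEAR, 1_000_000, 2_500_000):
        print(f"MTBF {mtbf:>9,} h: FIT {fit_from_mtbf(mtbf):7.1f}, "
              f"AFR {afr_from_mtbf(mtbf):.2%}")
```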
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Thank you for that explanation!

The He8 with its 2.5M hour MTBF has, then, a FIT of 400, which translates to 1 failure in 50 drives in 5 years. If I got my math right.

And that says nothing about what happens after 5 years. They clearly wear out a little slower than a drive with 1M MTBF (5 year warranty instead of 3), but that doesn’t mean they’ll last 12 years for the bulk of them. For all I know they’ll all start failing around the 7 year mark.

I’ll let you know in 10 how many of my eight survived :).
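
For what it's worth, here is a back-of-the-envelope Python sketch for a set of eight drives, taking the 1-in-50 over 5 years figure at face value and treating the drives as independent (both are assumptions, and drives from one batch in one chassis may well not be independent). It uses a plain binomial distribution and says nothing about years 6 and up.

```python
from math import comb


def prob_k_failures(n_drives: int, k: int, p_fail: float) -> float:
    """Binomial probability that exactly k of n independent drives fail."""
    return comb(n_drives, k) * p_fail**k * (1.0 - p_fail) ** (n_drives - k)


N_DRIVES = 8
P_FAIL_5_YEARS = 1 / 50  # the 1-in-50 over 5 years estimate from this post

if __name__ == "__main__":
    for k in range(3):
        p = prob_k_failures(N_DRIVES, k, P_FAIL_5_YEARS)
        print(f"P(exactly {k} of {N_DRIVES} fail within 5 years) = {p:.1%}")
```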
 
Joined
Jan 4, 2014
Messages
1,644
FIT, MTBF, AFR... Being the devil's advocate here, but it all seems like smoke and mirrors for manufacturers to hide behind and keep statisticians employed. Why not be open and just quote something like expected or average 'service life'? It would be far more honest and upfront.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
"FIT, MTBF, AFR... Being the devil's advocate here, but it all seems like smoke and mirrors for manufacturers to hide behind and keep statisticians employed. Why not be open and just quote something like expected or average 'service life'? It would be far more honest and upfront."

When a product is first built and has had no field experience, it's very difficult to estimate the actual lifetime. It's the same problem with cars. If they gave us a lifetime number, I would not trust it unless there was a financial guarantee. They do warranty their gear for a finite lifetime... I trust that.
 