3 out of 9 hard drives fail smartctl in less than 3 months -- what's the deal?

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
I've owned my FreeNAS system for about 4 years now. I have it set up as a RAIDZ2 configuration with 1 vdev spanning 8x5TB WD Red hard drives. I have one WD 5TB drive marked as a hot spare.

Within the last 3 months, I've had or am having 3 drive failures. First it was the hot spare (which theoretically never gets used), then it was one of the drives in the main pool. I replaced those two drives, but now I'm getting another drive failing the smartctl short test (test #1 in the self-test log below). Needless to say, I'm not happy. At this rate, all the drives will have failed within a year.

Is this common?

Code:
# smartctl -a /dev/da3
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Serial Number:    WD-WX11D86KCZF6
LU WWN Device Id: 5 0014ee 2633f0d5d
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr 12 09:56:10 2020 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 241)    Self-test routine in progress...
                    10% of test remaining.
Total time to complete Offline
data collection:         ( 4784) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 702) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   197   021    Pre-fail  Always       -       9133
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       45
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27494
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       714
194 Temperature_Celsius     0x0022   123   109   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       60%     27485         3131105000
# 2  Short offline       Completed without error       00%     27317         -
# 3  Short offline       Completed without error       00%     27149         -
# 4  Short offline       Completed without error       00%     26982         -
# 5  Short offline       Completed without error       00%     26814         -
# 6  Short offline       Completed without error       00%     26647         -
# 7  Short offline       Completed without error       00%     26479         -
# 8  Short offline       Completed without error       00%     26311         -
# 9  Short offline       Completed without error       00%     26144         -
#10  Short offline       Completed without error       00%     25994         -
#11  Extended offline    Completed without error       00%       160         -
#12  Extended offline    Completed without error       00%        38         -
#13  Conveyance offline  Completed without error       00%        25         -
#14  Short offline       Completed without error       00%        25         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
> Going by what https://www.overclockers.com/forums...ed-help-figuring-out-SMART-temperature-values explains, your drive is currently at 41 degrees C and the worst temperature it encountered was 55 degrees C.
>
> That seems warm. I’d check all drives for worst temp using the formula “(hex to dec) raw + (value - worst)”, and improve cooling.

Not sure how you're coming to that conclusion, since the line is labeled Temperature_Celsius and has a raw value of 29. That link you posted sounds like it was from some sort of Windows program trying to interpret the SMART data.
 

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
There is nothing I can do about the fans -- there are a total of 7 fans in the case and for the last month the entire case has been open. With seven fans and an open case I'd have to put the machine on ice to make it any cooler.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
> Not sure how you're coming to that conclusion since the line is labeled Temperature_Celsius and has a value of 29. That link you posted sounds like it was from some sort of windows program trying to interpret the smart data.

According to everything I can find about SMART data, the raw data in general, and the temperature data for WD drives in particular, is given in hex, not decimal.

That is a vital distinction. If you have authoritative documents on that, please share.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
> there are a total of 7 fans in the case and for the last month the entire case has been open.

I'm not sure what you mean by "the entire case has been open", but if that means that the "outer case" has been removed and the system designer's airflow (mass, direction and velocity) from the installed fans is not ensured, then I wouldn't be so sure about the cooling. "open case" suggests natural convection and radiation doing the work - as opposed to forced convection which is significantly more effective...
 
Joined
Jul 10, 2016
Messages
521
According to man smartctl, the RAW_VALUE is printed in base-10.

Code:
Each Attribute has a "Raw" value, printed under the heading
"RAW_VALUE", and a "Normalized" value printed under the heading
"VALUE".  [Note: smartctl prints these values in base-10.]  In
the example just given, the "Raw Value" for Attribute 12 would
be the actual number of times that the disk has been power-
cycled, for example 365 if the disk has been turned on once per
day for exactly one year.  Each vendor uses their own algorithm
to convert this "Raw" value to a "Normalized" value in the range
from 1 to 254.  
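
To make the base-10 point concrete, here is a minimal Python sketch that reads one row of the attribute table posted earlier in this thread and treats every numeric column as decimal. The function and class names are made up for illustration, and it assumes the ten-column layout shown above. Some drives print extra text after the raw number, in which case only the first token is used.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized VALUE column
    worst: int   # normalized WORST column
    thresh: int  # THRESH column
    raw: int     # RAW_VALUE column, printed by smartctl in base-10


def parse_attribute_row(line: str) -> SmartAttribute:
    """Parse one row of the 'Vendor Specific SMART Attributes' table.

    Assumed column order (as in the output posted above):
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split()
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),   # "063" parses as decimal 63
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=int(fields[9]),     # decimal, not hex
    )


row = "  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27494"
attr = parse_attribute_row(row)
print(attr.name, attr.raw, "hours, roughly", round(attr.raw / 24 / 365, 1), "years")
```

Read as decimal, 27494 power-on hours works out to a bit over 3 years, which lines up with the "about 4 years" of ownership mentioned in the first post.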
 
Joined
May 10, 2017
Messages
838
194 Temperature_Celsius 0x0022 123 109 000 Old_age Always - 29

Current temp is 29C, worst temp according to the attribute is 43C (123-109+29)

Also, it's easier to check max lifetime temps with smartctl -x /dev/daX and there can sometimes be a little difference between the temperature attribute worst and the max recorded lifetime temp.
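
The arithmetic above as a small Python sketch. It assumes, as this thread does, that on these WD drives the normalized value moves by one point per degree C. That is a rule of thumb, not something the smartctl man page documents, and the function name is made up for illustration.

```python
def estimate_worst_temp(value: int, worst: int, raw_celsius: int) -> int:
    """Estimate the hottest recorded temperature from attribute 194.

    Assumption (heuristic from this thread, not from smartctl docs):
    the normalized VALUE drops by 1 for every degree C the drive warms up,
    so the gap between VALUE and WORST is the gap in degrees.
    """
    return raw_celsius + (value - worst)


# Numbers from the Temperature_Celsius line quoted above:
# VALUE=123, WORST=109, RAW_VALUE=29
print(estimate_worst_temp(value=123, worst=109, raw_celsius=29))  # 43
```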
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
It's not likely to be a temperature issue, and I'm glad that @Jurgen Segaert set me straight on decimal representation.

Which leaves what? The dreaded "same batch" failure?

> I've owned my FreeNAS system for about 4 years now

That would fit with Backblaze's drive failure data, which shows failure rates increasing after 3 to 4 years, though still nothing like "3 drives out of 9". My best guess, if these drives were all bought at the same time, is that you are seeing an issue with that batch that is only now showing up at 4 years of age.
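
A back-of-the-envelope sketch of why "3 out of 9 in 3 months" looks unlike ordinary wear-out. The annual failure rate below is a hypothetical placeholder, not a Backblaze figure, and the calculation assumes drives fail independently at a constant rate, which is exactly the assumption a bad batch would break.

```python
from math import comb


def period_failure_prob(annual_failure_rate: float, years: float) -> float:
    """Chance a single drive fails within `years`, assuming a constant rate."""
    return 1 - (1 - annual_failure_rate) ** years


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial probability that at least k of n independent drives fail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


ASSUMED_AFR = 0.05  # hypothetical annual failure rate, for illustration only

p_quarter = period_failure_prob(ASSUMED_AFR, 0.25)
chance = prob_at_least(3, 9, p_quarter)
print(f"P(at least 3 of 9 fail in 3 months) = {chance:.6f}")
```

Under those assumptions the chance comes out well below one in a thousand, so either the real failure rate for these drives is much higher than assumed or the failures are not independent.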

Other ideas?
 
Joined
May 10, 2017
Messages
838
> Which leaves what? The dreaded "same batch" failure?

Possibly, or just bad luck. Another possibility would be abnormal vibrations. It's also uncommon for a disk to fail without any SMART warnings: not just pending or reallocated sectors, but even Raw_Read_Error_Rate and Multi_Zone_Error_Rate are still a perfect 0, and these usually start climbing on WD drives before one fails a SMART test.
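
A rough Python sketch of that check, run against rows copied from the output in the first post. The watch list is just the four attributes named above, and the function name is made up for illustration. It assumes the same ten-column table layout with decimal raw values.

```python
from typing import Dict

# Attributes named in this post as early warning signs on WD drives.
WATCHED = (
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Multi_Zone_Error_Rate",
)


def nonzero_watched(attribute_text: str) -> Dict[str, int]:
    """Return watched attributes whose RAW_VALUE is not zero."""
    found = {}
    for line in attribute_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a full attribute row.
        if len(fields) >= 10 and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw != 0:
                found[fields[1]] = raw
    return found


# Rows copied from the smartctl output for /dev/da3 in the first post.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
"""

print(nonzero_watched(SAMPLE) or "no watched attribute has a non-zero raw value")
```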
 

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
@Johnnie Black. I don't regularly check smartctl results other than those that show up as a warning in the FreeNAS GUI; I'm not sure if that's sufficient. The drives are probably all from the same batch, since they were purchased from the same store at the same time. Perhaps it's a batch issue, but this batch issue is starting to get rather expensive.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
> Possibly, or just bad luck, another possibility would be abnormal vibrations, it's also uncommon for the disk to fail without any SMART warnings, not just pending or reallocated sectors but even Raw_Read_Error_Rate and Multi_Zone_Error_Rate are still a perfect 0, these usually start climbing with WD drives before it fails a SMART test.

Did you burn the drives in? With the badblocks tool? I doubt it's related to any kind of vibration; I think it would have to be shaking the entire case.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       60%     27485         3131105000
This part is indeed very odd... the error rates don't show up in the counters, but the test itself reported 60% remaining, yet it is marked as completed, with a read failure.

That's some kind of indication of the onboard controller not working right in my opinion. Not much you can do about it other than RMA the disk though.
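
For anyone who wants to spot rows like this without reading the whole log by eye, here is a minimal Python sketch that parses self-test log rows of the shape posted in this thread. The names are made up for illustration, and it assumes the last three columns are always Remaining, LifeTime(hours) and LBA_of_first_error.

```python
from typing import NamedTuple, Optional


class SelfTestEntry(NamedTuple):
    num: int
    description_and_status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[int]
    passed: bool


def parse_selftest_row(line: str) -> SelfTestEntry:
    """Parse one '# N ...' row of the SMART self-test log."""
    tokens = line.strip().lstrip("#").split()
    # Everything between the row number and the last three columns
    # is the test description followed by the status text.
    middle = " ".join(tokens[1:-3])
    return SelfTestEntry(
        num=int(tokens[0]),
        description_and_status=middle,
        remaining_pct=int(tokens[-3].rstrip("%")),
        lifetime_hours=int(tokens[-2]),
        first_error_lba=None if tokens[-1] == "-" else int(tokens[-1]),
        passed=middle.endswith("Completed without error"),
    )


# Two rows copied from the log in the first post.
LOG = [
    "# 1  Short offline       Completed: read failure       60%     27485         3131105000",
    "# 2  Short offline       Completed without error       00%     27317         -",
]

for entry in map(parse_selftest_row, LOG):
    if not entry.passed:
        print(f"test #{entry.num} failed at {entry.lifetime_hours} h, "
              f"LBA {entry.first_error_lba}, {entry.remaining_pct}% remaining")
```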
 
Joined
Jan 27, 2020
Messages
577
> Not much you can do about it other than RMA the disk though.

Sadly, that's not an option anymore.

[Attached image: 1586953781278.png]
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Your experience suggests that "shucking" WD external drives, 8TB or greater (where the "Enterprise white labels" are), might make a lot of sense even though they lose all warranty. At least the replacement is affordable. I wonder how common it is for drives to fail outside their warranty period, vs. inside.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
@Yorick -- I just read a report about the SMR drives today. I have a replacement drive but haven't installed it yet. What's your resource for looking up the drive number to see if it's SMR?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
Argh, since posting this I've had yet another HD fail. That's the fourth one of these WD Red drives. Honestly, I'd never purchase WD again. To have four out of 10 drives fail at roughly the four-year mark within 6 months on a home system is frankly terrible.
 