Disk failure, is it repairable?

gilgha

Dabbler
Joined
Aug 24, 2016
Messages
15
Ahoy TrueNAS community,

My ZFS pool of 4x 2TB WD Blue disks has been running for more than 3 years without issues, but today it reached the DEGRADED state :frown:. As I understand from this post, it might not be necessary to replace the faulty drive(s), as writing to every single sector could fix the issue. Since I have no experience with or knowledge of hard-drive health, could you help me assess the state of my system and find out whether one or more drives should be replaced?

At first, I started receiving alerts stating "Currently unreadable (pending) sectors" on disk /dev/ada[0-3], but the pool state did not degrade. Since I did not have much time to troubleshoot the issue, I more or less ignored the alerts, even though they kept coming back, and for multiple disks.

Then last week, I upgraded my system to the new TrueNAS 12.0 CORE and started to migrate my current datasets to encrypted ones by copying their content using rsync -aHAX. During the copy, one of the drives (ada1) failed and went completely offline; it was not even present in the Storage > Disks section. After a reboot, the drive was back up and the pool returned to a HEALTHY state, so I decided to continue the dataset migration. However, another drive (ada0) failed a few hours later with a different error: Device: /dev/ada0, ATA error count increased from 0 to 20.

So I decided to stop the dataset migration before losing my data... I ran a short and an extended manual S.M.A.R.T. test on ada0 and both failed:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     30432         11782318
# 2  Short offline       Completed: read failure       90%     30432         11782312


Since I also received alerts for other drives, I am afraid others might critically fail in the very near future. So my questions are:
  • Should I run manual S.M.A.R.T. tests on the other drives to assess their situation? Is there any risk that this will cause a disk to fail?
  • Might writing to every single sector of the drive fix the issue? Or should the disk be considered completely broken and be replaced?
Below is the output of smartctl -a for every drive of the pool.

Code:
root@freenas:~ # smartctl -a /dev/ada0
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RC3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M1XTYEEN
LU WWN Device Id: 5 0014ee 20e5e224f
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Nov  3 09:13:59 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (25980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 263) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       166
  3 Spin_Up_Time            0x0027   180   173   021    Pre-fail  Always       -       3983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       30443
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       3147230
194 Temperature_Celsius     0x0022   122   108   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 20 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 20 occurred at disk power-on lifetime: 30431 hours (1267 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 58 b0 e6 b3 40  Error: UNC 88 sectors at LBA = 0x00b3e6b0 = 11790000

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 58 90 e6 b3 40 08      00:16:39.824  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:36.450  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:33.075  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:29.701  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:26.342  READ DMA

Error 19 occurred at disk power-on lifetime: 30431 hours (1267 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 58 b0 e6 b3 40  Error: UNC 88 sectors at LBA = 0x00b3e6b0 = 11790000

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 58 90 e6 b3 40 08      00:16:36.450  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:33.075  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:29.701  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:26.342  READ DMA

Error 18 occurred at disk power-on lifetime: 30431 hours (1267 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 58 b0 e6 b3 40  Error: UNC 88 sectors at LBA = 0x00b3e6b0 = 11790000

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 58 90 e6 b3 40 08      00:16:33.075  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:29.701  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:26.342  READ DMA

Error 17 occurred at disk power-on lifetime: 30431 hours (1267 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 58 b0 e6 b3 40  Error: UNC 88 sectors at LBA = 0x00b3e6b0 = 11790000

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 58 90 e6 b3 40 08      00:16:29.701  READ DMA
  c8 00 58 90 e6 b3 40 08      00:16:26.342  READ DMA

Error 16 occurred at disk power-on lifetime: 30431 hours (1267 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 58 b0 e6 b3 40  Error: UNC 88 sectors at LBA = 0x00b3e6b0 = 11790000

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 58 90 e6 b3 40 08      00:16:26.342  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     30432         11782318
# 2  Short offline       Completed: read failure       90%     30432         11782312

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
root@freenas:~ # smartctl -a /dev/ada1
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RC3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M6YUTJ29
LU WWN Device Id: 5 0014ee 2b8d65d5e
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Nov  3 09:14:30 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (25680) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 260) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       86
  3 Spin_Up_Time            0x0027   194   177   021    Pre-fail  Always       -       3258
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   199   000    Old_age   Always       -       37
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27634
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2939023
194 Temperature_Celsius     0x0022   119   106   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       8

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
root@freenas:~ # smartctl -a /dev/ada2
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RC3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M6YUT6LE
LU WWN Device Id: 5 0014ee 26380aba7
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Nov  3 09:15:00 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (26760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 270) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   173   021    Pre-fail  Always       -       3991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27629
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2928950
194 Temperature_Celsius     0x0022   120   107   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27627         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
root@freenas:~ # smartctl -a /dev/ada3
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RC3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20EZRZ-00Z5HB0
Serial Number:    WD-WCC4M3LLSYT0
LU WWN Device Id: 5 0014ee 2b8d6541e
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Nov  3 09:15:21 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (25980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 263) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       200
  3 Spin_Up_Time            0x0027   179   173   021    Pre-fail  Always       -       4025
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27614
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       2932984
194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       7

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
Hello,

Well, it doesn't look very good on 3 of your drives... :-( ada0, ada1 and ada3 all have pending sectors, which is not a good sign.
And ada0 also has lots of errors logged.
Another attribute is the multi zone error rate. I'm not very familiar with that one (and it seems it's not a critical one), but I'd say that, combined with the current pending attribute, it might indicate a failing drive. Granted, the values are not that high though...
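To pull those few attributes out of a long smartctl dump, something like the following could be used. This is only a minimal sketch: the sample rows are copied from the ada0 output earlier in the thread, while the function names and the choice of watched IDs (5, 197, 198, 200) are my own assumptions and not anything smartctl or TrueNAS provides.

```python
import re

# Attribute rows copied from the "smartctl -a /dev/ada0" output in this thread.
SAMPLE_ADA0 = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
"""

# Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for every attribute row found."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def nonzero_watchlist(attrs, watch=(5, 197, 198, 200)):
    """Keep only the watched attributes whose raw value is above zero."""
    return {i: attrs[i] for i in watch if i in attrs and attrs[i][1] > 0}


if __name__ == "__main__":
    for attr_id, (name, raw) in nonzero_watchlist(parse_attributes(SAMPLE_ADA0)).items():
        print(f"{attr_id:>3} {name}: raw={raw}")
```

Feeding it the full output of each drive instead of the short sample should give a quick per-disk list of the counters worth a closer look.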

Another detail that caught my attention on all of the drives: attribute 193 Load_Cycle_Count is very high (over 2 million), which is way over the WD Blue specs (if I recall correctly).
Also, you should have SMART tests running regularly on your drives (which is not the case). Good practice is to run a long test once a month and a short one once a week.
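To put the load cycle numbers in perspective, here is a rough back-of-the-envelope sketch. The figures are the raw values of attributes 9 (Power_On_Hours) and 193 (Load_Cycle_Count) from the four smartctl outputs above; the function name is made up for illustration, and the result is only an average over the whole life of each drive.

```python
# (power_on_hours, load_cycle_count) raw values taken from the smartctl output above.
DRIVES = {
    "ada0": (30443, 3147230),
    "ada1": (27634, 2939023),
    "ada2": (27629, 2928950),
    "ada3": (27614, 2932984),
}


def load_cycle_stats(power_on_hours, load_cycles):
    """Return (average cycles per hour, average seconds between two cycles)."""
    per_hour = load_cycles / power_on_hours
    seconds_between = 3600 / per_hour
    return per_hour, seconds_between


if __name__ == "__main__":
    for name, (hours, cycles) in DRIVES.items():
        per_hour, gap = load_cycle_stats(hours, cycles)
        print(f"{name}: {per_hour:.1f} cycles/hour, one cycle every {gap:.1f} s on average")
```

All four drives come out at more than 100 load cycles per power-on hour, i.e. the heads have been parking roughly every 34 to 35 seconds on average.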

You might be able to correct some of the issues by rewriting the affected sector: the write might either succeed in place (false alert) or make the drive remap the sector (so attribute 5, Reallocated_Sector_Ct, will go up).
Your disk might still be usable but I would closely monitor it (keep an eye on attributes 197 and 198 to make sure they don't increase). You'll have to decide if you want to continue to use such disk or not. I guess that is depending on the data it holds and depending on the redundancy of the pool you have.
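For the "keep an eye on it" part, one possible approach is to compare two readings of the raw values taken some time apart. The sketch below is only an illustration: the first reading uses ada0's values from this thread, the second reading is entirely hypothetical, and watching attribute 5 in addition to 197 and 198 is an assumption of mine.

```python
WATCH = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def increases(before, after, watch=WATCH):
    """Return {attribute_id: (old_raw, new_raw)} for watched attributes that grew."""
    grown = {}
    for attr_id in watch:
        old = before.get(attr_id, 0)
        new = after.get(attr_id, 0)
        if new > old:
            grown[attr_id] = (old, new)
    return grown


# ada0 raw values as posted in this thread.
today = {5: 0, 197: 16, 198: 0}
# HYPOTHETICAL later reading, invented purely to demonstrate the comparison.
later = {5: 3, 197: 21, 198: 0}

if __name__ == "__main__":
    for attr_id, (old, new) in increases(today, later).items():
        print(f"{attr_id} {WATCH[attr_id]}: {old} -> {new}")
```

If a comparison like this keeps reporting growth from one reading to the next, that would be a strong hint that the disk is getting worse rather than stabilising.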

The easiest way to rewrite a sector would be... to rewrite the complete disk! :smile: (I'm lazy). You can use badblocks in destructive mode for that. In your case it might be a good option for evaluating the disk's health. It's quite a long process and you should run it on the 3 drives (and why not on the 4th one while you're at it?).
If you want to target only a specific sector then you can use dd for that (but then you might miss other issues revealed by badblocks).
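If you go the targeted route, it helps to know where the reported LBAs actually sit on the disk. The sketch below is just arithmetic on numbers from the ada0 output (512-byte logical and 4096-byte physical sectors, the two self-test LBAs and the error-log LBA); the function name is made up, and the dd line it prints is a read-only check of the block, not a rewrite.

```python
LOGICAL = 512    # bytes per logical sector, from "Sector Sizes" in the smartctl output
PHYSICAL = 4096  # bytes per physical sector, same source


def locate(lba):
    """Map a logical LBA to (physical sector index, first LBA of that sector, byte offset)."""
    per_physical = PHYSICAL // LOGICAL      # 8 logical sectors per physical sector
    physical_index = lba // per_physical
    first_lba = physical_index * per_physical
    return physical_index, first_lba, first_lba * LOGICAL


# LBAs reported for ada0: two self-test failures and the error-log entry (0x00b3e6b0).
REPORTED = [11782318, 11782312, int("00b3e6b0", 16)]

if __name__ == "__main__":
    for lba in REPORTED:
        index, first_lba, offset = locate(lba)
        print(f"LBA {lba}: physical sector {index}, starts at LBA {first_lba}, byte offset {offset}")
        # Read-only check of that single 4096-byte block; nothing is written to the disk.
        print(f"  dd if=/dev/ada0 of=/dev/null bs={PHYSICAL} skip={index} count=1")
```

By this arithmetic the short and the extended self-test both stopped inside the same 4096-byte physical sector, while the ATA errors in the log point to a different sector a little further on, so there is more than one bad spot on ada0.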

If you do that, you have to make sure to have appropriate/reliable backups!


I had a similar experience with 2TB drives and I tried to "recover" them, but it didn't work. Out of 8 drives, 5 were failing (i.e. starting to have pending sectors and failing long tests) and it didn't get better! I had a RAIDz2 pool with 8x 2TB drives, so having 5 drives failing was critical for the pool. :tongue:
 

gilgha

Dabbler
Joined
Aug 24, 2016
Messages
15
Thank you very much Pitfrr for that extensive explanation, it is much clearer now!

I finally decided to order 4 new WD Red 2TB drives to replace the whole pool. However, I'm a bit worried by your comment regarding the load cycle count... Is a value of over 2M normal for around 3 years of home usage? I just checked the WD Red spec sheet, which states 600,000 load/unload cycles. Anyway, I will remember to schedule regular SMART tests for those new drives, thanks for the tip :smile:.

I will also try to rewrite all 4 WD Blue disks afterwards and see if I can get them to pass the SMART long test. I really don't want to trash 4x2TB drives so I might use them for less important data...
 

Fredda

Guru
Joined
Jul 9, 2019
Messages
608
Try to cancel your order. According to your linked spec these drives are SMR drives, which are not suitable for ZFS.

 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
Regarding the LCC, well, WD Blue drives are not designed to be used 24/7. That's why they have a lower LCC spec than WD Red.
And before using WD Blue (or Green) drives, the timer for the LCC should be changed (by default it is 8 seconds, if I recall correctly).
There is a tool for that called WDIDLE3.EXE and a thread about it...

The only thing I can say about the 2M value for the LCC is that it is out of spec! :-D
Is it a "normal" value for 3 years? I don't know... since on the drives I had, I deactivated the LCC timer. :tongue:

I really don't want to trash 4x2TB drives
I had the same feeling with my 5x 2TB drives failing... but there's not much to do about it... unfortunately.
 

gilgha

Dabbler
Joined
Aug 24, 2016
Messages
15
Try to cancel your order. According to your linked spec these drives are SMR drives, which are not suitable for ZFS.

Damn... Thanks for pointing that out! I will just send them back; since I ordered them from Amazon, this should not be a problem. I just ordered 2x 4 TB Toshiba N300 instead, which I will configure in a mirror setup. It is cheaper and it will leave room for a future upgrade by adding two more. Any considerations with those? :confused:

Is it a "normal" value for 3 years? I don't know... since on the drives I had, I deactivated the LCC timer. :tongue:

I was not aware that it was still required to perform so much tweaking for the drives themselves... I will definitely research this topic when I have more time, to avoid making the same mistake twice.
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
I was not aware that it was still required to perform so much tweaking for the drives themselves...
Actually I don't know for the more recent drives if it is still needed...
But in any case, for WD drives, I'd monitor the #193 attribute and if it goes abnormally up then I'd investigate (starting with WDIDLE3.EXE).
I also wouldn't worry too much about Red ones (since they are supposed to be NAS-specific drives), but for Blue, Green (or other colors, i.e. non-NAS-specific) I'd check it out.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Should I run S.M.A.R.T. manual tests on the other drives to assess their situation?
You should have been running them on a schedule for the past three years. But since you haven't, yes, run them now to get the bad news. The output of zpool status would also be helpful to figure out just how much trouble you're in.
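As a rough illustration of what "run them now" amounts to on the command line, the sketch below only builds and prints the commands; it does not execute anything. The device names are the three drives from this thread that have no extended test logged, and the helper names are made up. Running a long test on all drives at the same time is my assumption here, not something stated above.

```python
import shlex

# The drives other than ada0, which already has failed self-tests logged.
DISKS = ["ada1", "ada2", "ada3"]


def selftest_commands(disks, kind="long"):
    """Build one 'smartctl -t <kind>' command per disk (kind is 'short' or 'long')."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return [["smartctl", "-t", kind, f"/dev/{disk}"] for disk in disks]


def report_commands(disks):
    """Build the commands to read the self-test logs and the pool state afterwards."""
    commands = [["smartctl", "-l", "selftest", f"/dev/{disk}"] for disk in disks]
    commands.append(["zpool", "status", "-v"])
    return commands


if __name__ == "__main__":
    for command in selftest_commands(DISKS) + report_commands(DISKS):
        print(shlex.join(command))
```

The extended test takes around 260 to 270 minutes on these drives according to the recommended polling times in the output above, so the self-test log is only worth reading after that.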
 