SMART test failing

Darkriser

Cadet
Joined
Nov 30, 2016
Messages
5
Dear all,

I'm running TrueNAS-SCALE-22.02-RC.2 with a single pool:
(the 4T disks are new ones; the 3T disks are my own, used in my previous NAS)
Code:
root@truenas[~]# zpool list -v
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
DataPool                                  6.34T  2.34T  4.01T        -         -     0%    36%  1.00x    ONLINE  /mnt
  mirror                                  3.62T  2.34T  1.29T        -         -     0%  64.4%      -    ONLINE
    038179d3-ce48-4f9a-a34f-d4c1073c5ba2      -      -      -        -         -      -      -      -    ONLINE
    db3a4457-ccef-408b-a401-131b31278967      -      -      -        -         -      -      -      -    ONLINE
  mirror                                  2.72T  1.21G  2.72T        -         -     0%  0.04%      -    ONLINE
    76db0406-4dbe-443c-99c3-dbd2ea34f2de      -      -      -        -         -      -      -      -    ONLINE
    11d527de-af45-491c-944e-298ad05d755b      -      -      -        -         -      -      -      -    ONLINE
boot-pool                                 13.5G  2.98G  10.5G        -         -     3%    22%  1.00x    ONLINE  -
  sde3                                    13.5G  2.98G  10.5G        -         -     3%  22.0%      -    ONLINE


After initial set-up I executed some SMART tests but one of my older disks failed to pass.
I tried to investigate, manually executed LONG and SELECT tests, still the same result.
I wanted to follow this post for further investigation, but it applies to ext2/ext3 only.
I assume that errors in the Attributes section (Raw_Read_Error_Rate and Multi_Zone_Error_Rate) which are actually related to this issue are not critical ones (are they?).

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N5FPESPZ
LU WWN Device Id: 5 0014ee 2b840be71
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 16 10:01:39 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (38460) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 386) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   193   178   021    Pre-fail  Always       -       5316
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       201
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       44786
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       201
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       164
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       532
194 Temperature_Celsius     0x0022   117   107   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       5
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       90%     44786         409252712
# 2  Extended offline    Completed: read failure       90%     44763         409252712
# 3  Short offline       Completed: read failure       10%     44755         409252712
# 4  Short offline       Completed without error       00%     44710         -
# 5  Short offline       Completed without error       00%     44686         -
# 6  Short offline       Completed without error       00%     44662         -
# 7  Short offline       Completed without error       00%     44638         -
# 8  Extended offline    Completed without error       00%     44621         -
# 9  Short offline       Completed without error       00%     44590         -
#10  Short offline       Completed without error       00%     44566         -
#11  Short offline       Completed without error       00%     44542         -
#12  Short offline       Completed without error       00%     44518         -
#13  Short offline       Completed without error       00%     44494         -
#14  Short offline       Completed without error       00%     44470         -
#15  Extended offline    Completed without error       00%     44453         -
#16  Short offline       Completed without error       00%     44422         -
#17  Short offline       Completed without error       00%     44398         -
#18  Short offline       Completed without error       00%     44374         -
#19  Short offline       Completed without error       00%     44350         -
#20  Short offline       Completed without error       00%     44326         -
#21  Short offline       Completed without error       00%     44302         -

SMART Selective self-test log data structure revision number 1
 SPAN    MIN_LBA    MAX_LBA  CURRENT_TEST_STATUS
    1  409252712  409252713  Completed_read_failure [90% left] (409252712-409318247)
    2          0          0  Not_testing
    3          0          0  Not_testing
    4          0          0  Not_testing
    5          0          0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


But now I'd like to know how to resolve the issue so that the regular SMART tests stop failing.
Thanks for any help in advance...
 

Darkriser

Cadet
Joined
Nov 30, 2016
Messages
5
Thanks for the answer.
I also found this post, however I was afraid of just writing zeros to my disk without further info.
As I have a mirror-type vdev, does it mean that if I write zeros to one of the disks, the 'actual data' (if any) stored on the other disk will be mirrored back to the 'now failing' disk _automatically_?
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
When you write the zeros the ZFS checksum will fail on that disk and the correct data will be recovered from the mirror.
Run a Scrub to do that.
 

Darkriser

Cadet
Joined
Nov 30, 2016
Messages
5
Thanks a lot for your explanation, this makes sense.

I just wonder how zfs knows which of the disks contains correct data?
What prevents it from copying those recently written zeros to the other (good) disk?
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
"just wonder how zfs knows which of the disks contains correct data?"
The checksum will be bad on the disk you have written zeros to.
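To make the idea concrete, here is a toy model in shell. It is only a sketch and not how ZFS is implemented: two temporary files stand in for the two mirror members, and a SHA-256 sum kept in a shell variable stands in for the checksum that ZFS stores in the block pointer, separately from the data it protects. Because the expected checksum does not live inside the zeroed copy, the zeroed copy cannot pass verification and the intact copy is used for the repair.

```shell
#!/bin/sh
# Toy model of a 2-way mirror: two files stand in for the two disks.
WORK=$(mktemp -d)
printf 'important data\n' > "$WORK/disk_a"
cp "$WORK/disk_a" "$WORK/disk_b"

# The expected checksum is recorded apart from the data itself.
GOOD_SUM=$(sha256sum < "$WORK/disk_a" | cut -d' ' -f1)

# Overwrite one copy with zeros of the same length.
SIZE=$(($(wc -c < "$WORK/disk_a")))
dd if=/dev/zero of="$WORK/disk_a" bs="$SIZE" count=1 2>/dev/null

# "Scrub": verify every copy against the recorded checksum.
for d in disk_a disk_b; do
    SUM=$(sha256sum < "$WORK/$d" | cut -d' ' -f1)
    if [ "$SUM" = "$GOOD_SUM" ]; then GOOD_COPY=$d; else BAD_COPY=$d; fi
done

# Repair the copy that failed verification from the one that passed.
cp "$WORK/$GOOD_COPY" "$WORK/$BAD_COPY"
REPAIRED=$(cat "$WORK/disk_a")
echo "bad copy was $BAD_COPY, repaired from $GOOD_COPY"
rm -rf "$WORK"
```

The zeros never reach the good copy because the repair only ever goes from a copy that matches the checksum to one that does not.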
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Zeroing a disk while it is an active member of a 2-way mirror means you won't have any redundancy.
You'd better buy a new disk, burn it in, attach it to the mirror, let the 3-way mirror resilver and then remove the failing old drive. Once your data is secure, you may run badblocks to force the old disk to reallocate the failing sectors, but it is likely that further sectors will fail. Take that as a warning and prepare to replace both of the old 3 TB drives.
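As a rough sketch of that sequence on the command line: the script below only prints the commands so they can be reviewed, it runs nothing. The pool name comes from the first post; HEALTHY-MEMBER-PARTUUID, FAILING-MEMBER-PARTUUID and NEW-DISK-PARTUUID are placeholders, not real identifiers. On TrueNAS the web UI is normally the preferred way to change pool membership, so treat this as an illustration of the order of operations.

```shell
#!/bin/sh
# Sketch only: builds and prints the plan, runs nothing.
# All device names below are placeholders (assumptions).
POOL=DataPool
HEALTHY=HEALTHY-MEMBER-PARTUUID              # mirror member that still passes SMART
FAILING=FAILING-MEMBER-PARTUUID              # mirror member with the read failures
NEW=/dev/disk/by-partuuid/NEW-DISK-PARTUUID  # burned-in replacement disk

# 1) attach the new disk next to the healthy member (2-way becomes 3-way)
# 2) check status until the resilver has finished
# 3) only then detach the failing member
PLAN=$(printf '%s\n' \
    "zpool attach $POOL $HEALTHY $NEW" \
    "zpool status -v $POOL" \
    "zpool detach $POOL $FAILING")
echo "$PLAN"
```

Attaching before detaching keeps two good copies of the data at every step, which is the point of going through a temporary 3-way mirror.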
 

Darkriser

Cadet
Joined
Nov 30, 2016
Messages
5
MANY thanks to both of you for valuable answers.
Before I built my NAS (a few days ago) I created a backup of all valuable data, so the risk of losing redundancy is (slightly) mitigated and acceptable :smile:.

(appreciate the quick answers, really....)
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
If you are concerned about writing to a "live" drive then take it offline first and put it back in the mirror when you are done.
Please post the results of what you end up doing
 

Darkriser

Cadet
Joined
Nov 30, 2016
Messages
5
Well, some progress here, not very positive, though.
Executing this:
Code:
dd if=/dev/sdc of=/dev/zero bs=512 count=1 seek=409252712
led to an I/O error and Current_Pending_Sector was increased to 1.
But when I use a 4 KB block size, everything goes well:
Code:
root@truenas[~]# dd if=/dev/sdc of=/dev/zero bs=4096 count=1 seek=409252712
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 5.1144e-05 s, 80.1 MB/s
(wait - wasn't this expected to fail and reallocate??)
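One possible explanation, worth checking before drawing conclusions: in dd, `if=` names the source and `of=` the destination, and `seek=` counts blocks of size `bs` on the output (`skip=` does the same on the input). As written, both commands have /dev/sdc as the input and /dev/zero as the output, so neither writes to the disk, and with bs=4096 the number 409252712 no longer refers to the same place as it does with bs=512. The sketch below converts the failing 512-byte LBA into a 4096-byte block index and prints a candidate overwrite command without executing it. The device name is taken from the post above; running the printed command destroys whatever is stored in that block.

```shell
#!/bin/sh
# Sketch only: prints a command for review, does not touch the disk.
LBA=409252712     # LBA_of_first_error from the self-test log (512-byte units)
LOGICAL=512       # logical sector size reported by smartctl
PHYSICAL=4096     # physical sector size reported by smartctl
DISK=/dev/sdc     # device name as used in the post (assumption)

RATIO=$((PHYSICAL / LOGICAL))   # logical sectors per physical sector
BLOCK=$((LBA / RATIO))          # 4 KiB block that contains the failing LBA
OFFSET=$((LBA % RATIO))         # position of the LBA inside that block

echo "4K block index: $BLOCK (offset $OFFSET)"
# if= is the zero source, of= is the disk; seek= is counted in bs-sized blocks.
echo "dd if=/dev/zero of=$DISK bs=$PHYSICAL count=1 seek=$BLOCK oflag=direct"
```

The first line of output is `4K block index: 51156589 (offset 0)`, so the failing LBA sits at the start of a physical sector and a single 4 KiB write would cover it.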

Unfortunately, even after this I still see
Code:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1

And my SMART tests still keep failing
Code:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     44800         409252712
# 2  Extended offline    Completed: read failure       90%     44794         409252712
# 3  Extended offline    Completed: read failure       90%     44793         409252712
# 4  Selective offline   Completed: read failure       80%     44793         409252712


Don't know how to investigate/resolve this further but my plan is:
1) buy 2 new 4T disks and replace the old 3T ones one-by-one (probably the best way to go here, isn't it?)
2) then I'll try to burn-in the _whole_ problematic 3T disk using 'badblocks' to see what happens.

Or do you have any other (better/cheaper) hints?
(bear in mind that my 3T disks have Power_On_Hours 44809 and 26186)
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
I think your plan is good. The drive is not working as expected.
Those hours are nothing to be concerned about. I have 64K hours on several WD drives.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
"I assume that errors in the Attributes section (Raw_Read_Error_Rate and Multi_Zone_Error_Rate) which are actually related to this issue are not critical ones (are they?)."
The Multi-Zone error can be critical.

# 1 Short offline Completed: read failure 90% 44800 409252712
# 2 Extended offline Completed: read failure 90% 44794 409252712
# 3 Extended offline Completed: read failure 90% 44793 409252712
These are critical as well, actually the most critical.

If you would like to try to write over those LBAs, look in my signature line for the Hard Drive Troubleshooting Guide. There is a section in there on how to accomplish this, but it's written for FreeBSD (Core), not Debian (Scale), so I do not think the sysctl command is required. As previously stated, back up your data before trying anything like this.

"led to an I/O error and Current_Pending_Sector was increased to 1."
This is a good thing. Next, change the count to, say, 10 and also reduce the LBA by a few. Run the test several times. You want to have several Pending Sector Errors until it changes to 0 and the Sector Error Count changes to 1. Please understand that the media on the hard drive platters is likely flaking off, so odds are you will have many sector errors. This could take some time. Be sure to run a SMART Short test and, if it passes, an Extended test. You could create a write scenario that continuously writes to a group of LBAs for hours, thus forcing the drive to recognize the failures; well, that is the idea. It doesn't always work out so well.
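A minimal sketch of the range arithmetic, assuming "a few" means 4 sectors before the reported LBA and a count of 10. It only prints commands for review. The dd line here is a read-only probe (the disk is the input, /dev/null the output); the write step itself is destructive and is left to the guide mentioned above. The device name is the one used earlier in the thread.

```shell
#!/bin/sh
# Sketch only: prints commands for review, runs nothing against the disk.
DISK=/dev/sdc     # device name as used earlier in the thread (assumption)
LBA=409252712     # LBA_of_first_error from the self-test log
BEFORE=4          # "reduce the LBA by a few" (the value 4 is an assumption)
COUNT=10          # "change the count to say 10"

START=$((LBA - BEFORE))
END=$((START + COUNT - 1))

# Read probe over the range: skip= positions the input in 512-byte sectors.
echo "dd if=$DISK of=/dev/null bs=512 skip=$START count=$COUNT iflag=direct"
# Selective self-test over the same span, then a short test and the log.
echo "smartctl -t select,$START-$END $DISK"
echo "smartctl -t short $DISK"
echo "smartctl -l selftest $DISK"
```

With these values the span is 409252708 to 409252717, which brackets the failing LBA on both sides.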

My best advice, replace the hard drive.
 