Unrecoverable error, but disks seem okay?

Dwarf Cavendish

Contributor
Joined
Dec 19, 2017
Messages
121
I just had a scary moment when I opened my email:

Pool mypool state is ONLINE: One or more devices has experienced an
unrecoverable error. An attempt was made to correct the error. Applications
are unaffected.

And an email from a few hours later:

hostnameofmynas had an unscheduled system reboot. The operating system
successfully came back online at Sat Apr 8 12:07:35 2023

Judging from the emails I received, the error was reported by the most recent scrub, which ran last night. The reboot cleared the warning about my pool.

Now about my system: I'm running TrueNAS-13.0-U4 on an HPE Microserver Gen10. The disks are two WD Red 4TB in a mirrored vdev. I have been running since FreeNAS 11.1 and inherited legacy encryption from that (which makes me a bit scared of replacing the disk, due to the required rekey step, which I only practiced years back).

Now, as for the SMART output, first ada0:

Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K2ES26KD
LU WWN Device Id: 5 0014ee 2ba0434d8
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Apr  8 14:32:52 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (45000) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 478) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   163   163   021    Pre-fail  Always       -       6833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43078
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       125
194 Temperature_Celsius     0x0022   126   108   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     42907         -
# 2  Extended offline    Completed without error       00%     42500         -
# 3  Extended offline    Completed without error       00%     42164         -
# 4  Extended offline    Completed without error       00%     41828         -
# 5  Extended offline    Completed without error       00%     41493         -
# 6  Extended offline    Completed without error       00%     41085         -
# 7  Extended offline    Completed without error       00%     40749         -
# 8  Extended offline    Completed without error       00%     40342         -
# 9  Extended offline    Completed without error       00%     40006         -
#10  Extended offline    Completed without error       00%     39622         -
#11  Extended offline    Completed without error       00%     39287         -
#12  Extended offline    Completed without error       00%     38690         -
#13  Extended offline    Completed without error       00%     38306         -
#14  Extended offline    Completed without error       00%     38017         -
#15  Extended offline    Completed without error       00%     37426         -
#16  Extended offline    Completed without error       00%     37227         -
#17  Extended offline    Completed without error       00%     36892         -
#18  Extended offline    Completed without error       00%     36524         -
#19  Extended offline    Completed without error       00%     36189         -
#20  Extended offline    Completed without error       00%     35781         -
#21  Extended offline    Completed without error       00%     35446         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


And ada1:

Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K1XR4UJX
LU WWN Device Id: 5 0014ee 264ad66a2
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Apr  8 14:33:00 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (44520) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 472) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   160   160   021    Pre-fail  Always       -       6958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43077
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       124
194 Temperature_Celsius     0x0022   126   108   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     42906         -
# 2  Extended offline    Completed without error       00%     42499         -
# 3  Extended offline    Completed without error       00%     42163         -
# 4  Extended offline    Completed without error       00%     41828         -
# 5  Extended offline    Completed without error       00%     41492         -
# 6  Extended offline    Completed without error       00%     41085         -
# 7  Extended offline    Completed without error       00%     40749         -
# 8  Extended offline    Completed without error       00%     40341         -
# 9  Extended offline    Completed without error       00%     40006         -
#10  Extended offline    Completed without error       00%     39622         -
#11  Extended offline    Completed without error       00%     39286         -
#12  Extended offline    Completed without error       00%     38689         -
#13  Extended offline    Completed without error       00%     38305         -
#14  Extended offline    Completed without error       00%     38016         -
#15  Extended offline    Completed without error       00%     37426         -
#16  Extended offline    Completed without error       00%     37227         -
#17  Extended offline    Completed without error       00%     36891         -
#18  Extended offline    Completed without error       00%     36524         -
#19  Extended offline    Completed without error       00%     36188         -
#20  Extended offline    Completed without error       00%     35781         -
#21  Extended offline    Completed without error       00%     35445         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I find these values tough to read, so I diffed them. Only Spin_Up_Time, Power_On_Hours and Load_Cycle_Count have slightly different values between the disks, which suggests that they're (close to) default values. Also none are marked as failed. All in all I'm inclined to say that the disks are fine and that I can consider this a friendly reminder that I should make sure that I got my back-ups in order (which I mostly do). I just kicked off another scrub to be sure, though.
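For anyone who wants to repeat this comparison without eyeballing two listings, here is a minimal Python sketch. It assumes the `smartctl -a` output of each drive has been saved as text; `parse_attributes` and `diff_attributes` are hypothetical helper names (not part of smartmontools), and the sample rows are only a subset copied from the two listings above.

```python
import re

# One row of the attribute table printed by `smartctl -a`:
# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(smart_text):
    """Return {name: (value, worst, thresh, raw)} for every table row found."""
    attrs = {}
    for line in smart_text.splitlines():
        m = ROW.match(line)
        if m:
            _id, name, value, worst, thresh, raw = m.groups()
            attrs[name] = (int(value), int(worst), int(thresh), int(raw))
    return attrs


def diff_attributes(a_text, b_text):
    """Return the attributes present in both listings whose numbers differ."""
    a, b = parse_attributes(a_text), parse_attributes(b_text)
    return {n: (a[n], b[n]) for n in a if n in b and a[n] != b[n]}


# A few rows copied from the ada0 and ada1 listings above.
ADA0 = """\
  3 Spin_Up_Time            0x0027   163   163   021    Pre-fail  Always       -       6833
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       125
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
"""
ADA1 = """\
  3 Spin_Up_Time            0x0027   160   160   021    Pre-fail  Always       -       6958
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       124
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
"""

for name, (left, right) in diff_attributes(ADA0, ADA1).items():
    print(f"{name}: ada0 raw={left[3]}, ada1 raw={right[3]}")
```

Comparing the parsed numbers rather than the raw text keeps whitespace and timestamp differences out of the result, so only attribute rows that really differ are listed.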

Can anyone think of reasons to be more concerned, additional things to check perhaps? Thanks in advance for your thoughts :smile: !
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Your ada1 has a non-zero value for 200 Multi_Zone_Error_Rate, which probably triggered the pool warning. Just to make sure your pool is OK, what's the output of zpool status -v mypool?
 

Dwarf Cavendish

Samuel Tai said: "Your ada1 has a non-zero value for 200 Multi_Zone_Error_Rate, which probably triggered the pool warning. Just to make sure your pool is OK, what's the output of zpool status -v mypool?"
Oh, good one, I must have overlooked that one.

Code:
  pool: mypool
 state: ONLINE
  scan: scrub in progress since Sat Apr  8 14:57:52 2023
    712G scanned at 886M/s, 102G issued at 127M/s, 1.24T total
    0B repaired, 8.00% done, 02:37:43 to go
config:

    NAME                                                STATE     READ WRITE CKSUM
    mypool                                              ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        gptid/1de5ca4c-f7aa-11e7-aa45-98f2b3ebbf98.eli  ONLINE       0     0     0
        gptid/1f0a9ef7-f7aa-11e7-aa45-98f2b3ebbf98.eli  ONLINE       0     0     0

errors: No known data errors
 

Samuel Tai

So it's likely ada1 recovered from the error on its own. Your pool looks OK, but check the status again after the scrub completes. You may want to have a replacement drive on hand to replace ada1 if it goes south in the future.
 

Dwarf Cavendish

Samuel Tai said: "So it's likely ada1 recovered from the error on its own. Your pool looks OK, but check the status again after the scrub completes. You may want to have a replacement drive on hand to replace ada1 if it goes south in the future."
I will and I already have the replacement drive on hand. Thanks a lot, sir :smile: .
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Samuel Tai said: "Your ada1 has a non-zero value for 200 Multi_Zone_Error_Rate, which probably triggered the pool warning."
That's unlikely; SMART status has pretty much nothing to do with pool status--they're orthogonal issues that occasionally intersect, but don't directly relate.
Samuel Tai said: "So it's likely ada1 recovered from the error on its own"
The error's automatically cleared on reboot; the fact that the pool error is gone says nothing about the status of the pool or its member disks--at least not until a scrub completes.
 

Samuel Tai

danb35 said: "That's unlikely; SMART status has pretty much nothing to do with pool status--they're orthogonal issues that occasionally intersect, but don't directly relate."
Strictly speaking, that's true. The UI will occasionally interpret SMART errors as problems with the member disk in a pool, and throw a pool warning.
 

Dwarf Cavendish

Scrub just finished:

Code:
  pool: mypool
 state: ONLINE
  scan: scrub repaired 0B in 03:14:21 with 0 errors on Sat Apr  8 18:12:13 2023
config:

    NAME                                                STATE     READ WRITE CKSUM
    mypool                                              ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        gptid/1de5ca4c-f7aa-11e7-aa45-98f2b3ebbf98.eli  ONLINE       0     0     0
        gptid/1f0a9ef7-f7aa-11e7-aa45-98f2b3ebbf98.eli  ONLINE       0     0     0

errors: No known data errors



It looks like I was lucky. I should plan to get rid of that legacy encryption soon, so that replacing the disk is easier once it starts going bad :smile: .
 

Matt_G

Explorer
Joined
Jan 24, 2016
Messages
65
If you haven't already, I would stress test your replacement drive with badblocks now, so it's ready to go at a moment's notice.
 

Dwarf Cavendish

Hmm, something fishy is still going on. Today I removed the geli encryption and resilvering both drives went without issue. But now I get an alert again, "One or more devices has experienced an error resulting in data corruption. Applications may be affected."

Looking at the SMART output, ada0's is nearly the same, only the power on hours and temperature are different. For ada1 it's a bit different:

Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K1XR4UJX
LU WWN Device Id: 5 0014ee 264ad66a2
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Apr 14 20:26:50 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (44520) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 472) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   160   160   021    Pre-fail  Always       -       6958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43227
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       124
194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     42906         -
# 2  Extended offline    Completed without error       00%     42499         -
# 3  Extended offline    Completed without error       00%     42163         -
# 4  Extended offline    Completed without error       00%     41828         -
# 5  Extended offline    Completed without error       00%     41492         -
# 6  Extended offline    Completed without error       00%     41085         -
# 7  Extended offline    Completed without error       00%     40749         -
# 8  Extended offline    Completed without error       00%     40341         -
# 9  Extended offline    Completed without error       00%     40006         -
#10  Extended offline    Completed without error       00%     39622         -
#11  Extended offline    Completed without error       00%     39286         -
#12  Extended offline    Completed without error       00%     38689         -
#13  Extended offline    Completed without error       00%     38305         -
#14  Extended offline    Completed without error       00%     38016         -
#15  Extended offline    Completed without error       00%     37426         -
#16  Extended offline    Completed without error       00%     37227         -
#17  Extended offline    Completed without error       00%     36891         -
#18  Extended offline    Completed without error       00%     36524         -
#19  Extended offline    Completed without error       00%     36188         -
#20  Extended offline    Completed without error       00%     35781         -
#21  Extended offline    Completed without error       00%     35445         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


What differs is the seek error rate. Last week's:

Code:
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0


Today's:

Code:
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0


It's very confusing/suspect that it's exactly 200, because a lot of the other parameters appear to default to that. But does this mean that my drive is dying? It's rather frustrating that it appears to be so hard to figure out if SMART values are problematic or not :frown: .

Ah well, another scrub I suppose...
 

Samuel Tai

Yes, it appears the controller for ada1 is going south. You may want to just replace it.
 

Dwarf Cavendish

Samuel Tai said: "Yes, it appears the controller for ada1 is going south. You may want to just replace it."
I'll finish the scrub and turn off my server until I can get to it, probably on Sunday.

...that actually was a risky move that I did earlier today, removing geli encryption on a mirror with a failing hard drive :eek: ...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
My two cents: Run a SMART Long test. If it passes then your drive is still good... For Now.

I will explain a few things... The Seek Error Rate RAW_VALUE is still zero; the 100 and 200 you are comparing are the normalized VALUE. If the raw value were, say, 100 and growing, I'd be concerned that the drive electronics or head(s)/armature are weak and I'd recommend the drive be replaced. Seagate drives report this attribute differently, so their RAW value has to be decoded before it means anything.

The flipping of Seek Error Rate VALUE, WORST, THRESH between 200/200/000 and 100/253/000 has been documented before back in 2017 (that was the earliest reference I could find with a complaint noting it, and yes those exact same values). I didn't research it enough to find out why it happens, all I know is it happens on some WD drives.

As for Multi_Zone_Error_Rate, one error does not fail the drive. I see nothing for ID 5, ID 196 and ID 197; these are key items denoting media failure, and we already talked about seek errors, another key indicator.
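The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads saved `smartctl -a` text, treats a table row as a line with ten whitespace-separated fields, and uses hypothetical helper names (`raw_values`, `nonzero_media_indicators`). The sample rows are copied from the ada1 listing of Apr 14.

```python
# Attribute IDs named in the post above as key indicators of media failure.
MEDIA_IDS = {
    5: "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
}


def raw_values(smart_text):
    """Map attribute ID -> RAW_VALUE for each row of the attribute table."""
    raws = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # A table row starts with a numeric ID and carries a hex flag in column 3.
        is_row = len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x")
        if is_row and fields[9].isdigit():
            raws[int(fields[0])] = int(fields[9])
    return raws


def nonzero_media_indicators(smart_text):
    """Return the key media-failure attributes whose raw value is above zero."""
    raws = raw_values(smart_text)
    return {MEDIA_IDS[i]: raws[i] for i in MEDIA_IDS if raws.get(i, 0) > 0}


# Rows copied from the ada1 listing above.
ADA1 = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
"""

print(nonzero_media_indicators(ADA1))
```

The function looks only at the RAW_VALUE column, because the normalized VALUE of some attributes can change without the raw count moving, as with Seek_Error_Rate in this thread.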

Could the drive have caused the system to reboot? Naw, very doubtful unless there was data corruption of the operating system. Odds are you have some faulty hardware. Intermittent problems are time-consuming to troubleshoot. It could be a power supply, RAM, the motherboard, an add-on card, the CPU, or yes, even a drive if it shorts out the power, but that is highly unlikely and your system would let out the magic smoke.

My recommendation: stress test your system and make sure it's stable. You do have a few hours on that drive, so if you wanted to replace it, that is your decision and it's your data, so no one would fault you. But run the SMART Long test and see how it fares.

Just my two cents.
 

Dwarf Cavendish

@joeschmuck, thanks, and I think you might be right about the stability of the system. The scrub has finished and when I do a zpool status -v:

Code:
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 2.62M in 03:24:04 with 0 errors on Sat Apr 15 00:14:15 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    fishtank                                        ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/1de5ca4c-f7aa-11e7-aa45-98f2b3ebbf98  ONLINE       0     0     2
        gptid/1f0a9ef7-f7aa-11e7-aa45-98f2b3ebbf98  ONLINE       0     0    23

errors: Permanent errors have been detected in the following files:

        /mnt/mypool/family/path/to/video/of/my/son.mp4


My server is an HPE Microserver Gen10. The only things that can be swapped out are disks and RAM. It does have ECC RAM, so I'm not sure if RAM is a likely candidate for all of this...?

The prospect of having to get a replacement for my server makes me wonder if I should continue to try and fix things. Given the current TCO (especially since Putin's shenanigans) and how my life got a lot busier since I became a dad, I wonder if my family's needs aren't better served with a cloud solution instead. I've turned off my server for now to think this over.
 

joeschmuck

My next piece of advice: Fix your pool then test your hardware for stability.

So it looks like you will need to try to copy the one file listed in the scrub output, then delete it. Then I'd run another scrub myself (zpool scrub fishtank), and if the only errors you have are the CKSUM ones, run zpool clear fishtank to clear them. Lastly, run another zpool status -v to verify the errors are all gone. Something you might want to do as well is make a copy of all your important data while you can.
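To make the "only CKSUM errors" condition concrete, here is a small Python sketch that reads the config table from saved `zpool status -v` text. The helper names `device_errors` and `only_cksum_errors` are hypothetical, and the sketch assumes the counters are printed as plain integers (very large counts may be abbreviated by zpool, which this does not handle). The sample table is the one posted above.

```python
import re

# A config row: name, state, then the READ, WRITE and CKSUM counters.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def device_errors(status_text):
    """Return {name: (read, write, cksum)} for every row of the config table."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            name, _state, read, write, cksum = m.groups()
            counts[name] = (int(read), int(write), int(cksum))
    return counts


def only_cksum_errors(status_text):
    """True if no row has READ/WRITE errors and at least one has CKSUM errors."""
    counts = list(device_errors(status_text).values())
    no_io_errors = all(r == 0 and w == 0 for r, w, _ in counts)
    return no_io_errors and any(c > 0 for _, _, c in counts)


# Config table copied from the zpool status output above.
STATUS = """\
    NAME                                            STATE     READ WRITE CKSUM
    fishtank                                        ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/1de5ca4c-f7aa-11e7-aa45-98f2b3ebbf98  ONLINE       0     0     2
        gptid/1f0a9ef7-f7aa-11e7-aa45-98f2b3ebbf98  ONLINE       0     0    23
"""

for name, (read, write, cksum) in device_errors(STATUS).items():
    print(f"{name}: READ={read} WRITE={write} CKSUM={cksum}")
print("only CKSUM errors:", only_cksum_errors(STATUS))
```

The header line is skipped automatically because "STATE" is not one of the recognised state keywords.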

As for the stability, run the typical CPU and RAM burn-in testing. Let them run for a while. These two tests will tell you if there is a main component going bad. But know this: if the CPU test fails, that doesn't mean the CPU is bad, it means you have a problem to figure out. It could be the power supply or a bad capacitor on the motherboard; a failed test is just an indicator that your system is not stable.

Now if you run those two tests, the CPU test for 24 hours, then the RAM test for at least several days (I have no idea how much RAM you have nor how fast it will test), and both pass, then your system is likely stable. I like to use The Ultimate Boot CD and just boot the system from that; no need to do anything with your hardware. This CD (you can make it a bootable USB flash drive if desired) has several CPU stress tests and RAM tests.

Cloud solutions are fine depending on your needs. Some cloud solutions are free if you only need a small amount of space, or cost little money, especially when you compare it to a new server. If your needs are low, maybe the cloud is for you. We here will never tell you that you must rebuild your TrueNAS server; do what is right for you.
 

Dwarf Cavendish

Contributor
Joined
Dec 19, 2017
Messages
121
Okay, have now set out to do the following:
  1. I have looked at the reported file. I can't discover anything corrupt with it. I also diffed it against back-ups of different ages. All binary equal.
  2. I have started the SMART tests. Looks like that will take up most of the day.
  3. I deleted the reported file and then replaced it with a copy from one of my back-ups.
  4. I can't really tell whether SMART tests have finished...?
  5. Tomorrow I will see if I can get my pool in a healthy state again, as per @joeschmuck's advice.
  6. When the disks are fine and the pool is healthy again I'll think of a replacement to resume operation (basically just transplant the disks into a new system).
Looking into cloud solutions made me realise why I was so happy with my NAS: almost all of them do content scanning, and there are stories of accounts getting shut down over alleged child pornography or flagged copyrighted content without the possibility of recourse.

Final thought: even though this situation sucks it really makes me appreciate that my NAS is actually telling me that something is wrong!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I can't really tell whether SMART tests have finished...?
If you are performing the SMART Long/Extended Test, it takes about 478 minutes for your drives (assuming they are all the same model), so I'd wait 8 hours and then get another status with smartctl -a /dev/ada0, and the same for each of the other drives you tested. If your NAS is active (moving data around) then the SMART test will take a little longer, as the test has a lower priority. You can post the results of each drive and we can examine them, or you can examine the results yourself.
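As a rough illustration of where the 8 hour figure comes from, here is a minimal Python sketch that pulls the recommended polling time for the extended self-test out of smartctl -a output and converts it to hours. The function name extended_test_minutes and the sample string are made up for this example; the regular expression assumes the two-line layout that smartctl 7.2 prints in this thread.

```python
import re
from typing import Optional


def extended_test_minutes(smartctl_output: str) -> Optional[int]:
    """Return the recommended polling time of the extended self-test in minutes.

    Returns None when the expected lines are not present in the output.
    """
    match = re.search(
        r"Extended self-test routine\s+"
        r"recommended polling time:\s+\(\s*(\d+)\)\s+minutes",
        smartctl_output,
    )
    return int(match.group(1)) if match else None


# Two lines copied from the ada0 report in this thread.
sample = """Extended self-test routine
recommended polling time:      ( 478) minutes.
"""

minutes = extended_test_minutes(sample)
hours, remainder = divmod(minutes, 60)
print(f"{minutes} minutes is about {hours} h {remainder} min")
```

The value is only the drive's own estimate for an idle disk, so a busy pool can stretch it.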

After you deleted the file and restored it, did you run another scrub to ensure the pool looked fine? You may still have CKSUM errors, but the scrub should complete with "0B" repaired and "0" errors. The CKSUM errors will go away when you clear the errors with the command I provided above.
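For anyone who wants to script that check, the following Python sketch looks at the scan line of zpool status output and reports whether the last scrub finished with "0B" repaired and "0" errors. The helper name scrub_is_clean is hypothetical, and the pattern assumes the scan line wording that this version of TrueNAS prints; other ZFS releases may word it differently.

```python
import re

# Assumed layout: "scan: scrub repaired 0B in 03:12:23 with 0 errors on <date>"
SCAN_RE = re.compile(
    r"scan: scrub repaired (?P<repaired>\S+) in (?P<duration>.+?) "
    r"with (?P<errors>\d+) errors"
)


def scrub_is_clean(zpool_status: str) -> bool:
    """True when the last completed scrub repaired nothing and found no errors."""
    match = SCAN_RE.search(zpool_status)
    if match is None:
        # No completed scrub line found, so the pool state is not verified.
        return False
    return match.group("repaired") == "0B" and int(match.group("errors")) == 0


sample = "  scan: scrub repaired 0B in 03:12:23 with 0 errors on Mon Apr 17 10:43:27 2023"
print(scrub_is_clean(sample))
```

This only inspects the scan line; the READ, WRITE and CKSUM columns still need a look (or a zpool clear) separately.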

Lastly, I would stress test the system with the hard drives installed, so that they place their normal power load on the power supply. To preserve your data, do not run any more hard drive tests after the SMART tests (meaning third-party hard drive testers). While it might be an obvious statement, some things still must be said.

When the disks are fine and the pool is healthy again I'll think of a replacement to resume operation (basically just transplant the disks into a new system).
While this is your choice, unless you have been having a lot of problems, I would just retain the current system and test it. Your hard drives are old, several years past the warranty. It is very likely the one drive is failing, not your system, but my opinion is that you should test the entire system to prove to yourself that it is reliable. One failure data point is not good enough for me to replace a hard drive. If the same drive throws another error like this in the near future, I'd replace it. My entire reason for posting in this thread in the first place was to not jump to conclusions without proper testing. It very well may be a failing hard drive, but for many people that is a costly item. Remember, the hard drive is the one consumable item, expected to be replaced every 3 to 5 years. Most quality servers and components are going to last longer than 10 years.
 

Dwarf Cavendish

Contributor
Joined
Dec 19, 2017
Messages
121
Well, here are the SMART results:

ada0:
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K2ES26KD
LU WWN Device Id: 5 0014ee 2ba0434d8
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 17 07:27:09 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (45000) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 478) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   163   163   021    Pre-fail  Always       -       6850
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       113
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43260
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       113
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       126
194 Temperature_Celsius     0x0022   125   108   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     43248         -
# 2  Extended offline    Completed without error       00%     42907         -
# 3  Extended offline    Completed without error       00%     42500         -
# 4  Extended offline    Completed without error       00%     42164         -
# 5  Extended offline    Completed without error       00%     41828         -
# 6  Extended offline    Completed without error       00%     41493         -
# 7  Extended offline    Completed without error       00%     41085         -
# 8  Extended offline    Completed without error       00%     40749         -
# 9  Extended offline    Completed without error       00%     40342         -
#10  Extended offline    Completed without error       00%     40006         -
#11  Extended offline    Completed without error       00%     39622         -
#12  Extended offline    Completed without error       00%     39287         -
#13  Extended offline    Completed without error       00%     38690         -
#14  Extended offline    Completed without error       00%     38306         -
#15  Extended offline    Completed without error       00%     38017         -
#16  Extended offline    Completed without error       00%     37426         -
#17  Extended offline    Completed without error       00%     37227         -
#18  Extended offline    Completed without error       00%     36892         -
#19  Extended offline    Completed without error       00%     36524         -
#20  Extended offline    Completed without error       00%     36189         -
#21  Extended offline    Completed without error       00%     35781         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


And ada1:
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K1XR4UJX
LU WWN Device Id: 5 0014ee 264ad66a2
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 17 07:27:18 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (44520) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 472) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   160   160   021    Pre-fail  Always       -       6983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       113
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43259
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       113
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       125
194 Temperature_Celsius     0x0022   125   108   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     43247         -
# 2  Extended offline    Completed without error       00%     42906         -
# 3  Extended offline    Completed without error       00%     42499         -
# 4  Extended offline    Completed without error       00%     42163         -
# 5  Extended offline    Completed without error       00%     41828         -
# 6  Extended offline    Completed without error       00%     41492         -
# 7  Extended offline    Completed without error       00%     41085         -
# 8  Extended offline    Completed without error       00%     40749         -
# 9  Extended offline    Completed without error       00%     40341         -
#10  Extended offline    Completed without error       00%     40006         -
#11  Extended offline    Completed without error       00%     39622         -
#12  Extended offline    Completed without error       00%     39286         -
#13  Extended offline    Completed without error       00%     38689         -
#14  Extended offline    Completed without error       00%     38305         -
#15  Extended offline    Completed without error       00%     38016         -
#16  Extended offline    Completed without error       00%     37426         -
#17  Extended offline    Completed without error       00%     37227         -
#18  Extended offline    Completed without error       00%     36891         -
#19  Extended offline    Completed without error       00%     36524         -
#20  Extended offline    Completed without error       00%     36188         -
#21  Extended offline    Completed without error       00%     35781         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


So both long tests completed without error and the only thing really different is that the multizone error rate of ada1 has increased by 1.
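Comparing two reports by eye is error-prone, so here is a small Python sketch that parses the attribute table from two smartctl -a outputs and lists the attributes whose raw values differ. It works the same way for two drives or for one drive at two points in time. The names parse_attributes and diff_raw are invented for this example, and the parser assumes the ten-column table layout shown above.

```python
from typing import Dict, Tuple


def parse_attributes(smartctl_output: str) -> Dict[int, Tuple[str, int]]:
    """Map attribute ID to (name, raw value) for each row of the attribute table."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw = int(fields[9])
            except ValueError:
                continue  # skip rows whose raw value is not a plain integer
            attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def diff_raw(a: Dict[int, Tuple[str, int]], b: Dict[int, Tuple[str, int]]):
    """Return {id: (name, raw_a, raw_b)} for attributes whose raw values differ."""
    return {
        attr_id: (a[attr_id][0], a[attr_id][1], b[attr_id][1])
        for attr_id in sorted(a.keys() & b.keys())
        if a[attr_id][1] != b[attr_id][1]
    }


# Three rows each, copied from the ada0 and ada1 reports above.
ada0 = """
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       126
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
"""
ada1 = """
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       125
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2
"""

for attr_id, (name, left, right) in diff_raw(parse_attributes(ada0), parse_attributes(ada1)).items():
    print(f"{attr_id:>3} {name}: {left} vs {right}")
```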

Here is the thing though: if I have a mirrored vdev, irreparable data corruption shouldn't happen unless both disks failed on the same bit of data or there was some other hardware problem, right? Since both disks' SMART tests are still looking good, I think that I'll start out doing memtesting once I've got my pool repaired (just now kicked off the scrub).

I hope that it's RAM, because that I can easily swap out (I suppose that ECC can only do so much with bad RAM). And in that case I'll replace ada1 with a fresh disk so that they're less likely to both fail around the same time.

Anyway, thank you all for bearing with me so far :smile: .
 

Dwarf Cavendish

Contributor
Joined
Dec 19, 2017
Messages
121
Code:
  pool: fishtank
 state: ONLINE
  scan: scrub repaired 0B in 03:12:23 with 0 errors on Mon Apr 17 10:43:27 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    fishtank                                        ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/1de5ca4c-f7aa-11e7-aa45-98f2b3ebbf98  ONLINE       0     0     0
        gptid/1f0a9ef7-f7aa-11e7-aa45-98f2b3ebbf98  ONLINE       0     0     0

errors: No known data errors


Scrub successful, and everything at 0 in 1 go :smile: .
As you might already have gathered: I stopped redacting the name of my pool, hence the seemingly inconsistent pool name.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
First of all, I'm very glad to see things are looking up for you. All the errors are now gone.

Now for the advice.
1. I would highly recommend that you run a daily SMART Short Test since it appears you are only running a Long test twice a month. A Short test takes 2 minutes to run and it's a quick indicator of drive failure.
2. Keep an eye on IDs 1, 5, 196, 197, and of course 200. For your drives, when these start to increment it's a sign of pending failure. The exception is ID 200 (Multi_Zone_Error_Rate), which I've seen go both ways: sometimes it's critical, and sometimes, so long as the other values are good, it is not an issue. But if I see a trend of one error factor continuing to increase, I'd pay attention to the writing on the wall.
3. You have a lot of hours on both of your drives, so I highly recommend you shop for replacement drives before one fails hard. Make sure you purchase a CMR drive model, as it's very easy to accidentally purchase an SMR drive. If you plan to increase the capacity, now is the time.
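The watch list in item 2 can be turned into a simple check. The Python sketch below takes raw values keyed by attribute ID and reports which watched IDs are non-zero and which have grown since an earlier reading. The function names and the two sample dictionaries are illustrative only; the values mirror the zeros and the single ID 200 count of 2 from the reports posted in this thread, and the "previous" reading is a made-up baseline.

```python
from typing import Dict, List

# IDs named in the advice above: 1, 5, 196, 197 and 200.
WATCHED_IDS = (1, 5, 196, 197, 200)


def nonzero_watched(raw: Dict[int, int]) -> List[int]:
    """Watched attribute IDs whose raw value is above zero."""
    return [attr_id for attr_id in WATCHED_IDS if raw.get(attr_id, 0) > 0]


def increased_watched(previous: Dict[int, int], current: Dict[int, int]) -> List[int]:
    """Watched attribute IDs whose raw value grew between two readings."""
    return [
        attr_id
        for attr_id in WATCHED_IDS
        if current.get(attr_id, 0) > previous.get(attr_id, 0)
    ]


# Hypothetical baseline reading followed by the ada1 values from this thread.
previous = {1: 0, 5: 0, 196: 0, 197: 0, 200: 0}
current = {1: 0, 5: 0, 196: 0, 197: 0, 200: 2}

print("non-zero:", nonzero_watched(current))
print("increased:", increased_watched(previous, current))
```

A single non-zero ID 200 is not a verdict on its own, as noted above; the trend across several readings is what matters.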

Here is the thing though: if I have a mirrored vdev, irreparable data corruption shouldn't happen unless both disks failed on the same bit of data or there was some other hardware problem, right? Since both disks' SMART tests are still looking good, I think that I'll start out doing memtesting once I've got my pool repaired (just now kicked off the scrub).
I wish I could answer this for you. I personally have not looked into why a mirror would lose or corrupt data when one drive in the mirror is failing, but it could be those crazy bit-flips that happen during a solar event. I've read it does actually happen, but I honestly do not know for sure. Maybe someone will chime in to explain, or quite possibly the information is somewhere on this forum already and you just need to search for it.

Best of luck to you and glad your problem is solved, for now.
 