Pool Degraded - Replaced 2 Bad Hard Drives

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
Hello All,
First off:
Here is my hardware setup:

CPU: i3 - 4160.
Memory: 32 GB ECC DRAM
Case/Power Supply: SUPERMICRO CSE-743TQ-865B-SQ
Motherboard: X10SL7-F
USB: SSD 16GB
HDD: 8 x WD Red 6TB 5400RPM WD60EFRX
OS: TrueNAS-13.0-U5.2


Recently my pool went down and I saw an error reported for /dev/da5. I replaced that drive and resilvered, and less than a day later another drive started failing, so I replaced that one as well. That alone would not concern me, but in neither case did the pool return to 100% Online. I expected the pool to recover after resilvering the second replacement drive, but no luck. Here are the steps I have taken so far:

- Ran a scrub: completed with 9 errors. Which drive do they belong to, though?
- Currently running a long SMART test, due to complete by tomorrow morning.
- Here is the output of my zpool status:

zpool status -v
pool: NAS2
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 09:15:06 with 9 errors on Tue Jan 16 19:39:06 2024
config:

NAME                                            STATE     READ WRITE CKSUM
NAS2                                            DEGRADED     0     0
  raidz1-0                                      DEGRADED     0     0     0
    gptid/a07c093b-902a-11e6-9369-0cc47a6c7ce8  DEGRADED     0     0    18  too many errors
    gptid/a137c8b6-902a-11e6-9369-0cc47a6c7ce8  DEGRADED     0     0    18  too many errors
    gptid/23e26fbe-ff02-11ec-80d3-0cc47a6c7ce8  DEGRADED     0     0    18  too many errors
    gptid/72f00759-79c5-11ed-8f2f-0cc47a6c7ce8  DEGRADED     0     0    18  too many errors
    gptid/a3608b0f-902a-11e6-9369-0cc47a6c7ce8  DEGRADED     0     0    18  too many errors
    gptid/7cc276f3-b376-11ee-a269-0cc47a6c7ce8  ONLINE       0     0    18
    gptid/b25e6353-b447-11ee-ac0b-0cc47a6c7ce8  ONLINE       0     0    18
    gptid/a5bb4a90-902a-11e6-9369-0cc47a6c7ce8  DEGRADED     0     0    18  too many errors
errors: Permanent errors have been detected in the following files:
/mnt/NAS2/iocage/jails/qbittorrent/root/Downloads/PS2 Pack-4/Godof War (USA).7z
<0x1ba>:<0x18a044>
pool: freenas-boot
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:01:40 wit

- Attached is a screenshot of the pool drive status:
da5 and da6 are the drives I recently replaced. They had these errors: Device: /dev/daX [SAT], 8 Currently unreadable (pending) sectors.
- I have the following questions:
Where can I find the scrub output that identifies the drive with the 9 errors?
If I delete the file with permanent errors, would that fix my pool?
Are all 8 drives really failing? Will I have to replace all of them?

I can still access my files, and I have backed up the most important ones (family pictures). Is TrueNAS really so resilient that I can still access my files with 6 drives reporting errors?

My plan:
- Find the failing drive, replace it, resilver, and get NAS2 stable again.


[Attached screenshot: 1705471133613.png]


* Device: /dev/da6 [SAT], 8 Currently unreadable (pending) sector
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
HDD: 8 x WD Red 6TB 5400RPM
Which exact model? WD Red Plus drives should all be CMR, but plain WD Red drives are not necessarily. Are you sure these are CMR drives and not SMR?
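
For anyone wanting to check this from the shell, here is a minimal sketch that pulls the model string out of smartctl output so it can be looked up in the manufacturer's CMR/SMR listing. The function name `device_model` is made up for illustration, and it assumes the ATA-style "Device Model:" line that smartctl prints for SATA drives.

```shell
# Hypothetical helper: read `smartctl -i` (or `-a`) text on stdin and print
# the value of the first "Device Model:" line.
device_model() {
    awk -F': *' '/^ *Device Model:/ { print $2; exit }'
}

# Usage on a live system (device name taken from the thread):
#   smartctl -i /dev/da1 | device_model
```

The printed string is what to compare against the vendor's list; the script itself makes no CMR/SMR judgment.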

errors: Permanent errors have been detected in the following files:
/mnt/NAS2/iocage/jails/qbittorrent/root/Downloads/PS2 Pack-4/Godof War (USA).7z
<0x1ba>:<0x18a044>
Someone more knowledgeable will surely chime in. First of all, did you try replacing that file? Basically, ZFS is telling you that the file is corrupted and could not be repaired. My guess is that the corruption occurred while the pool had no parity left.

On the other hand, I still think there is some other issue, since so many drives are reported as degraded. Can you state how the drives are connected? Are they all connected via SAS?
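
To see at a glance which devices are carrying checksum errors, the device table from zpool status can be filtered. This is a rough sketch, assuming the usual NAME/STATE/READ/WRITE/CKSUM column order; `cksum_errors` is a hypothetical helper name, and counts that ZFS abbreviates (for example with a K suffix) are skipped by this filter.

```shell
# Hypothetical helper: read `zpool status` text on stdin and print the name
# and CKSUM count of every row whose CKSUM column (5th field) is non-zero.
cksum_errors() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ &&
         $5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }'
}

# Usage on a live system (pool name taken from the thread):
#   zpool status NAS2 | cksum_errors
```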

Two words of advice though:

USB: 16GB USB Thumb Drive
From the documentation:
You do not need an SSD boot device, but we discourage using a spinner or a USB stick. We do not recommend installing TrueNAS on a single disk or striped pool unless you have a good reason to do so. You can install and run TrueNAS without any data devices, but we strongly discourage it.

An 8-wide RAIDZ1 is way too wide! You should be using at least RAIDZ2 for 8 drives. There are various threads around here (I couldn't quickly find one that matches well) that discourage the use of RAIDZ1 altogether.
Basically, during a resilver you stress the 7 remaining drives with no parity left.
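
As a back-of-the-envelope comparison of the two layouts for this build (8 drives of 6 TB each), usable space is roughly (drives - parity) x drive size. This ignores ZFS metadata, padding, and the TB/TiB difference, so treat it as an estimate only; `raidz_usable_tb` is a made-up helper name.

```shell
# Rough usable capacity of a single RAIDZ vdev in TB:
# (number of drives - parity drives) * size of one drive.
# Arguments: <drives> <size_tb> <parity>
raidz_usable_tb() {
    drives=$1
    size_tb=$2
    parity=$3
    echo $(( (drives - parity) * size_tb ))
}

raidz_usable_tb 8 6 1   # RAIDZ1, 7 data drives: prints 42
raidz_usable_tb 8 6 2   # RAIDZ2, 6 data drives: prints 36
```

By this estimate RAIDZ2 gives up about one drive's worth of space in exchange for still having parity while a failed drive is being resilvered.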

Do you have proper backups?
 

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
Thanks for the quick response.
I had so much detail in my message that I got lazy and copy/pasted from my previous posts, but I have now updated my hardware specs using edit. You also just clued me in on the SMR vs. CMR debate. I did this build years ago (2015, to be exact), when I was new to FreeNAS, and didn't realize this was a limiting factor. My drives are definitely SMR.

Regarding the 8 Wide RAIDZ1 comment. I did a quick search and I would be okay moving to RAIDZ2, however I would prefer not to lose the data. Is there a way to move over to RAIDZ2 without losing the data?

I wish I had posted this sooner, because I just purchased some more WD Red drives, which are not CMR, as replacements. I might be able to return them and switch to CMR drives.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Regarding the 8 Wide RAIDZ1 comment. I did a quick search and I would be okay moving to RAIDZ2, however I would prefer not to lose the data. Is there a way to move over to RAIDZ2 without losing the data?
With a degraded 8-wide RAIDZ1 pool on SMR drives, it's not really a question of whether you prefer to lose the data.

I can still access my files and backed up the most important files (family pictures), is Truenas that good that with 6 drives with errors I can still access
Verify that backup. And in the future, make regular backups of your files. RAID is for availability, not for backup. You still need backups, even with a sound RAID setup. I know that doesn't really help you now, but the fact that you are still able to access and back up your files is very good. Looks like you dodged a bullet here.
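
One simple way to verify a backup is to compare it file by file against the source. This is a minimal sketch, assuming both trees are mounted locally and that the installed diff supports -q (GNU and FreeBSD diff both do); the paths in the usage comment are placeholders and `verify_backup` is a hypothetical name.

```shell
# Hypothetical helper: recursively compare a source tree with its backup.
# Prints one line per file that differs or exists on only one side, and
# returns 0 only when the two trees are identical.
verify_backup() {
    diff -rq "$1" "$2"
}

# Usage (placeholder paths):
#   verify_backup /mnt/NAS2/pictures /mnt/backup/pictures && echo "backup matches"
```

Because diff reads every file on both sides, a read error on the degraded pool would also show up here rather than going unnoticed.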

I'd wait for someone else to chime in, but if all your drives are definitely SMR and in a wide RAIDZ1, I'd strongly suggest you plan on rebuilding with CMR drives and moving on from RAIDZ1. Others may recommend something other than RAIDZ2, but personally that would be my lower limit for 8 drives.
 

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
Just an update: I completed the long SMART tests, and da0 had one read error.
All my other drives completed the long SMART test without errors.

I will replace the drive, since it was a read error.

However I still have the following questions:

Would deleting the file solve this read error?
Would one drive impact all the other drives being degraded?
I am expecting this will be the final fix, however I am at a loss as to what else it could be if not.
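
For reference, a long test is started with smartctl -t long and its result ends up in the drive's self-test log. The sketch below filters that log for entries that did not complete cleanly; `failed_selftests` is a hypothetical helper name, and it assumes the ATA self-test log layout that smartctl shows for SATA drives.

```shell
# Hypothetical helper: read `smartctl -l selftest` (or `-a`) text on stdin
# and print every self-test entry that is not "Completed without error".
# Returns 0 if at least one such entry was found, non-zero otherwise.
failed_selftests() {
    grep -E '^ *# *[0-9]+ ' | grep -v 'Completed without error'
}

# Usage on a live system (device name taken from the thread):
#   smartctl -t long /dev/da0                      # start the long test
#   smartctl -l selftest /dev/da0 | failed_selftests
```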
 

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
Hello All,
Really looking for some help here. I replaced da0, the drive with the read failure in the long SMART test, but all the other drives are still degraded. I ran long SMART tests on all of them without issues; the results are below. I am looking for anything else I can run to fix this.

Thanks,

Code:
root@NAS2:~ # sudo smartctl -a /dev/da1
    smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
   
    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Red
    Device Model:     WDC WD60EFRX-68L0BN1
    Serial Number:    WD-WX31DA5LH4C7
    LU WWN Device Id: 5 0014ee 2b7a83add
    Firmware Version: 82.00A82
    User Capacity:    6,001,175,126,016 bytes [6.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    5700 rpm
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Sun Jan 21 07:52:22 2024 PST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
   
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
   
    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                ( 4604) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 700) minutes.
    Conveyance self-test routine
    recommended polling time:        (   5) minutes.
    SCT capabilities:              (0x303d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
   
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   210   196   021    Pre-fail  Always       -       8458
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       74
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   015   015   000    Old_age   Always       -       62615
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       74
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       72
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1795
    194 Temperature_Celsius     0x0022   120   093   000    Old_age   Always       -       32
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
   
    SMART Error Log Version: 1
    No Errors Logged
   
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%     62542         -
    # 2  Extended offline    Completed without error       00%     62522         -
    # 3  Short offline       Completed without error       00%     32969         -
    # 4  Short offline       Completed without error       00%     32968         -
    # 5  Short offline       Completed without error       00%     32967         -
    # 6  Short offline       Completed without error       00%     32966         -
    # 7  Short offline       Completed without error       00%     32965         -
    # 8  Short offline       Completed without error       00%     32964         -
    # 9  Short offline       Completed without error       00%     32963         -
    #10  Short offline       Completed without error       00%     32962         -
    #11  Short offline       Completed without error       00%     32961         -
    #12  Short offline       Completed without error       00%     32960         -
    #13  Short offline       Completed without error       00%     32959         -
    #14  Short offline       Completed without error       00%     32958         -
    #15  Short offline       Completed without error       00%     32957         -
    #16  Short offline       Completed without error       00%     32956         -
    #17  Short offline       Completed without error       00%     32955         -
    #18  Short offline       Completed without error       00%     32954         -
    #19  Short offline       Completed without error       00%     32953         -
    #20  Short offline       Completed without error       00%     32952         -
    #21  Short offline       Completed without error       00%     32951         -
   
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
   
    RESULT IN SYSTEM FAILURE.
   
    root@NAS2:~ # sudo smartctl -a /dev/da2
    smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
   
    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Red
    Device Model:     WDC WD60EFZX-68B3FN0
    Serial Number:    WD-C82D9T7K
    LU WWN Device Id: 5 0014ee 26a2a5021
    Firmware Version: 81.00A81
    User Capacity:    6,001,175,126,016 bytes [6.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    5640 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-3 T13/2161-D revision 5
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Sun Jan 21 22:46:41 2024 PST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
   
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
   
    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (  224) seconds.
    Offline data collection
    capabilities:                    (0x11) SMART execute Offline immediate.
                                            No Auto Offline data collection support.
                                            Suspend Offline collection upon new
                                            command.
                                            No Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            No Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 695) minutes.
    SCT capabilities:              (0x303d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
   
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   197   182   021    Pre-fail  Always       -       7141
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       31
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12268
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       20
    194 Temperature_Celsius     0x0022   118   092   000    Old_age   Always       -       34
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

   
    SMART Error Log Version: 1
    No Errors Logged
 


SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12178         -
# 2  Extended offline    Completed without error       00%     12159         -
# 3  Short offline       Completed without error       00%         2         -

Selective Self-tests/Logging not supported

root@NAS2:~ # sudo smartctl -a /dev/da3
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFZX-68B3FN0
Serial Number:    WD-C82DJ18K
LU WWN Device Id: 5 0014ee 26a2a3aab
Firmware Version: 81.00A81
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5640 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 21 22:48:36 2024 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (62160) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 658) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   200   185   021    Pre-fail  Always       -       7000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8528
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   118   094   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8438         -
# 2  Extended offline    Completed without error       00%      8418         -

Selective Self-tests/Logging not supported

root@NAS2:~ # sudo smartctl -a /dev/da4
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Serial Number:    WD-WX31DA5LH6X3
LU WWN Device Id: 5 0014ee 2b7a8a453
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 21 22:51:24 2024 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 1724) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 671) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   196   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   211   197   021    Pre-fail  Always       -       8441
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       71
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   015   015   000    Old_age   Always       -       62571
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       71
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       69
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1737
194 Temperature_Celsius     0x0022   119   093   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     62483         -
# 2  Extended offline    Completed without error       00%     62463         -
# 3  Short offline       Completed without error       00%     32969         -
# 4  Short offline       Completed without error       00%     32968         -
# 5  Short offline       Completed without error       00%     32967         -
# 6  Short offline       Completed without error       00%     32966         -
# 7  Short offline       Completed without error       00%     32965         -
# 8  Short offline       Completed without error       00%     32964         -
# 9  Short offline       Completed without error       00%     32963         -
#10  Short offline       Completed without error       00%     32962         -
#11  Short offline       Completed without error       00%     32961         -
#12  Short offline       Completed without error       00%     32960         -
#13  Short offline       Completed without error       00%     32959         -
#14  Short offline       Completed without error       00%     32958         -
#15  Short offline       Completed without error       00%     32957         -
#16  Short offline       Completed without error       00%     32956         -
#17  Short offline       Completed without error       00%     32955         -
#18  Short offline       Completed without error       00%     32954         -
#19  Short offline       Completed without error       00%     32953         -
#20  Short offline       Completed without error       00%     32952         -
#21  Short offline       Completed without error       00%     32951         -


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@NAS2:~ # sudo smartctl -a /dev/da7
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68MYMN1
Serial Number:    WD-WX11DC4498E0
LU WWN Device Id: 5 0014ee 20bb8f5e6
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 21 22:56:30 2024 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 3824) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 692) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   229   189   021    Pre-fail  Always       -       7525
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       72
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   015   015   000    Old_age   Always       -       62703
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       71
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2195
194 Temperature_Celsius     0x0022   121   098   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     62614         -
# 2  Extended offline    Completed without error       00%     62595         -
# 3  Short offline       Completed without error       00%     33036         -
# 4  Short offline       Completed without error       00%     33035         -
# 5  Short offline       Completed without error       00%     33034         -
# 6  Short offline       Completed without error       00%     33033         -
# 7  Short offline       Completed without error       00%     33032         -
# 8  Short offline       Completed without error       00%     33031         -
# 9  Short offline       Completed without error       00%     33030         -
#10  Short offline       Completed without error       00%     33029         -
#11  Short offline       Completed without error       00%     33028         -
#12  Short offline       Completed without error       00%     33027         -
#13  Short offline       Completed without error       00%     33026         -
#14  Short offline       Completed without error       00%     33025         -
#15  Short offline       Completed without error       00%     33024         -
#16  Short offline       Completed without error       00%     33023         -
#17  Short offline       Completed without error       00%     33022         -
#18  Short offline       Completed without error       00%     33021         -
#19  Short offline       Completed without error       00%     33020         -
#20  Short offline       Completed without error       00%     33019         -
#21  Short offline       Completed without error       00%     33018         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing

 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Please use [#CODE][#/CODE] tags (without the #).

You didn't answer that question, although I doubt it will be a bad HBA.
Can you state how the drives are connected? Are all of them connected via SAS?

Code:
# 1 Extended offline Completed without error 00% 62542 -
# 2 Extended offline Completed without error 00% 62522 -
# 3 Short offline Completed without error 00% 32969 -

You do realize you didn't complete any SMART tests over the last 30000 hours :eek:

I wish I could help you, but this is above my knowledge.
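
To make the gap in the self-test log easier to spot, here is a minimal Python sketch that reads self-test rows like the ones quoted above and reports the longest stretch of power-on hours without a logged test. The function names `lifetime_hours` and `largest_gap` are made up for illustration, and the parsing assumes the row layout shown in this thread. Keep in mind the drive only retains the most recent 21 entries, so this only tells you about gaps inside that window.

```python
import re

# One self-test row, e.g. "# 3  Short offline  Completed without error  00%  32969  -"
# Group 4 is the LifeTime(hours) column; the layout is assumed from the output in this thread.
ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+\S+\s*$")


def lifetime_hours(log_text):
    """Return the LifeTime(hours) value of every self-test row, in log order."""
    hours = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            hours.append(int(match.group(4)))
    return hours


def largest_gap(hours):
    """Return (gap, older, newer) for the widest gap between consecutive tests."""
    ordered = sorted(hours)
    gaps = [(newer - older, older, newer) for older, newer in zip(ordered, ordered[1:])]
    return max(gaps) if gaps else (0, None, None)


sample = """\
# 1 Extended offline Completed without error 00% 62542 -
# 2 Extended offline Completed without error 00% 62522 -
# 3 Short offline Completed without error 00% 32969 -
"""

gap, older, newer = largest_gap(lifetime_hours(sample))
print(f"Longest stretch without a logged self-test: {gap} hours ({older} -> {newer})")
```

For the three rows quoted above this reports a gap of 29553 hours, between hour 32969 and hour 62522.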
 

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
Ohh I missed that question:
The drives are connected via the Broadcom controller on the Supermicro X10SL7-F, which is SAS2.

I was also surprised by that log, because I have SMART tests scheduled in TrueNAS, along with scrubbing.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Please use [#CODE][#/CODE] tags (without the #).
You can edit your previous post ;)

I was also surprised by that log, because I have SMART tests scheduled in TrueNAS, along with scrubbing.
I have never had a SMART test not run -> you should definitely check why they did not run / complete.

My 2 cents: given 60k hours of lifetime, SMR drives, and all your errors, you should make sure there is no hardware failure (mainboard) and start fresh with a set of CMR drives. But as long as you still have access to your data, create and verify a backup if you haven't already.
Your drives have already been running for 7 years. Personally, I'd say they were worth the investment and may retire now.
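
For reference, the 7-year figure follows from the Power_On_Hours raw value in the SMART output (62703 for /dev/da7). A quick sketch of the arithmetic in Python, where the helper name `power_on_years` is just for illustration and a year is taken as 365.25 days:

```python
# Average year length in hours, counting leap years (24 * 365.25 = 8766).
HOURS_PER_YEAR = 24 * 365.25


def power_on_years(hours):
    """Convert a Power_On_Hours raw value into years of continuous running."""
    return hours / HOURS_PER_YEAR


# 62703 is the Power_On_Hours raw value reported for /dev/da7 in this thread.
print(f"{power_on_years(62703):.2f} years")
```

That works out to a little over 7 years of power-on time.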
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
What firmware are you running on your SAS controller?
 

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
What firmware are you running on your SAS controller?
Hello,
Sorry for the late response; I hope this thread hasn't been given up on and that you can still help me. Regarding the SAS controller, here is the information I found in the BIOS settings:
SAS2300-1
Bios Date: 2014.09.10
Revision ID: 05
Version: 7.39.00.00

The Main Firmware version is:
Supermicro X10SL7-F
Version: 3.0

I realize now that I should update the firmware. However, at this point I don't want to change the firmware and risk something else failing; I am a believer in not updating firmware in the middle of an issue. I am open to suggestions if updating the firmware would provide a solution.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
That's not the SAS controller firmware. Post the output of sas2flash -list in code tags.
 

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
That's not the SAS controller firmware. Post the output of sas2flash -list in code tags.
Sorry about that, I didn't realize that was the wrong information. Here is the output from the command you gave me:

Firmware Product ID : 0x2214 (IT)
Firmware Version : 20.00.04.00
NVDATA Vendor : LSI
NVDATA Product ID : LSI2308-IT
BIOS Version : 07.39.00.00


Hope that helps you help me :)
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
That's the same firmware I'm running, so at least that's good. Your previous post indicates that there is metadata corruption.
Code:
errors: Permanent errors have been detected in the following files:
       <0x1ba>:<0x18a044>

Your likely option forward is to rebuild your pool and restore from a backup source. Hopefully you have a backup source of data. I would STRONGLY suggest you move away from RAIDZ1 as well and add some more parity to your pool to avoid this situation in the future.
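
If it helps when comparing scrubs, here is a rough Python sketch that splits the `errors:` section of `zpool status -v` into entries that are ordinary file paths and entries like `<0x1ba>:<0x18a044>` that were printed as an object reference instead of a path. The function name `classify_errors` is hypothetical, and the sketch assumes that every unresolved entry ends in `:<0x...>` and that nothing else follows the error list in the text you feed it.

```python
import re

# An unresolved entry: something, then ":<0x...>". In this thread the first part is also
# a hex ID in angle brackets; treating any prefix as valid is an assumption of the sketch.
OBJECT_REF = re.compile(r"^(.+):<(0x[0-9a-fA-F]+)>$")


def classify_errors(status_text):
    """Return (paths, objects) from the 'errors:' section of zpool status -v output."""
    paths, objects = [], []
    in_errors = False
    for line in status_text.splitlines():
        entry = line.strip()
        if entry.startswith("errors:"):
            in_errors = True
            continue
        if not in_errors or not entry:
            continue
        match = OBJECT_REF.match(entry)
        if match:
            # Keep the dataset part as text and the object number as an integer.
            objects.append((match.group(1), int(match.group(2), 16)))
        else:
            paths.append(entry)
    return paths, objects


sample = """\
errors: Permanent errors have been detected in the following files:
       <0x1ba>:<0x18a044>
       /mnt/NAS2/iocage/jails/qbittorrent/root/Downloads/PS2 Pack-4/Godof War (USA).7z
"""

paths, objects = classify_errors(sample)
print("file paths:", paths)
print("unresolved objects:", objects)
```

Entries in the first list name files that can be restored or deleted individually; entries in the second list are the kind discussed above, where there is no file name to act on.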
 

Gamer0126

Dabbler
Joined
Mar 24, 2015
Messages
25
That's the same firmware I'm running, so at least that's good. Your previous post indicates that there is metadata corruption.
Code:
errors: Permanent errors have been detected in the following files:
       <0x1ba>:<0x18a044>

Your likely option forward is to rebuild your pool and restore from a backup source. Hopefully you have a backup source of data. I would STRONGLY suggest you move away from RAIDZ1 as well and add some more parity to your pool to avoid this situation in the future.
Thanks. I was able to delete the affected files, clear the pool errors, and run a full scrub. However, the error was still listed, and it did not go away until I found a separate procedure. To summarize the issues I faced and how I solved them:

- First Issue was with the Metadata Failures:
<0x1ba>:<0x18a044>
This was resolved by replacing the drives that the SMART tests showed as failed.

- Second Issue was all the errors with files:
Permanent errors have been detected in the following files:
/mnt/NAS2/iocage/jails/qbittorrent/root/Downloads/PS2 Pack-4/Godof War (USA).7z
along with other files


This was fixed by deleting the listed files with errors and then following this procedure:
I have no idea why completing the full scrub after clearing the zpool did not clear the error.

I hope this will help others in the future.
 