Degraded Pool - No Errors

dr_kiuchi

Cadet
Joined
Mar 17, 2024
Messages
8
Hi Folks,

Looking for some assistance here as I'm at my wits' end. I recently upgraded my pool, but notably failed to do an extensive burn-in on the drives. I replaced two of them after heat issues led to actual failures in S.M.A.R.T. tests. After replacing and resilvering, everything seemed to be OK. Later, after a scrub, ALL drives showed up in a degraded status. I spent god-only-knows how many hours reading through posts here about similar issues, then went through the following steps before posting.

1. SMART tests: short, conveyance, and long tests on each of the drives, with no errors and no failures. (More specifically, I paid attention to Reallocated_Sector_Ct and Current_Pending_Sector, which are zero across all drives.)

Results:
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST12000VN0008-2YS101
Serial Number:    ZR906K4V
LU WWN Device Id: 5 000c50 0e664bb2a
Firmware Version: SC60
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 17 08:05:47 2024 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  559) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (1048) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       217656664
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   045    Pre-fail  Always       -       48013411
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       522
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       31
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   034   000    Old_age   Always       -       39 (Min/Max 38/52)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       634
194 Temperature_Celsius     0x0022   039   066   000    Old_age   Always       -       39 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       242h+13m+04.236s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10840822454
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       83557889527

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       522         -
# 2  Extended offline    Completed without error       00%       521         -
# 3  Short offline       Completed without error       00%       498         -

----

root@truenas[~]# smartctl -a /dev/ada1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST12000VN0008-2YS101
Serial Number:    ZRT0SK89
LU WWN Device Id: 5 000c50 0e6e34d38
Firmware Version: SC60
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 17 08:09:55 2024 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (1041) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       233195960
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       28
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   045    Pre-fail  Always       -       48458018
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       522
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       31
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   034   000    Old_age   Always       -       38 (Min/Max 37/51)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       21
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       635
194 Temperature_Celsius     0x0022   038   066   000    Old_age   Always       -       38 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       242h+20m+46.898s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10841563747
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       83555086222

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       522         -
# 2  Extended offline    Completed without error       00%       521         -
# 3  Short offline       Completed without error       00%       498         -

----
root@truenas[~]# smartctl -a /dev/ada2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST12000VN0008-2YS101
Serial Number:    ZV70GQVJ
LU WWN Device Id: 5 000c50 0e7fa33f7
Firmware Version: SC60
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Mar 17 08:10:27 2024 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (1036) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       237145616
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   045    Pre-fail  Always       -       48020276
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       522
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       31
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   042   000    Old_age   Always       -       41 (Min/Max 41/55)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       634
194 Temperature_Celsius     0x0022   041   058   000    Old_age   Always       -       41 (0 19 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       242h+26m+20.094s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10839095938
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       83549278837

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       522         -
# 2  Extended offline    Completed without error       00%       520         -
# 3  Short offline       Completed without error       00%       498         -

----

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST12000VN0008-2YS101
Serial Number:    ZRT17J56
LU WWN Device Id: 5 000c50 0e838eae7
Firmware Version: SC60
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Mar 17 08:10:54 2024 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (1065) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   074   065   044    Pre-fail  Always       -       25827912
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   074   060   045    Pre-fail  Always       -       24312920
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       106
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   049   000    Old_age   Always       -       38 (Min/Max 38/51)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       10
194 Temperature_Celsius     0x0022   038   051   000    Old_age   Always       -       38 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       105h+35m+06.696s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       24179512580
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       16914344880

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       105         -
# 2  Extended offline    Completed without error       00%       104         -
# 3  Short offline       Completed without error       00%        82         -


-----

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST12000VN0008-2YS101
Serial Number:    ZRT18M7M
LU WWN Device Id: 5 000c50 0e8499305
Firmware Version: SC60
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 17 08:11:28 2024 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (1060) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   073   064   044    Pre-fail  Always       -       19635272
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   074   060   045    Pre-fail  Always       -       23665851
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       106
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   045   000    Old_age   Always       -       41 (Min/Max 40/55)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   041   055   000    Old_age   Always       -       41 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       105h+28m+29.563s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10545364484
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       30564777026

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       105         -
# 2  Extended offline    Completed without error       00%       104         -
# 3  Short offline       Completed without error       00%        82         -
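
For anyone wanting to repeat the step 1 check across several drives without reading each dump by eye, here is a minimal Python sketch. It assumes the text comes from `smartctl -a` in the layout shown above. The names `parse_smart`, `review`, `WATCHED` and `SAMPLE` are made up for this example, and watching UDMA_CRC_Error_Count and the negotiated SATA link speed is an addition that the original check did not mention. A lower negotiated speed can simply mean the drive sits on a 3.0 Gb/s port, so treat that note as something to look at, not as a fault.

```python
import re

# Counters from the post, plus the interface CRC counter (an added assumption).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

# Matches e.g. "SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)"
LINK = re.compile(r"SATA Version is:.*?([\d.]+) Gb/s \(current: ([\d.]+) Gb/s\)")


def parse_smart(text):
    """Return ({attribute: raw value string}, (max_gbps, current_gbps) or None)."""
    attrs, link = {}, None
    for line in text.splitlines():
        m = LINK.search(line)
        if m:
            link = (float(m.group(1)), float(m.group(2)))
            continue
        fields = line.split()
        # Attribute rows: numeric ID, name, hex flag, then seven more columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[fields[1]] = fields[9]
    return attrs, link


def review(text):
    """List anything worth a second look in one drive's smartctl output."""
    attrs, link = parse_smart(text)
    notes = [f"{name} raw value is {attrs[name]}"
             for name in WATCHED if attrs.get(name, "0") != "0"]
    if link and link[1] < link[0]:
        notes.append(f"link negotiated at {link[1]} Gb/s, drive supports {link[0]} Gb/s")
    return notes


# Lines copied from the ZV70GQVJ dump above.
SAMPLE = """\
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for note in review(SAMPLE):
        print(note)
```

Raw values are kept as strings because some attributes (Head_Flying_Hours, the temperature rows) are not plain integers.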


2. Checked cabling, routing, etc. Reseated all the drives and quadruple-checked. Ensured airflow and temps were OK. This time I moved the cabling from the mobo to the SAS controller for simplicity.

3. zpool clear plus another scrub, resulting in degraded status for all disks again.

4. Re-ran SMART tests (just in case): still no errors.

5. Ran another zpool status -v, which revealed a handful of media files where permanent errors have been detected. I haven't gone as far as deleting them yet.

Code:
pool: NewPool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 09:43:31 with 10 errors on Mon Mar 18 02:24:48 2024
config:

    NAME                                            STATE     READ WRITE CKSUM
    MyPool                                      DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/7c875ab1-db1a-11ee-abaf-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/013ed923-e0df-11ee-ad17-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/7cb37fd5-db1a-11ee-abaf-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/7cbedb7c-db1a-11ee-abaf-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/ed064fd0-e0de-11ee-ad17-94de806254ce  DEGRADED     0     0    20  too many errors

errors: Permanent errors have been detected in the following files:

        MyPool/Media@manual-2024-03-15_18-25:/TV Shows/XYZ.whatever
        MyPool/Media@manual-2024-03-15_18-25:/TV Shows/XYZ.whatever
        MyPool/Media@manual-2024-03-15_18-25:/Movies/XYZ.whatever
        MyPool/Media@manual-2024-03-15_18-25:/TV Shows/XYZ.whatever
        MyPool/Media@manual-2024-03-15_18-25:/TV Shows/XYZ.whatever
      
  pool: boot-pool
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
    Expect reduced performance.
action: Replace affected devices with devices that support the
    configured block size, or migrate data to a properly configured
    pool.
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Fri Mar 15 03:45:03 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        da3p2   ONLINE       0     0     0
        da4p2   ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: No known data errors
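
One detail in the NewPool/MyPool output above is that every raidz2 member reports the same CKSUM count (20). The Python sketch below pulls the per-device counters out of the config table so that pattern can be checked mechanically. It is only an illustration: `device_errors`, `leaf_cksums` and `SAMPLE` are hypothetical names, the `gptid/` prefix test is specific to how this pool labels its disks, and abbreviated counters such as `1.2K` are not handled. Identical counts on all members are often read as pointing toward something the disks share (controller, cabling, memory) more than toward one bad drive, but the table alone does not prove that.

```python
import re

# A config row looks like: "<name>  <STATE>  <READ> <WRITE> <CKSUM>  [note]"
ROW = re.compile(r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
                 r"\s+(\d+)\s+(\d+)\s+(\d+)")


def device_errors(status_text):
    """Map each row of the config table to (state, read, write, cksum)."""
    rows = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            name, state, read, write, cksum = m.groups()
            rows[name] = (state, int(read), int(write), int(cksum))
    return rows


def leaf_cksums(rows):
    """CKSUM counts for the disks only (assumes disks are listed as gptid/...)."""
    return {name: r[3] for name, r in rows.items() if name.startswith("gptid/")}


# Rows copied from the zpool status output above.
SAMPLE = """\
    NAME                                            STATE     READ WRITE CKSUM
    MyPool                                      DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/7c875ab1-db1a-11ee-abaf-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/013ed923-e0df-11ee-ad17-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/7cb37fd5-db1a-11ee-abaf-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/7cbedb7c-db1a-11ee-abaf-94de806254ce  DEGRADED     0     0    20  too many errors
        gptid/ed064fd0-e0de-11ee-ad17-94de806254ce  DEGRADED     0     0    20  too many errors
"""

if __name__ == "__main__":
    counts = leaf_cksums(device_errors(SAMPLE))
    print("all disks share one CKSUM count:", len(set(counts.values())) == 1)
```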


6. My next steps (once the current long re-test is complete) are as follows:
- Bring the old pool back online and attempt to replace the corrupted files
- Delete them and re-scrub
- Replace the controller and/or the cables
- Yeet the server out a window and build something from scratch
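
Before the delete-and-re-scrub step, it may help to look at where the damaged files actually live. Every entry in the permanent-errors list above has the form dataset@snapshot:/path, which suggests the affected blocks are referenced by the manual-2024-03-15_18-25 snapshot, so deleting only the live copies would presumably leave those entries in place until the snapshot itself is dealt with. The Python sketch below splits such entries apart. `ErrorEntry`, `parse_entry` and `snapshots_with_errors` are hypothetical names, and the split on the first ":/" is an assumption about the entry format based on the lines in this post.

```python
from typing import NamedTuple, Optional


class ErrorEntry(NamedTuple):
    dataset: str              # empty when the entry is a plain mounted path
    snapshot: Optional[str]   # None when the entry is not inside a snapshot
    path: str


def parse_entry(entry):
    """Split one line of the 'Permanent errors' list into its parts."""
    entry = entry.strip()
    head, sep, rest = entry.partition(":/")
    if not sep or head.startswith("/"):
        # No "dataset:/path" structure: treat the whole entry as a path.
        return ErrorEntry("", None, entry)
    dataset, _, snapshot = head.partition("@")
    return ErrorEntry(dataset, snapshot or None, "/" + rest)


def snapshots_with_errors(entries):
    """Return the set of dataset@snapshot names that hold damaged files."""
    found = set()
    for raw in entries:
        e = parse_entry(raw)
        if e.snapshot:
            found.add(f"{e.dataset}@{e.snapshot}")
    return found


if __name__ == "__main__":
    # Entry copied from the zpool status -v output in this post.
    lines = ["        MyPool/Media@manual-2024-03-15_18-25:/TV Shows/XYZ.whatever"]
    print(snapshots_with_errors(lines))
```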

Any feedback here would be greatly appreciated. Cheers.


 

dr_kiuchi

Adding dmesg results; nothing in them looks hardware-related.

Code:
root@truenas[~]# dmesg
Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.1-RELEASE-p9 n245429-296d095698e TRUENAS amd64
FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)
VT(vga): text 80x25
CPU: Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz (3403.50-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306a9  Family=0x6  Model=0x3a  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7fbae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: (disabled in BIOS) PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 17179869184 (16384 MB)
avail memory = 16496459776 (15732 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <ALASKA A M I>
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 hardware threads
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
random: unblocking device.
ioapic0 <Version 2.0> irqs 0-23
Launching APs: 1 2 5 7 4 3 6
random: entropy device external interface
kbd1 at kbdmux0
vtvga0: <VT VGA driver>
aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS>
padlock0: No ACE support.
acpi0: <ALASKA A M I>
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
mps0: <Avago Technologies (LSI) SAS2308> port 0xe000-0xe0ff mem 0xf7d40000-0xf7d4ffff,0xf7d00000-0xf7d3ffff irq 16 at device 0.0 on pci1
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
vgapci0: <VGA-compatible display> port 0xf000-0xf03f mem 0xf7800000-0xf7bfffff,0xe0000000-0xefffffff irq 16 at device 2.0 on pci0
vgapci0: Boot video device
xhci0: <Intel Panther Point USB 3.0 controller> mem 0xf7f00000-0xf7f0ffff irq 16 at device 20.0 on pci0
xhci0: 32 bytes context size, 64-bit DMA
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
usbus0: 5.0Gbps Super Speed USB v3.0
pci0: <simple comms> at device 22.0 (no driver attached)
ehci0: <Intel Panther Point USB 2.0 controller> mem 0xf7f18000-0xf7f183ff irq 16 at device 26.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
usbus1: 480Mbps High Speed USB v2.0
pci0: <multimedia, HDA> at device 27.0 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
pci3: <ACPI PCI bus> on pcib3
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xd000-0xd0ff mem 0xf0104000-0xf0104fff,0xf0100000-0xf0103fff irq 16 at device 0.0 on pci3
re0: Using 1 MSI-X message
re0: Chip rev. 0x2c800000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 94:de:80:62:54:ce
pcib4: <ACPI PCI-PCI bridge> irq 17 at device 28.5 on pci0
pci4: <ACPI PCI bus> on pcib4
re1: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xc000-0xc0ff mem 0xf0004000-0xf0004fff,0xf0000000-0xf0003fff irq 17 at device 0.0 on pci4
re1: Using 1 MSI-X message
re1: Chip rev. 0x2c800000
re1: MAC rev. 0x00100000
miibus1: <MII bus> on re1
rgephy1: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus1
rgephy1:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re1: Using defaults for TSO: 65518/35/2048
re1: Ethernet address: 94:de:80:62:54:8f
pcib5: <ACPI PCI-PCI bridge> irq 18 at device 28.6 on pci0
pci5: <ACPI PCI bus> on pcib5
pci5: <network> at device 0.0 (no driver attached)
ehci1: <Intel Panther Point USB 2.0 controller> mem 0xf7f17000-0xf7f173ff irq 23 at device 29.0 on pci0
usbus2: EHCI version 1.0
usbus2 on ehci1
usbus2: 480Mbps High Speed USB v2.0
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci0: <Intel Panther Point AHCI SATA controller> port 0xf0b0-0xf0b7,0xf0a0-0xf0a3,0xf090-0xf097,0xf080-0xf083,0xf060-0xf07f mem 0xf7f16000-0xf7f167ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
ahciem0: <AHCI enclosure management bridge> on ahci0
acpi_button0: <Power Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_tz1: <Thermal Zone> on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
ichwd0: <Intel Panther Point watchdog timer> on isa0
ichwd0: ICH WDT present but disabled in BIOS or hardware
device_attach: ichwd0 attach returned 6
ichwd0: <Intel Panther Point watchdog timer> at port 0x430-0x437,0x460-0x47f on isa0
ichwd0: ICH WDT present but disabled in BIOS or hardware
device_attach: ichwd0 attach returned 6
superio0: <ITE IT8728 SuperIO (revision 0x01)> at port 0x2e-0x2f on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
Timecounter "TSC-low" frequency 1701672733 Hz quality 1000
Timecounters tick every 1.000 msec
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
ipfw2 (+ipv6) initialized, divert enabled, nat enabled, default to accept, logging disabled
ugen1.1: <Intel EHCI root HUB> at usbus1
ugen0.1: <Intel XHCI root HUB> at usbus0
ugen2.1: <Intel EHCI root HUB> at usbus2
uhub0 on usbus1
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
uhub1 on usbus0
uhub1: <Intel XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
uhub2 on usbus2
uhub2: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus2
Trying to mount root from zfs:boot-pool/ROOT/default []...
ses0 at ahciem0 bus 0 scbus1 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 2.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
da4 at mps0 bus 0 scbus0 target 5 lun 0
da4: <ATA Samsung SSD 850 1B6Q> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number S21CNWAFC19467H
da4: 600.000MB/s transfers
da4: Command Queueing enabled
da4: 953869MB (1953525168 512 byte sectors)
da4: quirks=0x8<4K>
da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: <ATA ST12000VN0008-2Y SC60> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number ZV70GQVJ
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 11444224MB (23437770752 512 byte sectors)
da2 at mps0 bus 0 scbus0 target 3 lun 0
da2: <ATA ST12000VN0008-2Y SC60> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZR906K4V
da2: 600.000MB/s transfers
da2: Command Queueing enabled
da2: 11444224MB (23437770752 512 byte sectors)
da1 at mps0 bus 0 scbus0 target 2 lun 0
da1: <ATA ST12000VN0008-2Y SC60> Fixed Direct Access SPC-4 SCSI device
da1: Serial Number ZRT0SK89
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 11444224MB (23437770752 512 byte sectors)
da5 at mps0 bus 0 scbus0 target 7 lun 0
da5: <ATA ST12000VN0008-2Y SC60> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number ZRT17J56
da5: 600.000MB/s transfers
da5: Command Queueing enabled
da5: 11444224MB (23437770752 512 byte sectors)
da6 at mps0 bus 0 scbus0 target 8 lun 0
da6: <ATA ST12000VN0008-2Y SC60> Fixed Direct Access SPC-4 SCSI device
da6: Serial Number ZRT18M7M
da6: 600.000MB/s transfers
da6: Command Queueing enabled
da6: 11444224MB (23437770752 512 byte sectors)
da3 at mps0 bus 0 scbus0 target 4 lun 0
da3: <ATA WD Blue SA510 2. 6100> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number 23422M802397
da3: 600.000MB/s transfers
da3: Command Queueing enabled
da3: 953869MB (1953525168 512 byte sectors)
uhub1: 8 ports with 8 removable, self powered
uhub2: 2 ports with 2 removable, self powered
uhub0: 2 ports with 2 removable, self powered
ugen2.2: <vendor 0x8087 product 0x0024> at usbus2
uhub3 on uhub2
uhub3: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus2
ugen1.2: <vendor 0x8087 product 0x0024> at usbus1
uhub4 on uhub0
uhub4: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1
Root mount waiting for: usbus1 usbus2
uhub4: 6 ports with 6 removable, self powered
uhub3: 8 ports with 8 removable, self powered
Root mount waiting for: usbus1
ugen1.3: <vendor 0x8087 product 0x07da> at usbus1
ichsmb0: <Intel Panther Point SMBus controller> port 0xf040-0xf05f mem 0xf7f15000-0xf7f150ff irq 18 at device 31.3 on pci0
smbus0: <System Management Bus> on ichsmb0
lo0: link state changed to UP
vmx_modinit: VMX operation disabled by BIOS
module_register_init: MOD_LOAD (vmm, 0xffffffff83604200, 0) error 6
re0: link state changed to UP
re1: link state changed to DOWN
GEOM_MIRROR: Device mirror/swap0 launched (3/3).
GEOM_ELI: Device mirror/swap0.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: accelerated software
CPU: Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz (3403.35-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306a9  Family=0x6  Model=0x3a  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7fbae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS>
  Structured Extended Features3=0x9c000400<MD_CLEAR,IBPB,STIBP,L1DFL,SSBD>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: (disabled in BIOS) PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
re0: link state changed to DOWN
re0: link state changed to UP
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI> TSC/1/64/0x20<REA> IAP/4/48/0x3ff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA,PRC> IAF/3/48/0x67<INT,USR,SYS,REA,WRI>
Security policy loaded: MAC/ntpd (mac_ntpd)
bridge0: Ethernet address: 58:9c:fc:10:ff:8f
bridge0: link state changed to UP
re0: promiscuous mode enabled
epair0a: Ethernet address: 02:54:a6:0d:06:0a
epair0b: Ethernet address: 02:54:a6:0d:06:0b
epair0a: link state changed to UP
epair0b: link state changed to UP
epair0a: changing name to 'vnet0.1'
vnet0.1: promiscuous mode enabled
re0: link state changed to DOWN
re0: link state changed to UP
lo0: link state changed to UP
re0: watchdog timeout
re0: link state changed to DOWN
re0: link state changed to UP
epair1a: Ethernet address: 02:17:50:10:e5:0a
epair1b: Ethernet address: 02:17:50:10:e5:0b
epair1a: link state changed to UP
epair1b: link state changed to UP
epair1a: changing name to 'vnet0.2'
epair1b: changing name to 'epair0b'
vnet0.2: promiscuous mode enabled
lo0: link state changed to UP
vnet0.1: link state changed to DOWN
epair0b: link state changed to DOWN
vnet0.2: link state changed to DOWN
epair0b: link state changed to DOWN
re0: link state changed to DOWN
bridge0: link state changed to DOWN
re0: link state changed to UP
bridge0: link state changed to UP
epair0a: Ethernet address: 02:2d:02:cd:b2:0a
epair0b: Ethernet address: 02:2d:02:cd:b2:0b
epair0a: link state changed to UP
epair0b: link state changed to UP
epair0a: changing name to 'vnet0.3'
vnet0.3: promiscuous mode enabled
re0: link state changed to DOWN
lo0: link state changed to UP
re0: link state changed to UP
vnet0.3: link state changed to DOWN
epair0b: link state changed to DOWN
re0: link state changed to DOWN
bridge0: link state changed to DOWN
re0: link state changed to UP
bridge0: link state changed to UP
epair0a: Ethernet address: 02:98:31:55:b3:0a
epair0b: Ethernet address: 02:98:31:55:b3:0b
epair0a: link state changed to UP
epair0b: link state changed to UP
epair0a: changing name to 'vnet0.4'
vnet0.4: promiscuous mode enabled
re0: link state changed to DOWN
lo0: link state changed to UP
re0: link state changed to UP
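For anyone trying to match a SMART report to a device node: the boot log above prints a `Serial Number` line for each `daN` disk, so the mapping can be pulled out with a few lines of Python. This is only a rough sketch. The function name `disk_serials` is made up for illustration, and the sample is a short excerpt of the log above.

```python
import re

def disk_serials(dmesg_text):
    """Map device names (da0, da1, ...) to serial numbers found in dmesg text."""
    serials = {}
    for line in dmesg_text.splitlines():
        # Matches lines such as "da2: Serial Number ZR906K4V"
        match = re.match(r"^(\w+): Serial Number (\S+)$", line.strip())
        if match:
            serials[match.group(1)] = match.group(2)
    return serials


# Excerpt copied from the boot log above.
SAMPLE = """\
da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: Serial Number ZV70GQVJ
da2 at mps0 bus 0 scbus0 target 3 lun 0
da2: Serial Number ZR906K4V
"""

for device, serial in sorted(disk_serials(SAMPLE).items()):
    print(device, serial)
```

With this, the SMART report for serial ZR906K4V at the top of the thread can be tied to da2.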
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
It seems your damaged files are in a snapshot. Do you have backups, and do you know whether you need anything from that specific snapshot?

You probably cannot delete the files, as they are part of a read-only snapshot. I fear you would need to delete the snapshot itself.

This may solve your problem. You can mount the snapshot via SMB and make an additional backup of just that snapshot, or compare the snapshot to your current dataset and retrieve all changed data.

Then see whether deleting the snapshot resolves your issue. What PSU do you have? Is the hardware in your signature current? Is your controller in IT mode? What power connectors do you use? Any Y-splitters?

If you do not understand the implications of deleting that snapshot, ask here for clarification before you proceed.
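To see which of the damaged entries live in a snapshot and which are live files, the "Permanent errors" list from `zpool status -v` can be split by its path format: snapshot entries are printed as `pool/dataset@snapshot:/path`. Below is a minimal Python sketch of that idea. The helper name `split_error_paths`, the dataset name `media` and both file paths are invented for illustration. Only the pool name and the `@manual` snapshot name come from this thread.

```python
def split_error_paths(status_text):
    """Split the permanent-error list of `zpool status -v` into
    snapshot entries and everything else."""
    snapshot_entries, other_entries = [], []
    in_list = False
    for line in status_text.splitlines():
        if line.startswith("errors: Permanent errors"):
            in_list = True
            continue
        if not in_list:
            continue
        entry = line.strip()
        if not entry:
            continue
        # Snapshot entries look like pool/dataset@snapshot:/path/to/file.
        head = entry.split(":", 1)[0]
        if "@" in head and not entry.startswith("/"):
            snapshot_entries.append(entry)
        else:
            # Plain paths and object-ID entries end up here.
            other_entries.append(entry)
    return snapshot_entries, other_entries


# Hypothetical sample; dataset and file names are assumptions.
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

        KiuchiPool/media@manual:/movies/example.mkv
        /mnt/KiuchiPool/media/plex/library.db
"""

snaps, others = split_error_paths(SAMPLE)
print("in snapshots:", snaps)
print("elsewhere:", others)
```

If every damaged entry is under a snapshot, deleting that snapshot is the only way to clear it. Entries outside snapshots point at files that can be deleted or restored individually.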
 

dr_kiuchi

Cadet
Joined
Mar 17, 2024
Messages
8
Thanks @chuck32. I did have a suspicion, given that the zpool status output referenced @manual. I have removed the snapshot and am scrubbing again. Fingers crossed.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Do a memtest afterwards for 24 hours and answer the questions about your power supply.

I suspect you may have a hardware problem going on. RAIDZ2 should have been able to repair the errors during a scrub, if I'm not completely mistaken.
 

dr_kiuchi

Cadet
Joined
Mar 17, 2024
Messages
8
Thanks again @chuck32 - I totally overlooked the question about the PSU. It's a Corsair 650 - no splitters, only direct connections.
The hardware in the signature is current (although dated: I bought it all about a decade ago, aside from the new drives and controller), so there's that.
As for the controller, good point. I need to check the BIOS after the scrub is complete. I think it is in IT mode but need to verify.

Will report back, and thanks again for the guidance, much appreciated.
 

dr_kiuchi

Cadet
Joined
Mar 17, 2024
Messages
8
Quick update:
  • Confirmed the controller is IT mode, updated the firmware.
  • Deleted the snapshot, rescrubbed, and ran zpool status -v. This time more files were found corrupted (the media files themselves that the snapshot referenced), including my Plex library DB. Not worried - can replace.
    • Attempted to restore the files from my old pool, and one of the disks literally caught fire. Fun times. (Note: done in a clean space with an anti-static wristband and precautions.) This was a first in my 3 decades of IT.
      • IMG_1772.jpeg
  • Currently deleting corrupt files and rescrubbing. After this I will run memtest and potentially replace hardware.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
| This time more files found corrupted, (the direct media that was referenced in the snapshot) including my plex library db. Not worried - can replace
In another snapshot this time? That's why I suspect a hardware problem: a scrub should have been able to repair the errors, and now there are more.
| Attempted to restore the files with my old pool, and one of the disks literally caught on fire. Fun times.
What the heck?!

Was this connected to your server you currently use?

I've only been dabbling with computers for almost 2 decades, and I haven't seen that one either. I'm curious what others will say, but I followed your PSU link. It was released in 2011; is it safe to assume you've had it for about 10 years? I would think about retiring it proactively. It may not be related at all, but that's what I would do.

So to summarize, including that drive, you had 3 drives fail on you in a short time?
Can you explain the heat issues from your OP a bit more?
 

dr_kiuchi

Cadet
Joined
Mar 17, 2024
Messages
8
| Was this connected to your server you currently use?
  • Yep - fortunately nothing else (seemingly?) was affected based on smoke tests so far.
| In another snapshot this time?
  • Not the snapshot, but the raw media files themselves that the previous status referenced.
| I would think about retiring it proactively.
  • +100 on upgrading at this point, looking into options. Was happy with how long everything lasted but at this juncture thinking it's more of a headache to upkeep vs. starting fresh.
In sum: I initially had 2/5 drives that were degraded due to heat issues when migrating to a larger pool. More specifically, it was the 190 Airflow_Temperature_Cel 0x0022 attribute that showed up as a failure. I added some fans to the rig and monitored the temperature, which seems to be fine.
I replaced both and resilvered, and upon scrubbing, ended up with all drives in a degraded state, which was completely surprising.

Currently scrubbing again. Will see if it makes a difference, otherwise looking to pretty much replace everything. And on top of that move from CORE to SCALE, which I expect will be even more fun.
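For reference, an attribute like 190 Airflow_Temperature_Cel counts as failed when the WHEN_FAILED column of the `smartctl -A` table shows something other than `-` (typically `FAILING_NOW` or `In_the_past`). A rough Python sketch for picking those rows out follows. The function name `failed_attributes` and all sample values are hypothetical and were not taken from the drives in this thread.

```python
def failed_attributes(smart_text):
    """Return (id, name, when_failed) for SMART attribute rows whose
    WHEN_FAILED column is not '-'."""
    failed = []
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        when_failed = fields[8]
        if when_failed != "-":
            failed.append((int(fields[0]), fields[1], when_failed))
    return failed


# Hypothetical sample rows in the smartctl -A layout; values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   058   038   040    Old_age   Always   In_the_past 42 (Min/Max 30/62 #5)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

for attr_id, name, when in failed_attributes(SAMPLE):
    print(attr_id, name, when)
```

An `In_the_past` entry means the normalized value dropped to or below the threshold at some point, even if the current reading looks fine after adding fans.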
 

dr_kiuchi

Cadet
Joined
Mar 17, 2024
Messages
8
Update:
Removed all of the files with permanent errors and rescrubbed. Finally making some progress: no errors this time (and no new fires). Running memtest now. Will close the thread if all looks good.

Code:
root@truenas[~]# zpool status -v
  pool: KiuchiPool
 state: ONLINE
  scan: scrub repaired 344K in 03:42:45 with 0 errors on Thu Mar 21 01:44:04 2024
config:

    NAME                                            STATE     READ WRITE CKSUM
    KiuchiPool                                      ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/7c875ab1-db1a-11ee-abaf-94de806254ce  ONLINE       0     0     0
        gptid/013ed923-e0df-11ee-ad17-94de806254ce  ONLINE       0     0     0
        gptid/7cb37fd5-db1a-11ee-abaf-94de806254ce  ONLINE       0     0     0
        gptid/7cbedb7c-db1a-11ee-abaf-94de806254ce  ONLINE       0     0     0
        gptid/ed064fd0-e0de-11ee-ad17-94de806254ce  ONLINE       0     0     0

errors: No known data errors
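A status like the one above can also be checked programmatically, for example from a cron job. The following is a minimal Python sketch, assuming the config table keeps the NAME / STATE / READ / WRITE / CKSUM column order shown here. The function name `unhealthy_devices` is made up, and the sample is an abridged copy of the output above.

```python
def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum) for every config row that is
    not ONLINE or has a non-zero error counter."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("NAME"):
            in_config = True  # header row of the config table
            continue
        if stripped.startswith("errors:"):
            in_config = False  # table is over
        if not in_config or not stripped:
            continue
        fields = stripped.split()
        if len(fields) < 5:
            continue  # e.g. spare rows without counters
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append((name, state, read, write, cksum))
    return problems


# Abridged copy of the zpool status output above (two of the five disks).
SAMPLE = """\
  pool: KiuchiPool
 state: ONLINE
  scan: scrub repaired 344K in 03:42:45 with 0 errors on Thu Mar 21 01:44:04 2024
config:

    NAME                                            STATE     READ WRITE CKSUM
    KiuchiPool                                      ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/7c875ab1-db1a-11ee-abaf-94de806254ce  ONLINE       0     0     0
        gptid/013ed923-e0df-11ee-ad17-94de806254ce  ONLINE       0     0     0

errors: No known data errors
"""

print(unhealthy_devices(SAMPLE))
```

An empty list means every row is ONLINE with zero READ/WRITE/CKSUM counters. The counters are compared as strings, so abbreviated values such as `1.2K` are still flagged.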
 

dr_kiuchi

Cadet
Joined
Mar 17, 2024
Messages
8
Update and closure:
  • I removed the files that showed permanent errors, rescrubbed 2x, everything came up green.
  • I ran memtest with no issues.
  • Migrated to Cobia
  • Still have some BIOS issues (not related to TrueNAS) with it booting from the wrong disk, and at this point I plan to upgrade the rest of the system hardware.
Confirmed system stabilization, errors cleared, and the pool is in a normal status again. Kudos for the feedback @chuck32, thank you.

Kind Regards -
 