I keep getting read/write errors on multiple drives in my TrueNAS Scale setup.
Really doing my head in now. I've replaced pretty much every component trying to solve this.
Starting to think there's something wrong with TrueNAS Scale itself, possibly when working with large-capacity drives. Per here.
Anyway, in the interest of fault-finding this, here is my info:
------------------------------------------------
Setup
OS: TrueNAS-SCALE-22.02.3
CPU: Intel(R) Xeon(R) W-2223 CPU @ 3.60GHz
MB: Supermicro MBD-X11SRL-F-O
RAM: 128GB (4x) Crucial Technology 32GB DDR4 PC4-21300 2666MHz RDIMM CT32G4RFD4266 ECC
HDDs: 6x WD Gold Enterprise Class SATA HDD 16TB
ZFS: RAIDZ2 with the above drives
HBA: LSI 9208-8i 6Gbps SAS PCIe 3.0 HBA, P20 IT mode (has a Noctua 40mm fan on the heatsink)
Expander: Intel RAID Expander RES2SV240 (has a Noctua 40mm fan on the heatsink)
Cache: 512GB Samsung 960 NVMe
PSU: Corsair HX1200
------------------------------------------------
What happens?
Every few days, I wake up to a bunch of read/write errors on multiple drives, resulting in a degraded pool. For example, here's the email alert I woke up to this morning:
Code:
Pool storage-pool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk WDC_WD161KRYZ-01AGBB0 3XH0DXRT is DEGRADED
Disk WDC_WD161KRYZ-01AGBB0 3FKEJZZT is FAULTED
Disk WDC_WD161KRYZ-01AGBB0 2NGUUX8H is FAULTED
I usually just zpool clear the array and let it resilver, and that sometimes generates checksum errors on multiple drives too.
I think it's related to heavy IO activity, as on the nights it happens I'm usually moving a bunch of data or doing something intensive.
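For reference, this is roughly what I run when it happens (storage-pool is the pool from the status output further down; I then watch the resilver and the per-disk error counters):
Code:
# clear the error counters and let ZFS resilver/repair
zpool clear storage-pool

# watch resilver progress and per-disk READ/WRITE/CKSUM counters
zpool status -v storage-pool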
------------------------------------------------
What have you tried?
I've replaced literally every hardware component except the case and HDDs at this point:
1- I was running some leftover consumer parts (CPU/MB/non-ECC RAM), which I figured might be causing the issue, so I replaced them with the parts noted above.
2- I've tried with and without the RAID expander card.
3- I've tried adding additional 40mm fans onto the HBA and expander card (plus additional 140mm fans inside the case), in case heat was the issue.
4- I've replaced every SAS and power cable, including those SAS cables between the HBA and expander, as well as those going to the drives.
5- I even took advantage of a return window to replace the PSU.
6- The NVMe cache was only added to the pool very recently, so I know that isn't the issue.
7- Oh, and yes, I've tried installing TrueNAS Scale fresh too.
So, at this point, I feel like something is wrong either with the drives, which I can't see in smartctl (the drives are all only 2 months old at this point), or with TrueNAS itself.
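For what it's worth, this is roughly what I've been looking at per drive in smartctl (nothing obvious to me; /dev/sdc is just an example device name):
Code:
# SMART health, attributes, and error/self-test logs for one drive
smartctl -a /dev/sdc

# extended output, which also includes device statistics and
# (on SATA drives) phy/link event counters
smartctl -x /dev/sdc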
------------------------------------------------
Here are the data protection tasks I have scheduled:
Code:
scrub tasks:
  pool (0 0 * mon)
smart tests:
  all disks SHORT (0 0 * * tue,thu,sat)
  all disks LONG (0 0 * * sun)
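Those are all scheduled through the TrueNAS UI; if it matters, the rough manual equivalents would be something like this (example device only):
Code:
# kick off a long SMART self-test on one drive, then check the result later
smartctl -t long /dev/sdc
smartctl -l selftest /dev/sdc

# manual scrub of the pool
zpool scrub storage-pool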
------------------------------------------------
dmesg
Getting a heap of these:
Code:
[822301.399301] sd 0:0:1:0: attempting task abort!scmd(0x00000000881cddaf), outstanding for 31820 ms & timeout 30000 ms
[822301.410926] sd 0:0:1:0: [sdc] tag#8578 CDB: Read(16) 88 00 00 00 00 01 bc d5 bb 70 00 00 08 00 00 00
[822301.410928] scsi target0:0:1: handle(0x000b), sas_address(0x5001e677bd4c2fe9), phy(9)
[822301.410930] scsi target0:0:1: enclosure logical id(0x5001e677bd4c2fff), slot(9)
[822301.410932] sd 0:0:1:0: No reference found at driver, assuming scmd(0x00000000881cddaf) might have completed
[822301.410934] sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000881cddaf)
[822301.457006] sd 0:0:1:0: [sdc] tag#8578 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[822301.467936] sd 0:0:1:0: [sdc] tag#8578 CDB: Read(16) 88 00 00 00 00 01 bc d5 bb 70 00 00 08 00 00 00
[822301.467939] blk_update_request: I/O error, dev sdc, sector 7463091056 op 0x0:(READ) flags 0x700 phys_seg 32 prio class 0
[822301.467944] zio pool=storage-pool vdev=/dev/disk/by-partuuid/89b557d9-cc55-463c-9124-3765cff9aaac error=5 type=1 offset=3818955071488 size=1048576 flags=40080c80
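In case it helps match those zio errors to a physical drive, the partuuid from the last line can be resolved to a device and serial like this (device names here are just how it resolves on my box):
Code:
# resolve the vdev partuuid from the zio error line to a block device
ls -l /dev/disk/by-partuuid/89b557d9-cc55-463c-9124-3765cff9aaac

# then match that device back to a drive model and serial number
lsblk -o NAME,MODEL,SERIAL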
------------------------------------------------
zpool status (after running a clear on the pool this morning):
Code:
  pool: storage-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 428G in 00:38:29 with 0 errors on Mon Sep 5 11:54:58 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage-pool                              ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            1766c859-9231-4127-ad35-882937845b76  ONLINE       0     0     0
            89b557d9-cc55-463c-9124-3765cff9aaac  ONLINE       0     0     0
            4a07f2eb-442c-430d-9db6-0e5fd80302a9  ONLINE       0     0     2
            c6287e21-49e0-4ffe-86a1-78a33bbbec70  ONLINE       0     0     0
            1c817e68-35de-4b8c-be59-4e98fc1fc9fe  ONLINE       0     0     2
            28dba822-3f63-4e2c-82ad-8e4467bc81a1  ONLINE       0     0     0
        cache
          f5456732-9153-44fb-977b-fe5f1021b959    ONLINE       0     0     0

errors: No known data errors
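If a longer error history would help, I can also dump the ZFS event log, something like:
Code:
# recent ZFS error/fault events with full detail
zpool events -v storage-pool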