I keep getting read/write errors on multiple drives in my TrueNAS Scale setup.
Really doing my head in now. I've replaced pretty much every component trying to solve this.
Starting to think there's something wrong with TrueNAS Scale itself, possibly when working with large-capacity drives. Per here.
Anyway, in the interest of fault-finding this, here is my info:
------------------------------------------------
Setup
OS: TrueNAS-SCALE-22.02.3
CPU: Intel(R) Xeon(R) W-2223 CPU @ 3.60GHz
MB: Supermicro MBD-X11SRL-F-O
RAM: 128GB (4x) Crucial Technology 32GB DDR4 PC4-21300 2666MHz RDIMM CT32G4RFD4266 ECC
HDDs: 6x WD Gold Enterprise Class SATA HDD 16TB
ZFS: RAIDZ2 with the above drives
HBA: LSI 9208-8i 6Gbps SAS PCIe 3.0 HBA, P20 IT mode (has a Noctua 40mm fan on the heatsink)
Expander: Intel RAID Expander RES2SV240 (has a Noctua 40mm fan on the heatsink)
Cache: 512GB Samsung 960 NVMe
PSU: Corsair HX1200
------------------------------------------------
What happens?
Every few days, I wake up to a bunch of read/write errors on multiple drives, resulting in a degraded pool. For example, here's the email alert I woke up to this morning:
Code:
Pool storage-pool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk WDC_WD161KRYZ-01AGBB0 3XH0DXRT is DEGRADED
Disk WDC_WD161KRYZ-01AGBB0 3FKEJZZT is FAULTED
Disk WDC_WD161KRYZ-01AGBB0 2NGUUX8H is FAULTED
I usually just zpool clear the array and let it resilver, and that sometimes generates checksum errors on multiple drives too.
I think it's related to heavy IO activity, as on the nights it happens I'm usually moving a bunch of data or doing something intensive.
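For reference, this is roughly what I run when it happens (storage-pool is the pool from the status output further down; I then watch the resilver and the per-disk error counters):
Code:
# clear the error counters and let ZFS resilver/repair
zpool clear storage-pool

# watch resilver progress and per-disk READ/WRITE/CKSUM counters
zpool status -v storage-pool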
------------------------------------------------
What have you tried?
I've replaced literally every hardware component except the case and HDDs at this point:
1- I was running some leftover consumer parts (CPU/MB/non-ECC RAM), which I figured might be causing the issue, so I replaced them with the parts noted above.
2- I've tried with and without the RAID expander card.
3- I've tried adding additional 40mm fans onto the HBA and expander card (plus additional 140mm fans inside the case), in case heat was the issue.
4- I've replaced every SAS and power cable, including those SAS cables between the HBA and expander, as well as those going to the drives.
5- I even took advantage of a return window to replace the PSU.
6- The NVMe cache was only added to the pool very recently, so I know that isn't the issue.
7- Oh, and yes, I've tried installing TrueNAS Scale fresh too.
So, at this point, I feel like something is wrong either with the drives, which I can't see in smartctl (the drives are all only 2 months old at this point), or with TrueNAS itself.
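For what it's worth, this is roughly what I've been looking at per drive in smartctl (nothing obvious to me; /dev/sdc is just an example device name):
Code:
# SMART health, attributes, and error/self-test logs for one drive
smartctl -a /dev/sdc

# extended output, which also includes device statistics and
# (on SATA drives) phy/link event counters
smartctl -x /dev/sdc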
------------------------------------------------
Here are the data protection tasks I have scheduled:
Code:
scrub tasks:
  pool (0 0 * mon)
smart tests:
  all disks SHORT (0 0 * * tue,thu,sat)
  all disks LONG (0 0 * * sun)
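Those are all scheduled through the TrueNAS UI; if it matters, the rough manual equivalents would be something like this (example device only):
Code:
# kick off a long SMART self-test on one drive, then check the result later
smartctl -t long /dev/sdc
smartctl -l selftest /dev/sdc

# manual scrub of the pool
zpool scrub storage-pool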
------------------------------------------------
dmesg
Getting a heap of these:
Code:
[822301.399301] sd 0:0:1:0: attempting task abort!scmd(0x00000000881cddaf), outstanding for 31820 ms & timeout 30000 ms
[822301.410926] sd 0:0:1:0: [sdc] tag#8578 CDB: Read(16) 88 00 00 00 00 01 bc d5 bb 70 00 00 08 00 00 00
[822301.410928] scsi target0:0:1: handle(0x000b), sas_address(0x5001e677bd4c2fe9), phy(9)
[822301.410930] scsi target0:0:1: enclosure logical id(0x5001e677bd4c2fff), slot(9)
[822301.410932] sd 0:0:1:0: No reference found at driver, assuming scmd(0x00000000881cddaf) might have completed
[822301.410934] sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000881cddaf)
[822301.457006] sd 0:0:1:0: [sdc] tag#8578 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[822301.467936] sd 0:0:1:0: [sdc] tag#8578 CDB: Read(16) 88 00 00 00 00 01 bc d5 bb 70 00 00 08 00 00 00
[822301.467939] blk_update_request: I/O error, dev sdc, sector 7463091056 op 0x0:(READ) flags 0x700 phys_seg 32 prio class 0
[822301.467944] zio pool=storage-pool vdev=/dev/disk/by-partuuid/89b557d9-cc55-463c-9124-3765cff9aaac error=5 type=1 offset=3818955071488 size=1048576 flags=40080c80
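In case it helps match those zio errors to a physical drive, the partuuid from the last line can be resolved to a device and serial like this (device names here are just how it resolves on my box):
Code:
# resolve the vdev partuuid from the zio error line to a block device
ls -l /dev/disk/by-partuuid/89b557d9-cc55-463c-9124-3765cff9aaac

# then match that device back to a drive model and serial number
lsblk -o NAME,MODEL,SERIAL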
------------------------------------------------
zpool status (after running a clear on the pool this morning):
Code:
  pool: storage-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 428G in 00:38:29 with 0 errors on Mon Sep 5 11:54:58 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage-pool                              ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            1766c859-9231-4127-ad35-882937845b76  ONLINE       0     0     0
            89b557d9-cc55-463c-9124-3765cff9aaac  ONLINE       0     0     0
            4a07f2eb-442c-430d-9db6-0e5fd80302a9  ONLINE       0     0     2
            c6287e21-49e0-4ffe-86a1-78a33bbbec70  ONLINE       0     0     0
            1c817e68-35de-4b8c-be59-4e98fc1fc9fe  ONLINE       0     0     2
            28dba822-3f63-4e2c-82ad-8e4467bc81a1  ONLINE       0     0     0
        cache
          f5456732-9153-44fb-977b-fe5f1021b959    ONLINE       0     0     0

errors: No known data errors
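If a longer error history would help, I can also dump the ZFS event log, something like:
Code:
# recent ZFS error/fault events with full detail
zpool events -v storage-pool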