Uncorrectable I/O failure while resilvering

Alibuba

Cadet
Joined
Jan 30, 2021
Messages
6
Hi,

I'm running TrueNAS 12 (not sure of the exact release, but uname -a says 12.2-RELEASE-p12).

TrueNAS is running in a VM hosted by ESXi. The disks (12 x 3TB WD30EFRX and 4 x 4TB WD40EFRX) are connected to LSI 2008 HBAs, which are passed through to TrueNAS. Twelve of the disks are configured as a single raidz3 pool. The rest of the disks are spares in the pool.

Everything has been running without a hitch for a couple of years.

Yesterday an active disk in my raidz3 pool failed a SMART test. I detached one of the spares from the pool, took the failed drive offline, and replaced it with the detached spare, and resilvering began as expected.

However, in the morning the pool encountered an I/O failure and the resilvering process was halted.

I'm unable to get timestamps from dmesg output, but from what I can gather, it looks like there was (or is) an issue with one of the LSI HBAs.

Here's (probably) the relevant output from dmesg

Code:
    (da3:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 136 Command timeout on target 2(0x000d) 60000 set, 60.67265958 elapsed
mps0: Sending abort to target 2 for SMID 136
    (da3:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 136 Aborting command 0xfffffe00e3d0b6c0
    (da16:mps1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 795 Command timeout on target 7(0x0009) 60000 set, 60.67633423 elapsed
mps1: Sending abort to target 7 for SMID 795
    (da16:mps1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 795 Aborting command 0xfffffe00e4042c48
    (da5:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 257 Command timeout on target 4(0x000b) 60000 set, 60.67919296 elapsed
mps0: Sending abort to target 4 for SMID 257
    (da5:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 257 Aborting command 0xfffffe00e3d15958
    (da6:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1644 Command timeout on target 5(0x0010) 60000 set, 60.68200370 elapsed
mps0: Sending abort to target 5 for SMID 1644
    (da6:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1644 Aborting command 0xfffffe00e3d8a120
    (da12:mps1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1108 Command timeout on target 3(0x000d) 60000 set, 60.68640428 elapsed
mps1: Sending abort to target 3 for SMID 1108
    (da12:mps1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1108 Aborting command 0xfffffe00e405d0e0
mps1: mpssas_action_scsiio: Freezing devq for target ID 3
(da12:mps1:0:3:0): READ(10). CDB: 28 00 bb 9d 05 d0 00 00 08 00
(da12:mps1:0:3:0): CAM status: CAM subsystem is busy
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 5 SMID 1343 loginfo 31130000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 329 loginfo 31130000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 881 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 1414 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 251 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 122 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 795 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 2083 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 5 SMID 1114 loginfo 31140000
(da6:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
mps0: (da6:mps0:0:5:0): CAM status: Command timeout
(da6:mps0:0:5:0): Retrying command, 0 more tries remain
Controller reported scsi ioc terminated tgt 5 SMID 2019 loginfo 31140000
(da6:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
mps0: (da6:mps0:0:5:0): CAM status: CCB request completed with an error
Finished abort recovery for target 5
(da6:mps0:0:5:0): Retrying command, 0 more tries remain
mps0: Unfreezing devq for target ID 5
(da6:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 0 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 7c c8 30 00 00 00 28 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 7c ab 38 00 00 00 08 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 7c a1 f0 00 00 00 08 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 7a 67 c8 00 00 00 08 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 7b c0 f0 00 00 00 08 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 78 e1 50 00 00 00 08 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 7c f2 88 00 00 00 10 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
(da6:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 0a 78 54 d8 00 00 00 08 00 00
(da6:mps0:0:5:0): CAM status: CCB request completed with an error
(da6:mps0:0:5:0): Retrying command, 3 more tries remain
mps1: Controller reported scsi ioc terminated tgt 7 SMID 936 loginfo 31130000
mps1: (da16:mps1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da16:mps1:0:7:0): CAM status: Command timeout
Controller reported scsi ioc terminated tgt 7 SMID 1762 loginfo 31130000
(da16:mps1:0:7:0): Retrying command, 0 more tries remain
mps1: Controller reported scsi ioc terminated tgt 7 SMID 296 loginfo 31140000
mps1: (da16:mps1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Controller reported scsi ioc terminated tgt 7 SMID 1690 loginfo 31140000
(da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 0 more tries remain
mps1: Controller reported scsi ioc terminated tgt 7 SMID 1004 loginfo 31140000
(da16:mps1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
mps1: Controller reported scsi ioc terminated tgt 7 SMID 832 loginfo 31140000
(da16:mps1:0:7:0): CAM status: CCB request completed with an error
mps1: (da16:mps1:0:7:0): Retrying command, 0 more tries remain
Controller reported scsi ioc terminated tgt 7 SMID 1709 loginfo 31140000
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 21 b0 78 00 00 00 40 00 00
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 7 SMID 653 loginfo 31140000
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 20 a1 90 00 00 00 08 00 00
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 7 SMID 576 loginfo 31140000
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 23 8b 68 00 00 00 08 00 00
mps1: Controller reported scsi ioc terminated tgt 7 SMID 1270 loginfo 31140000
mps1: Finished abort recovery for target 7
mps1: Unfreezing devq for target ID 7
mps1: Controller reported scsi ioc terminated tgt 3 SMID 388 loginfo 31130000
mps1: Controller reported scsi ioc terminated tgt 3 SMID 974 loginfo 31130000
mps1: Controller reported scsi ioc terminated tgt 3 SMID 1267 loginfo 31140000
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 3 SMID 594 loginfo 31140000
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 23 cd 78 00 00 00 08 00 00
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 3 SMID 1624 loginfo 31140000
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 21 3c d0 00 00 00 30 00 00
mps1: Controller reported scsi ioc terminated tgt 3 SMID 1437 loginfo 31140000
mps1: Controller reported scsi ioc terminated tgt 3 SMID 780 loginfo 31140000
mps1: Controller reported scsi ioc terminated tgt 3 SMID 596 loginfo 31140000
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 3 SMID 346 loginfo 31140000
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 23 8f 28 00 00 00 08 00 00
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
Finished abort recovery for target 3
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 23 a2 a8 00 00 00 08 00 00
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
Unfreezing devq for target ID 3
(da16:mps1:0:7:0): READ(16). CDB: 88 00 00 00 00 01 04 23 dd 40 00 00 00 10 00 00
(da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
(da12:mps1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:mps1:0:3:0): CAM status: Command timeout
(da12:mps1:0:3:0): Retrying command, 0 more tries remain
(da12:mps1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 0 more tries remain
(da12:mps1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 0 more tries remain
(da12:mps1:0:3:0): READ(10). CDB: 28 00 bb 9c dd f8 00 00 08 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
(da12:mps1:0:3:0): READ(10). CDB: 28 00 a1 52 67 c0 00 00 08 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
(da12:mps1:0:3:0): READ(10). CDB: 28 00 bb 9c 9d c8 00 00 08 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
(da12:mps1:0:3:0): READ(10). CDB: 28 00 bb 9c ee 30 00 00 28 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
(da12:mps1:0:3:0): READ(10). CDB: 28 00 73 cb 99 f8 00 00 08 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
(da12:mps1:0:3:0): READ(10). CDB: 28 00 73 d3 70 c8 00 00 08 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
(da12:mps1:0:3:0): READ(10). CDB: 28 00 bb 9c dc 78 00 00 08 00
(da12:mps1:0:3:0): CAM status: CCB request completed with an error
(da12:mps1:0:3:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 4 SMID 1409 loginfo 31130000
mps0: Controller reported scsi ioc terminated tgt 4 SMID 274 loginfo 31130000
mps0: Controller reported scsi ioc terminated tgt 4 SMID 496 loginfo 31140000
(da5:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
mps0: (da5:mps0:0:4:0): CAM status: Command timeout
(da5:mps0:0:4:0): Retrying command, 0 more tries remain
Controller reported scsi ioc terminated tgt 4 SMID 973 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 4 SMID 1196 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 4 SMID 832 loginfo 31140000
mps0: (da5:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Controller reported scsi ioc terminated tgt 4 SMID 1016 loginfo 31140000
(da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 0 more tries remain
(da5:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
mps0: Controller reported scsi ioc terminated tgt 4 SMID 1245 loginfo 31140000
(da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 0 more tries remain
(da5:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 01 00 7e 29 08 00 00 00 08 00 00
mps0: (da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 4 SMID 1586 loginfo 31140000
(da5:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 01 00 7f 89 70 00 00 00 08 00 00
mps0: Controller reported scsi ioc terminated tgt 4 SMID 570 loginfo 31140000
mps0: Finished abort recovery for target 4
mps0: Unfreezing devq for target ID 4
mps0: (da5:mps0:0:4:0): CAM status: CCB request completed with an error
Controller reported scsi ioc terminated tgt 2 SMID 492 loginfo 31130000
(da5:mps0:0:4:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 2 SMID 1994 loginfo 31130000
(da5:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 01 00 7e e7 88 00 00 00 08 00 00
mps0: (da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 3 more tries remain
(da5:mps0:0:4:0): READ(10). CDB: 28 00 ff 79 ae 68 00 00 08 00
Controller reported scsi ioc terminated tgt 2 SMID 1545 loginfo 31140000
(da5:mps0:0:4:0): CAM status: CCB request completed with an error
mps0: (da5:mps0:0:4:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 2 SMID 1988 loginfo 31140000
(da5:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 01 00 7f 7e 80 00 00 00 08 00 00
mps0: (da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 2 SMID 643 loginfo 31140000
(da5:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 01 00 7f 83 c8 00 00 00 08 00 00
mps0: Controller reported scsi ioc terminated tgt 2 SMID 591 loginfo 31140000
(da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 3 more tries remain
(da5:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 01 00 7f 4f 88 00 00 00 10 00 00
mps0: (da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 3 more tries remain
Controller reported scsi ioc terminated tgt 2 SMID 1500 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 2 SMID 2081 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 2 SMID 850 loginfo 31140000
mps0: Controller reported scsi ioc terminated tgt 2 SMID 1057 loginfo 31140000
mps0: Finished abort recovery for target 2
mps0: Unfreezing devq for target ID 2
(da5:mps0:0:4:0): READ(16). CDB: 88 00 00 00 00 01 00 7f 9b 88 00 00 00 10 00 00
(da5:mps0:0:4:0): CAM status: CCB request completed with an error
(da5:mps0:0:4:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:2:0): CAM status: Command timeout
(da3:mps0:0:2:0): Retrying command, 0 more tries remain
(da3:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 0 more tries remain
(da3:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 0 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 21 8b 48 00 00 00 08 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 23 57 80 00 00 00 10 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 20 b6 80 00 00 00 08 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 1e 36 90 00 00 00 08 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 23 6c f8 00 00 00 08 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 23 8b 68 00 00 00 08 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 23 4b 30 00 00 00 08 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da3:mps0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 04 23 74 c8 00 00 00 08 00 00
(da3:mps0:0:2:0): CAM status: CCB request completed with an error
(da3:mps0:0:2:0): Retrying command, 3 more tries remain
(da16:mps1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da16:mps1:0:7:0): CAM status: SCSI Status Error
(da16:mps1:0:7:0): SCSI status: Check Condition
(da16:mps1:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da16:mps1:0:7:0): Error 6, Retries exhausted
(da16:mps1:0:7:0): Invalidating pack
(da12:mps1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:mps1:0:3:0): CAM status: SCSI Status Error
(da12:mps1:0:3:0): SCSI status: Check Condition
(da12:mps1:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da12:mps1:0:3:0): Error 6, Retries exhausted
(da12:mps1:0:3:0): Invalidating pack
(da5:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:mps0:0:4:0): CAM status: SCSI Status Error
(da5:mps0:0:4:0): SCSI status: Check Condition
(da6:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:mps0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps0:0:5:0): CAM status: SCSI Status Error
(da5:mps0:0:4:0): Error 6, Retries exhausted
(da6:mps0:0:5:0): SCSI status: Check Condition
(da6:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da5:mps0:0:4:0): Invalidating pack
(da6:mps0:0:5:0): Error 6, Retries exhausted
(da6:mps0:0:5:0): Invalidating pack
(da3:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:2:0): CAM status: SCSI Status Error
(da3:mps0:0:2:0): SCSI status: Check Condition
(da3:mps0:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mps0:0:2:0): Error 6, Retries exhausted
(da3:mps0:0:2:0): Invalidating pack
Solaris: WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.
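
With a log this long it can help to tally the errors per disk and per controller before drawing conclusions. The following is a minimal sketch in Python, not a definitive tool: it assumes the log lines keep the "(daN:mpsM:bus:target:lun):" prefix shown above, and the names SAMPLE_LOG, summarize and dropped_disks are made up for illustration. The sample only contains a few lines copied from the excerpt.

```python
import re
from collections import Counter

# A handful of lines copied from the dmesg excerpt above. Paste the full log
# into this string (or read it from a file) to tally everything.
SAMPLE_LOG = """\
(da3:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:2:0): CAM status: Command timeout
(da3:mps0:0:2:0): Retrying command, 0 more tries remain
mps1: (da16:mps1:0:7:0): CAM status: CCB request completed with an error
(da16:mps1:0:7:0): Retrying command, 3 more tries remain
(da16:mps1:0:7:0): Invalidating pack
(da3:mps0:0:2:0): Invalidating pack
"""

# Finds the CAM peripheral prefix anywhere in the line, because interleaved
# kernel output sometimes puts a stray "mpsM:" in front of it.
PERIPH = re.compile(r"\((da\d+):(mps\d+):\d+:\d+:\d+\):\s*(.*)")

RAIDZ3_PARITY = 3  # a raidz3 vdev tolerates at most three missing members


def summarize(log_text):
    """Map (controller, disk) to its CAM status counts and a dropped flag."""
    stats = {}
    for line in log_text.splitlines():
        match = PERIPH.search(line)
        if match is None:
            continue
        disk, controller, message = match.groups()
        entry = stats.setdefault(
            (controller, disk), {"statuses": Counter(), "dropped": False}
        )
        if message.startswith("CAM status:"):
            entry["statuses"][message.split(":", 1)[1].strip()] += 1
        elif message.startswith("Invalidating pack"):
            # The da driver gave up on the disk after exhausting retries.
            entry["dropped"] = True
    return stats


def dropped_disks(stats):
    """Return the names of disks that reached "Invalidating pack"."""
    return sorted(disk for (_ctrl, disk), entry in stats.items() if entry["dropped"])


if __name__ == "__main__":
    stats = summarize(SAMPLE_LOG)
    for (controller, disk), entry in sorted(stats.items()):
        flag = "DROPPED" if entry["dropped"] else ""
        print(controller, disk, dict(entry["statuses"]), flag)
    dropped = dropped_disks(stats)
    print(f"{len(dropped)} disk(s) dropped; raidz3 parity is {RAIDZ3_PARITY}")
```

In the full excerpt above, five disks (da3, da5, da6, da12 and da16), spread over both mps0 and mps1, end with "Invalidating pack". If all five are members of the raidz3 vdev, that is more than the three missing members raidz3 can tolerate, which would be consistent with the pool being suspended.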


zpool status -v output is

Code:
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Aug  7 22:29:40 2022
    15.2T scanned at 180M/s, 5.18T issued at 61.4M/s, 25.3T total
    437G resilvered, 20.49% done, 3 days 23:29:26 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz3-0                                      ONLINE       0    16     0
        gptid/25154c0a-9652-11eb-98e6-000c29be6648  ONLINE       0     0   434  (resilvering)
        gptid/254a36d3-9652-11eb-98e6-000c29be6648  ONLINE       0     0   142  (resilvering)
        gptid/26a5958f-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/796d52d2-b6ca-11eb-b288-000c29be6648  ONLINE   1.08K  1005     0
        gptid/268a4690-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/276725a8-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/284d86dc-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/5e647216-b70c-11eb-b288-000c29be6648  ONLINE   4.81K 4.38K     0
        gptid/2826445a-9652-11eb-98e6-000c29be6648  ONLINE   6.70K 7.65K     0
        gptid/287f6272-9652-11eb-98e6-000c29be6648  ONLINE   6.01K 7.04K     0
        gptid/28897e41-9652-11eb-98e6-000c29be6648  ONLINE   5.87K 6.95K     0
        gptid/476d231d-1687-11ed-90d7-000c29be6648  ONLINE       0     0 20.1K  (resilvering)
    spares
      gptid/e19d6609-65ba-11ec-bf4a-000c29be6648    AVAIL
      gptid/e2208018-65ba-11ec-bf4a-000c29be6648    AVAIL
      gptid/e23a6cf8-65ba-11ec-bf4a-000c29be6648    AVAIL
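
To pick out the affected members from a status listing like the one above, the READ/WRITE/CKSUM columns can be parsed mechanically. This is a rough sketch in Python under a few assumptions: the column layout is as shown above, the abbreviated counters (such as 1.08K) use powers of 1024 (I believe that is how zpool abbreviates them, but treat the converted numbers as approximate), and the names SAMPLE_STATUS, to_count and devices_with_errors are invented for this example.

```python
import re

# A few rows copied from the zpool status output above.
SAMPLE_STATUS = """\
    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz3-0                                      ONLINE       0    16     0
        gptid/25154c0a-9652-11eb-98e6-000c29be6648  ONLINE       0     0   434  (resilvering)
        gptid/796d52d2-b6ca-11eb-b288-000c29be6648  ONLINE   1.08K  1005     0
        gptid/476d231d-1687-11ed-90d7-000c29be6648  ONLINE       0     0 20.1K  (resilvering)
    spares
      gptid/e19d6609-65ba-11ec-bf4a-000c29be6648    AVAIL
"""

# name, state, then the three error counters; header and spare rows do not match.
ROW = re.compile(
    r"^\s*(\S+)\s+(?:ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def to_count(text):
    """Convert a counter such as '16' or '1.08K' to an approximate integer."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unrecognised counter: {text!r}")
    exponent = " KMGT".index(match.group(2) or " ")
    # Assumption: suffixes are powers of 1024.
    return round(float(match.group(1)) * 1024 ** exponent)


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for every row with a nonzero counter."""
    rows = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        name = match.group(1)
        counts = tuple(to_count(match.group(i)) for i in (2, 3, 4))
        if any(counts):
            rows.append((name, *counts))
    return rows


if __name__ == "__main__":
    for name, read, write, cksum in devices_with_errors(SAMPLE_STATUS):
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

In the full listing above, five pool members show read and write errors, while the checksum errors sit on the three members marked as resilvering.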


The ESXi host seems to be acting up too (I'm unable to edit the settings of the TrueNAS VM), so this might be an ESXi issue.

So far I haven't touched either TrueNAS or the ESXi host, besides gathering logs.

I would appreciate any and all suggestions on next steps for 1) investigating and resolving the underlying issue and 2) attempting to bring the pool back online (preferably with all the data =) ).

Thanks in advance!
 

Alibuba

Cadet
Joined
Jan 30, 2021
Messages
6
Just a quick update on this. I ended up shutting down both the VMs on the host and the host itself. The system booted without issues, and everything seemed fine in the LSI BIOS. After launching ESXi I brought TrueNAS (12.0-U8) back online. It took quite a while for the pool to be brought online, but eventually TrueNAS finished booting, and the pool is now once again being resilvered with no data errors.

Code:
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Aug  7 22:29:40 2022
    11.1T scanned at 7.52G/s, 1.23T issued at 854M/s, 25.3T total
    103G resilvered, 4.86% done, 08:12:29 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz3-0                                      ONLINE       0     0     0
        gptid/25154c0a-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/254a36d3-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/26a5958f-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/796d52d2-b6ca-11eb-b288-000c29be6648  ONLINE       0     0     0
        gptid/268a4690-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/276725a8-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/284d86dc-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/5e647216-b70c-11eb-b288-000c29be6648  ONLINE       0     0     0
        gptid/2826445a-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/287f6272-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/28897e41-9652-11eb-98e6-000c29be6648  ONLINE       0     0     0
        gptid/476d231d-1687-11ed-90d7-000c29be6648  ONLINE       0     0     2  (resilvering)
    spares
      gptid/e19d6609-65ba-11ec-bf4a-000c29be6648    AVAIL
      gptid/e2208018-65ba-11ec-bf4a-000c29be6648    AVAIL
      gptid/e23a6cf8-65ba-11ec-bf4a-000c29be6648    AVAIL

errors: No known data errors
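
As a sanity check on the two "to go" estimates, the remaining time appears to be roughly (total - issued) / issue rate. Here is a small Python sketch of that arithmetic using the figures from both status outputs. It assumes the T/G/M suffixes are binary units, and the function names are invented for illustration. Because the displayed figures are rounded, the results only land within a few minutes of what zpool printed.

```python
# Assumption: zpool's size suffixes are powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a string such as '25.3T' or '61.4M' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def eta_seconds(total, issued, rate_per_second):
    """Remaining time if the rest of the data is issued at the current rate."""
    return (to_bytes(total) - to_bytes(issued)) / to_bytes(rate_per_second)


def percent_done(total, issued):
    return 100.0 * to_bytes(issued) / to_bytes(total)


def format_duration(seconds):
    """Render seconds as 'D days HH:MM:SS'."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days} days {hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Before the reboot: 5.18T issued at 61.4M/s, 25.3T total.
    before = eta_seconds("25.3T", "5.18T", "61.4M")
    # After the reboot: 1.23T issued at 854M/s, 25.3T total.
    after = eta_seconds("25.3T", "1.23T", "854M")
    print("before:", format_duration(before), f"{percent_done('25.3T', '5.18T'):.2f}%")
    print("after: ", format_duration(after), f"{percent_done('25.3T', '1.23T'):.2f}%")
```

The difference between the two estimates comes almost entirely from the issue rate, which went from 61.4M/s before the reboot to 854M/s after it.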


Hopefully everything goes smoothly this time around
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
All I can add is that checksum errors are often, but not always, the result of a cabling issue.
Also, make sure that the LSI cards have plenty of airflow.
 

Alibuba

Cadet
Joined
Jan 30, 2021
Messages
6
Thanks, @NugentS. The system is definitely running "hottish", with the ambient temperature at around 30°C. Even with plenty of fans, the internal temperature is surely somewhere between 40 and 50°C.

Many of the 3.5" bays in the case are quite cramped, so there are some tight bends in the SFF-8087-to-SATA cables.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Given that you had multiple checksum errors over multiple devices, I would look at some additional cooling on the LSI card.

I could, however, be leading you down the garden path. It's just an idea.
 