8 drives faulted at the same time?

T4ke

Cadet
Joined
Jul 22, 2021
Messages
6
Hi there,
I'm on TrueNAS SCALE 22.12.2, running on ESXi 7.0 U3.
For about two weeks I have been getting read, write and especially checksum errors on the pool, and they keep increasing in number.
At the beginning there were only 20-50 checksum errors; now more than 200k pile up within a very short time.
I have already changed the controller twice, checked and replaced all cables, re-seated all drives, swapped drive bays and even migrated the system via vMotion to another ESXi host, because I could not rule out a damaged backplane either. Unfortunately, the errors are also present on the second ESXi host.
I really have no clue what could cause this many errors on practically new drives; they were purchased in January this year. The controllers also seem to be fine.
The drives are 8 x ST12000NM002G, configured in RAID-Z2. The SMART data doesn't show anything suspicious, disks seem to be fine.

Specs ESXi host 01:
VMware ESXi, 7.0.3, 21686933
Supermicro X12SPL-F
Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
192GB DDR4 ECC RAM

Specs ESXi host 02:
VMware ESXi, 7.0.3, 21686933
Supermicro X11SRA-RF
Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz
64GB DDR4 ECC RAM

HBAs tested:
LSI 9300-16i (original HBA)
2 HPE Smart Array H240 (backup HBAs)
All flashed / configured in IT mode on latest firmware.
The HBAs are in PCI passthrough mode to the TrueNAS VM.
I also still have an old Adaptec 71605 lying around but haven't tested it yet.
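
To double-check the passthrough from inside the TrueNAS VM, something along these lines confirms the guest sees the HBA and what firmware/mode it reports (rough sketch; sas3flash only applies to the LSI 9300-16i and may need to be installed separately):

Code:
# Confirm the passed-through HBA is visible to the guest
lspci | grep -iE 'lsi|sas|smart array'

# LSI 9300-16i only: show controller, firmware version and IT/IR mode
sas3flash -list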

'zpool status tank01 -v' currently shows the following:

Code:
  pool: tank01
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 49.7G in 00:12:14 with 2045 errors on Mon May 15 15:18:36 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        tank01                                    DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            c751ab6b-6fe5-49e2-a637-c373611c4c77  DEGRADED     0     0 5.89K  too many errors
            9d5db49f-15b6-44e2-8d50-3768cd32304b  DEGRADED     0     0 5.90K  too many errors
            8fa742b0-bd8f-4423-a1d7-36e3a6d0781e  DEGRADED     0     0 5.90K  too many errors
            b490113d-cc59-4ccd-ab8e-8ca953f9820f  DEGRADED     0     0 5.75K  too many errors
            eed08a51-f1f3-4aa9-895a-9fef0aad0ecf  DEGRADED     0     0 5.58K  too many errors
            6a58ae25-6998-4ea1-af30-83e0f1546f66  DEGRADED     0     0 5.41K  too many errors
            7e31423d-2a9f-412c-a1e2-954a85708974  DEGRADED     0     0 5.72K  too many errors
            62e33e68-1443-406f-becc-60e486da5889  DEGRADED     0     0 5.75K  too many errors

errors: Permanent errors have been detected in the following files:

        tank01/Backup:<0x0>
        tank01/Backup:<0x1903>
        tank01/Backup:<0x1905>
        tank01/Backup:<0x1719>
        tank01/Backup:<0x1830>
        tank01/Backup:<0x1940>
        tank01/Backup:<0x1943>
        tank01/Backup:<0x1945>
        /mnt/tank01/Backup/wordpress_backups/file1.zip
        /mnt/tank01/Backup/wordpress_backups/file2.zip
        /mnt/tank01/Backup/wordpress_backups/file3.zip
        /mnt/tank01/Backup/wordpress_backups/file4.zip
        tank01/Backup:<0x1755>
        tank01/Backup:<0x195a>
        tank01/Backup:<0x18a5>
        tank01/Backup:<0x19ad>
        tank01/Backup:<0x17ae>
        tank01/Backup:<0x17af>
        tank01/Backup:<0x17b0>
        /mnt/tank01/Backup/wordpress_backups/file5.zip


I just can't imagine 8 practically new hard drives going belly up at the same time.
Does anyone have a hint that could get me going in the right direction?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I also still have an old Adaptec 71605 lying around but haven't tested it yet.
Don't do that, it won't be helpful in the long-term (even if it does work initially).

even migrated the system via vMotion to another ESXi host,
I don't think you can do that with a PCI passthrough in place... what's happening to allow that?
 

T4ke

Cadet
Joined
Jul 22, 2021
Messages
6
Don't do that, it won't be helpful in the long-term (even if it does work initially).


I don't think you can do that with a PCI passthrough in place... what's happening to allow that?
My bad, you are right. I wasn't 'vMotion'-ing: I shut down the VM, removed the PCI passthrough and cold-migrated the VM to the other host, where I re-added the HBA in passthrough. Thanks for correcting me.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I recall there was an issue with IronWolf drives and dropouts several years ago that was related to write caching and command queuing. Drives would experience command timeouts under load. Disabling the write cache and NCQ prevented these failures, but at a significant cost to performance.
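
For context, the interim workaround back then was roughly the following on a Linux system (a sketch with placeholder device names, not a recommendation; SAS drives take the sdparm route rather than hdparm):

Code:
# SATA: turn off the drive's volatile write cache
hdparm -W0 /dev/sdX

# Effectively disable NCQ by capping the queue depth at 1
echo 1 > /sys/block/sdX/device/queue_depth

# SAS: clear the write cache enable (WCE) bit instead
sdparm --set WCE=0 --save /dev/sdX

Both changes hurt throughput noticeably, which is why the firmware route below was the better outcome.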


According to the users there, it was ultimately resolved by a firmware update released by Seagate.

Can you check to see if there's a firmware update available for your drive(s)? Seagate requires a serial number to search.
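
A quick way to pull the serial number and installed firmware revision from each drive for that lookup (a sketch; /dev/sdX is a placeholder and the loop assumes the disks show up as /dev/sd?):

Code:
# One drive: model, serial number and firmware revision
smartctl -i /dev/sdX

# All drives at once
for d in /dev/sd?; do echo "== $d"; smartctl -i "$d" | grep -E 'Model|Product|Serial|Firmware|Revision'; done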

 

T4ke

Cadet
Joined
Jul 22, 2021
Messages
6
Hey mate, thanks for the advice. Unfortunately, there are no firmware updates available for my drives.
Right now I'm scrubbing the pool on the second ESXi host to see where it goes. The drives are attached to one of the HPE H240s; checksum errors are still popping up, but at least there are no read or write errors any more.
Any advice is welcome.
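
For anyone following along, running and watching the scrub comes down to the standard commands (shown here with this pool's name):

Code:
zpool scrub tank01        # start the scrub
zpool status -v tank01    # progress plus per-device error counters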


Code:
  pool: tank01
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon May 15 16:10:00 2023
        21.6T scanned at 1.42G/s, 20.5T issued at 1.35G/s, 30.6T total
        16K repaired, 66.87% done, 02:08:36 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        tank01                                    DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            c751ab6b-6fe5-49e2-a637-c373611c4c77  DEGRADED     0     0 8.07K  too many errors
            9d5db49f-15b6-44e2-8d50-3768cd32304b  DEGRADED     0     0 8.08K  too many errors
            8fa742b0-bd8f-4423-a1d7-36e3a6d0781e  DEGRADED     0     0 8.08K  too many errors
            b490113d-cc59-4ccd-ab8e-8ca953f9820f  DEGRADED     0     0 7.90K  too many errors
            eed08a51-f1f3-4aa9-895a-9fef0aad0ecf  DEGRADED     0     0 7.44K  too many errors
            6a58ae25-6998-4ea1-af30-83e0f1546f66  DEGRADED     0     0 7.10K  too many errors
            7e31423d-2a9f-412c-a1e2-954a85708974  DEGRADED     0     0 7.72K  too many errors
            62e33e68-1443-406f-becc-60e486da5889  DEGRADED     0     0 7.92K  too many errors
 
Joined
Jun 15, 2022
Messages
674
The SMART data doesn't show anything suspicious, disks seem to be fine.
I just can't imagine 8 practically new hard drives going belly up at the same time.
Does anyone have a hint that could get me going in the right direction?
@T4ke: Did you run smartctl -xall on the bare metal machines?
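If not, something like this captures the full report per drive once the disks are visible to a plain Linux environment on the bare metal (device names are placeholders):

Code:
# Extended SMART / device report for a single drive
smartctl -x /dev/sdX

# Save one report per drive for comparison
for d in /dev/sd?; do smartctl -x "$d" > "smart_$(basename "$d").txt"; done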
 