Checksum Errors on New Drive after Resilvering

Joined
Aug 29, 2023
Messages
6
So I'm having some issues with my NAS and rebuilding an array (TrueNAS); let me start at the beginning.

Setup: 8x 10TB (half Exos, half WD RED PRO) in RAIDZ2
SAS2008 card (Now SAS3008 in IT mode FW 16.0.12.0)
Ryzen 5 + Server Mobo
128 GB ECC DDR4 RAM

  1. I started seeing checksum errors on one of my drives. Tried some scrubs and clears, but the error counts kept jumping around (61, 133, 271, 45), so at this point I decided to replace the drive.
  2. I replaced the drive, but the new replacement drive shows the same checksum errors and scrubs are not fixing them. Counts still jump around after clears and scrubs (150, 61, 23). Tried swapping SAS ports; the issue follows the drive.
  3. Figured the new drive was damaged during shipping, so I got a second replacement WD RED PRO and resilvered the array again. The checksum errors return and the drive errors out; scrubs don't fix it, and checksum errors continue after clears and repeated scrubs.
  4. Figured it was a SAS card issue. Replaced it with a SAS3008 card plus new wiring, wiped the drive, and rebuilt. Still seeing checksum errors; tried scrubs and clears again, but the numbers jump around after each scrub (1, 61, 32).
  5. Upgraded the SAS HBA firmware to the recommended release. Checksum issue still occurs.
  6. Put the new drive on a motherboard SATA port to rule out an R/W issue with the HBA. Checksum issue still occurs.
  7. Removed the cache drives, then resilvered + scrubbed. Issue still occurs.
  8. Pulled both new drives into a separate system, checked for errors or SMART issues, wiped them with all zeros, put them back in TrueNAS, and resilvered + scrubbed. Checksum issue still occurs.
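For context, the clear/scrub loop I keep repeating is just `zpool clear storage_pool` followed by `zpool scrub storage_pool`, then watching the CKSUM column in `zpool status`. A throwaway sketch (hypothetical device names and counts, not my actual vdev layout) of how I've been pulling those counts out to log how they jump between runs:

```python
import re

# Sample `zpool status` device lines -- hypothetical names/counts,
# same column layout as the real output: NAME STATE READ WRITE CKSUM.
SAMPLE = """\
        NAME          STATE     READ WRITE CKSUM
        storage_pool  ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            sda       ONLINE       0     0     0
            sdb       ONLINE       0     0    61
"""

def cksum_counts(status_text):
    """Map device/vdev name -> CKSUM count from `zpool status` output."""
    counts = {}
    for line in status_text.splitlines():
        # name, state, then three integer columns (READ WRITE CKSUM);
        # the header line has no digits, so it never matches.
        m = re.match(r"\s+(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)$", line)
        if m:
            counts[m.group(1)] = int(m.group(4))
    return counts

print(cksum_counts(SAMPLE))
# {'storage_pool': 0, 'raidz2-0': 0, 'sda': 0, 'sdb': 61}
```

Running this after each scrub and diffing the dicts is how I know the counts are genuinely jumping around rather than accumulating.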

I do have all this data backed up, but I'm concerned that ZFS is unable to repair this. If it's a larger hardware issue or a bug, it's making me question my choices…

What do?


Also pulled this from the ZFS event log. All of the checksum errors look the same, with "bad_set_histogram = 0x0 0x0……" all zeros, which is odd.


Code:
Aug 27 2023 17:15:56.924972503 ereport.fs.zfs.checksum
        class = "ereport.fs.zfs.checksum"
        ena = 0x4ebc9aad14000801
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x70469b758805927c
                vdev = 0x9953bc42d191f128
        (end detector)
        pool = "storage_pool"
        pool_guid = 0x70469b758805927c
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "continue"
        vdev_guid = 0x9953bc42d191f128
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-partuuid/835c0980-932c-4093-95c9-37e1d0fb4478"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x54ebc9a12ba7
        vdev_delta_ts = 0xf973
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x1
        vdev_delays = 0x0
        parent_guid = 0x22ce662618e98491
        parent_type = "raidz"
        vdev_spare_paths =
        vdev_spare_guids =
        zio_err = 0x0
        zio_flags = 0x1008b0
        zio_stage = 0x200000
        zio_pipeline = 0x1f00000
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_priority = 0x4
        zio_offset = 0x27906359000
        zio_size = 0x1000
        zio_objset = 0x0
        zio_object = 0x0
        zio_level = 0x0
        zio_blkid = 0x0
        bad_ranges = 0x0 0x1d0 0x1e8 0x768 0x780 0x7d8 0x7e8 0x800
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x0 0x0 0x0 0x0
        bad_range_clears = 0xe67 0x2bf4 0x28b 0x85
        bad_set_histogram = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
        bad_cleared_histogram = 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf5 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf5 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf6 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf6 0xf6 0xf6 0xf6 0xf6 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5 0xf5
        time = 0x64ebbd0c 0x3721f5d7
        eid = 0xce

Aug 27 2023 17:38:48.402709403 ereport.fs.zfs.checksum
        class = "ereport.fs.zfs.checksum"
        ena = 0x62b1b18cc4000001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x70469b758805927c
                vdev = 0x9953bc42d191f128
        (end detector)
        pool = "storage_pool"
        pool_guid = 0x70469b758805927c
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "continue"
        vdev_guid = 0x9953bc42d191f128
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-partuuid/835c0980-932c-4093-95c9-37e1d0fb4478"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x562b1ae18c7a
        vdev_delta_ts = 0x22f8e12
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x2
        vdev_delays = 0x0
        parent_guid = 0x22ce662618e98491
        parent_type = "raidz"
        vdev_spare_paths =
        vdev_spare_guids =
        zio_err = 0x0
        zio_flags = 0x100880
        zio_stage = 0x200000
        zio_pipeline = 0x1f00000
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_priority = 0x0
        zio_offset = 0x27a4e4bb000
        zio_size = 0x1000
        zio_objset = 0x71
        zio_object = 0x1118
        zio_level = 0x0
        zio_blkid = 0x0
        bad_ranges = 0x0 0x518 0x530 0x800
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x0 0x0
        bad_range_clears = 0x2891 0x167c
        bad_set_histogram = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
        bad_cleared_histogram = 0xfd 0xfd 0xfd 0xfd 0xfc 0xfc 0xfc 0xfc 0xfd 0xfd 0xfd 0xfd 0xfd 0xfd 0xfd 0xfd 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfd 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc
        time = 0x64ebc268 0x1800db9b
        eid = 0xcf
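What jumps out at me in these events: `bad_set_histogram` is all zeros while `bad_cleared_histogram` is uniformly high (0xf5/0xf6 at every one of the 64 bit positions). As far as I understand the OpenZFS checksum ereport fields, the set histogram counts bits that were 1 in the bad data but 0 in the good reconstruction, and the cleared histogram counts the opposite, so if that's right, every corrupted bit here was flipped 1 -> 0. A quick sketch decoding the first event above (histograms abbreviated):

```python
# Values copied from the first ereport above.
bad_ranges = [0x0, 0x1d0, 0x1e8, 0x768, 0x780, 0x7d8, 0x7e8, 0x800]

# 64 per-bit-position counters each; abbreviated here, since the real
# event shows all zeros (set) and a near-uniform 0xf5/0xf6 (cleared).
bad_set_histogram = [0x0] * 64
bad_cleared_histogram = [0xf5] * 64

# bad_ranges is flattened (start, end) pairs of corrupted byte offsets.
pairs = list(zip(bad_ranges[::2], bad_ranges[1::2]))
bad_bytes = sum(end - start for start, end in pairs)

print(pairs)      # [(0, 464), (488, 1896), (1920, 2008), (2024, 2048)]
print(bad_bytes)  # 1984 bytes flagged bad in one 0x1000-byte read
print(sum(bad_set_histogram))      # 0   -> no 0->1 flips observed
print(min(bad_cleared_histogram))  # 245 -> 1->0 flips at every bit position
```

If that reading is right, uniform 1->0 bit drops across every bit position look more like something in the data path losing bits than like random media errors, which would fit the problem following the drive through two HBAs and motherboard SATA.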
 
Joined
Oct 22, 2019
Messages
3,641
Sounds like a combination of HBA and/or cabling and/or ports and/or motherboard.
 
Joined
Aug 29, 2023
Messages
6
Sounds like a combination of HBA and/or cabling and/or ports and/or motherboard.
I have replaced it with a brand-new HBA and brand-new cables. I have swapped ports (both on the HBA and to the motherboard), and the issue stays with the drive after the resilver.

To be honest, this looks like either a bug in the resilver process, a bug in the scrubbing process, or… the whole ZFS pool is FUBAR and failing to rebuild because of larger undetected massive data corruption.

Not saying this isn't a hardware issue, but I've replaced everything in the chain and I'm not seeing ECC or processor errors.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Which version of SCALE?
Which drive model had the original issue, and then you replaced it with a WD RED Pro?
 
Joined
Aug 29, 2023
Messages
6
Original Drive: WD PRO WDC_WD102KFBX
New Drive #1 : WD PRO WDC_WD102KFBX
New Drive #2 : WD PRO WDC_WD102KFBX
3rd Testing Drive: Elements_25A3 (just to see if it was some sort of Drive FW issue)

Note: all of these drives experience the same checksum-after-resilver issue, on both a SAS2008 controller and a SAS3008 controller, and on motherboard SATA. I should also note that this array was originally created on TrueNAS Core about 1.5 years ago.

I have also checked dmesg for any hardware issues and am not seeing anything. Also checked the IPMI controller for ECC errors, but I'm only seeing NTP clock syncs (normal) for the last 30 days.
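For anyone repeating the dmesg check: I just filtered for the usual hardware-error subsystems (EDAC for memory errors, MCE machine checks, PCIe AER). A rough sketch of the equivalent filter, with made-up placeholder lines instead of my real log:

```python
# Keywords worth looking for in dmesg when chasing memory/PCIe errors:
# EDAC (kernel memory-error reporting), MCE (machine checks),
# AER (PCIe Advanced Error Reporting).
KEYWORDS = ("EDAC", "Machine Check", "mce:", "AER:")

def error_lines(dmesg_text):
    """Return dmesg lines that mention any hardware-error keyword."""
    return [line for line in dmesg_text.splitlines()
            if any(k in line for k in KEYWORDS)]

# Placeholder lines for illustration only -- NOT real kernel output.
SAMPLE = """\
[    1.234567] usb 1-1: new high-speed USB device (placeholder)
[  100.000001] EDAC MC0: corrected error (placeholder line)
"""
print(error_lines(SAMPLE))  # only the EDAC placeholder line survives
```

In my case the equivalent filter over the real dmesg came back empty, which is why I don't think the kernel is seeing the corruption.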

Version
TrueNAS-SCALE-22.12.3.3


Some next steps I'm thinking of:

  1. Do a replacement on a drive that has not failed and see if I get the errors
  2. Remove all but one stick of ECC RAM, then do a resilver + scrub
  3. Move the array to a new motherboard and RAM and see if the rebuild process works (i.e., just taking the HBA card out and putting it in the PCIe slot of another system running TrueNAS)
 
Joined
Aug 29, 2023
Messages
6
So I tried:
  1. Do a replacement on a drive that has not failed and see if I get the errors
And just like the others, this one somehow got errors too..... so the premise still stands: either something is injecting errors (or a bug thinks there are errors when there are not), or the whole pool is FUBAR.... see screenshot below

Now attempting #2 from above to see if it's ECC/RAM-related.

Screenshot 2023-08-30 at 17.16.17.png
 
Joined
Aug 29, 2023
Messages
6
Did the single RAM stick test and the issue still occurs… I am now trying a rebuild on an Intel system with ECC memory… but it's not looking good…
 
Joined
Aug 29, 2023
Messages
6
Okay, so I did a rebuild on a spare Intel system and got no checksum errors..... doing a scrub to be sure, but I'm really confused.

Same HBA, same cables, same drives.... same-ish PSU (PSU only running the drives vs. the whole system)

So it might be:
  • Bad CPU
  • Bad PSU (550 W might be a little undersized for 8 drives and 2 NVMe SSDs + a 5600X)
  • Bad Memory (least expected)
  • Bad Mobo?
  • Weird ZFS bug with Ryzen 5600X + X570

So now I need the community's help: what should I replace first, CPU or PSU? Anywhere else I should look for errors? Maybe a BIOS setting?
 

Dash_Ripone

Cadet
Joined
Feb 18, 2023
Messages
5
I have a similar issue: I tried replacing the drives, and they immediately have a checksum issue when the replacement finishes.
I'm running WD Plus drives. At this point it has been going on since I switched to SCALE, but I have not noticed any issues with data integrity.
Hence I have given up trying to fix it and am living dangerously.
Screen Shot 2023-11-12 at 13.24.44.png
 