Unhealthy pool: thousands of checksum errors on two of four drives

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
I noticed today that one of my ZFS pools was unhealthy and that two of the four disks have over 10,000 checksum errors. I am running a scrub now that will take around 11 hours, but searching the forums, I haven't found any reports of pool errors quite like this. My system specs are in my signature; the affected pool is made up of four Seagate Exos X18 18TB SAS 12Gb/s enterprise HDDs. All the drives are connected to the host controller using mini-SAS.

The only thing I did differently from usual was copying data from one pool to another over an SSH session on the bare-metal server itself, instead of moving the files over the network like I always do. I'm unsure if that could cause all the checksum errors. Also, do you know why two of the four drives would have all the errors and the other two none? The SMART status is OK on all the drives.

The files on the pool seem OK. Is there any way to "fix" this without rebuilding the entire pool? It wouldn't be the end of the world, as these are my newest drives, but I have a lot of data already copied to it and would not like to have to redo all that backing up (it took days). Could it be my controller, as only two drives have checksum errors?

I'm just looking for a starting point for diagnosing the issue; the system works great other than this new "unhealthy" flag on the new pool. I have another large storage pool and several smaller pools on the same controller that have never been unhealthy. Thanks, and I hope I have given you all the information you need and that this is posted in the appropriate forum. I have not needed to post here very often.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Were the drives new?
Did you stress test them before using them?
Is there a common cable between the two drives? Checksum errors are often (but not always) cabling issues. Remove and reseat the common cable (if there is one).
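If it helps, a quick way to see which devices are racking up checksum errors and to map them to serial numbers is something like this (the pool and device names are just placeholders):

zpool status -v tank    # the CKSUM column shows per-device checksum errors
smartctl -i /dev/da1    # prints the model and serial number of a suspect drive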
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
No, I did not stress test the drives. I did move a lot of data to the pool when I first got them, though, and it seemed fine. So, you suspect the cable because the two drives may share the same cable? The copy from one pool to another through SSH on the bare metal shouldn't be why it happened, then. Thanks for the tip; I will try reseating them after the scrub. I had never heard of stress testing drives. Is that something you do with a script, or is there another way to do it?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947

I stress test them by:
1. running smartctl -t long to get a baseline
2. badblocks (destructive)
3. another smartctl -t long to get results - but it's really the badblocks results that matter.

@dak180 has a script that does all this for you. https://github.com/dak180/disk-burnin-and-testing/blob/topic/burnin/disk-burnin.sh#L138
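In case you want to run the steps by hand, they look roughly like this (the device name is just an example, and the badblocks pass destroys everything on the drive):

smartctl -t long /dev/da1    # baseline long self-test
smartctl -a /dev/da1         # check the self-test result and error counters once it finishes
badblocks -wsv /dev/da1      # destructive write/read test; very large drives may need a bigger -b block size
smartctl -t long /dev/da1    # second long self-test after badblocks
smartctl -a /dev/da1         # compare against the baseline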
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
Last question: is the data that was written while those checksum errors were occurring corrupt, and will I need to rewrite it if all that was wrong was a cable issue?

and thank you very much for your help!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I don't think so - but I don't know.
 

dak180

Patron
Joined
Nov 22, 2017
Messages
310
I think I might write up a man page and maybe a resource at some point during my next vacation.
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
I have been avoiding opening up my server as it is a pain; I'm just not using the pool now. Once I try reseating the cable, I will let you know what happens. I looked up the serials of the affected drives and will focus on that cable since the rest seems fine. The data seems OK, but I'm curious if it might have problems. If anyone knows, please let me know.
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
OK, so I opened it up; it was quite a bit dustier than I thought, so I cleaned it out. I found the two drives by serial number that were throwing errors and reseated them, and also reseated both mini-SAS cables at the HBA. After getting the system back on, it now reports no errors and that the pool is healthy. That is good news. The old error counts disappeared without me clearing anything. Is ZFS doing its magic? I am pushing data to it now to test it out a bit.

My question is, can I run those stress tests without losing the data that is on the pool? I know I should have tested it before uploading TBs of data to the pool, but here we are. Thanks again for the help. I truly appreciate it.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
The stress tests (the badblocks part) are destructive (for best results).
You can run badblocks in a non-destructive read-write mode, where it reads the data first and writes it back afterwards (but that is even slower).
Or, as an alternative, see the badblocks resource, which apparently contains a faster alternative, and the commands seem to exist in SCALE.
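For reference, the non-destructive variant is something along these lines (the device name is a placeholder; it is still very slow, and you should still have backups before running it):

badblocks -nsv /dev/da1    # non-destructive read-write mode: reads each block, tests it, then writes the original data back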
 

PK1048

Cadet
Joined
Feb 23, 2022
Messages
7
The per-device ZFS checksum error count is reset upon boot. I would start a scrub on the pool and let it complete before loading any new data. You do not NEED to scrub, as ZFS will find the bad (checksum) blocks when reading the data (assuming a redundant topology, mirror or RAIDZ) and automatically correct and re-write the bad copy.
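For example, something like this (the pool name is a placeholder):

zpool scrub tank     # start a scrub of the pool
zpool status tank    # shows scrub progress and the per-device CKSUM counters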

You can also clear the ZFS checksum error count via `zpool clear <pool>` if you know the cause (a drive hot pulled and replaced, for example). Not a good idea to clear that counter if you do not know why it was not zero.

I have seen the checksum count go up due to port multiplier errors and bad cables/controller more often than bad disk, but I tend to replace drives for capacity reasons before they fail.
 