Unhealthy pool: thousands of checksum errors on two of four drives

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
I noticed today that one of my ZFS pools was unhealthy and that two of the four disks have over 10,000 checksum errors. I am running a scrub now that will take around 11 hours, but searching the forums, I haven't found any reports of pool errors quite like this. My system specs are in my signature; the affected pool is made up of four Seagate Exos X18 18TB SAS 12Gb/s enterprise HDDs. All the drives are connected to the host controller using mini-SAS.

The only thing I did differently from usual was copying data from one pool to another over an SSH session on the bare-metal server itself, instead of moving the files over the network like I always do. I'm unsure if that could cause all the checksum errors. Also, do you know why two of the four drives would have all the errors and the other two none? The SMART status is OK on all the drives.

The files on the pool seem OK. Is there any way to "fix" this without rebuilding the entire pool? It wouldn't be the end of the world, as these are my newest drives, but I have a lot of data already copied to it and would not like to have to redo all that backing up (it took days). Could it be my controller, as only two drives have checksum errors?

I'm just looking for a starting point for diagnosing the issue; the system works great other than this new "unhealthy" flag on the new pool. I have another large storage pool and several smaller pools on the same controller that have never been unhealthy. Thanks, and I hope I have given you all the information you need and that this is posted in the appropriate forum. I have not needed to post here very often.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Were the drives new?
Did you stress test them before using them?
Is there a common cable between the two drives? Checksum errors are often (but not always) cabling issues. Remove and reseat the common cable (if there is one).
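If it helps, a quick way to see which devices are racking up checksum errors and to map them to serial numbers is something like this (the pool and device names are just placeholders):

zpool status -v tank    # the CKSUM column shows per-device checksum errors
smartctl -i /dev/da1    # prints the model and serial number of a suspect drive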
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
No, I did not stress test the drives. I did move a lot of data to the pool when I first got them, though, and it seemed fine. So, you suspect the cable because the two drives may share the same cable? The copy from one pool to another through SSH on the bare metal shouldn't be why it happened, then. Thanks for the tip; I will try reseating them after the scrub. I had never heard of stress testing drives. Is that something you do with a script, or is there another way to do it?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947

I stress test them by:
1. running smartctl -t long to get a baseline
2. badblocks (destructive)
3. another smartctl -t long to get results - but it's really the badblocks results that matter.

@dak180 has a script that does all this for you. https://github.com/dak180/disk-burnin-and-testing/blob/topic/burnin/disk-burnin.sh#L138
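In case you want to run the steps by hand, they look roughly like this (the device name is just an example, and the badblocks pass destroys everything on the drive):

smartctl -t long /dev/da1    # baseline long self-test
smartctl -a /dev/da1         # check the self-test result and error counters once it finishes
badblocks -wsv /dev/da1      # destructive write/read test; very large drives may need a bigger -b block size
smartctl -t long /dev/da1    # second long self-test after badblocks
smartctl -a /dev/da1         # compare against the baseline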
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
Last question: is the data that was written while those checksum errors were occurring corrupt, and will I need to rewrite it if all that was wrong was a cable issue?

and thank you very much for your help!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I don't think so - but I don't know.
 

dak180

Patron
Joined
Nov 22, 2017
Messages
310
I think I might write up a man page and maybe a resource at some point during my next vacation.
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
I have been avoiding opening up my server as it is a pain; I'm just not using the pool now. Once I try reseating the cable, I will let you know what happens. I looked up the serials of the affected drives and will focus on that cable since the rest seems fine. The data seems OK, but I'm curious if it might have problems. If anyone knows, please let me know.
 

whizzard

Dabbler
Joined
Mar 10, 2023
Messages
11
OK, so I opened it up; it was quite a bit dustier than I thought, so I cleaned it out. I found the two drives by serial number that were throwing errors and reseated them, and also reseated both mini-SAS cables at the HBA. After getting the system back on, it now reports no errors and that the pool is healthy. That is good news. The old error counts disappeared without me clearing anything. Is ZFS doing its magic? I am pushing data to it now to test it out a bit.

My question is, can I run those stress tests without losing the data that is on the pool? I know I should have tested it before uploading TBs of data to the pool, but here we are. Thanks again for the help. I truly appreciate it.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
The stress tests (the badblocks part) are destructive (for best results).
You can run badblocks in a non-destructive read-write mode, where it reads the data first and writes it back afterwards (but that is even slower).
Or, as an alternative, see the badblocks resource, which apparently contains a faster alternative, and the commands seem to exist in SCALE.
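For reference, the non-destructive variant is something along these lines (the device name is a placeholder; it is still very slow, and you should still have backups before running it):

badblocks -nsv /dev/da1    # non-destructive read-write mode: reads each block, tests it, then writes the original data back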
 

PK1048

Cadet
Joined
Feb 23, 2022
Messages
7
The per-device ZFS checksum error count is reset upon boot. I would start a scrub on the pool and let it complete before loading any new data. You do not NEED to scrub, as ZFS will find the bad (checksum) blocks when reading the data (assuming a redundant topology, mirror or RAIDZ) and automatically correct and re-write the bad copy.
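For example, something like this (the pool name is a placeholder):

zpool scrub tank     # start a scrub of the pool
zpool status tank    # shows scrub progress and the per-device CKSUM counters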

You can also clear the ZFS checksum error count via `zpool clear <pool>` if you know the cause (a drive hot pulled and replaced, for example). Not a good idea to clear that counter if you do not know why it was not zero.

I have seen the checksum count go up due to port multiplier errors and bad cables/controller more often than bad disk, but I tend to replace drives for capacity reasons before they fail.
 