Help request: troubleshoot pool checksum errors

ppmax

Contributor
Joined
May 16, 2012
Messages
111
Hello--

I have a raidz2 pool with 4 drives (see my sig). This pool functions as a backup for lots of pictures and other media. I do not keep backups of this pool. I've been using FreeNAS since 2012 or so and have been able to diagnose/troubleshoot and repair some of the minor disk-related issues that have popped up.

Two months ago I started getting some intermittent errors after scrubs; I'd just delete the odd file or two and move on. At some point I started seeing some application crashes (Channels-DVR), so I started running long SMART tests and eventually found that one drive was starting to fail...so I replaced it and the new disk was resilvered.

Subsequent scrubs have resulted in degraded pool status with thousands of checksum errors (but no read or write errors) and permanent errors detected in about 2800 files.
[Attached screenshot: 1583008388884.png]


Strangely, long SMART tests don't show any sector issues or offline uncorrectables, zpool clear returns the pool to Healthy status, and I haven't had any problems opening any of a random sampling of the 2800 files that apparently have permanent errors. Channels-DVR still crashes intermittently, which is worrying...but PLEX hasn't crashed once. On the assumption that the 2800-odd files are actually fine, I've tried moving (instead of deleting) these files to a new directory in the hope that doing so would result in a clean scrub. A new scrub just completed and I've still got 2800 files with permanent errors, and the screenshot above still shows thousands of checksum errors.
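For anyone following along, the pattern described here (checksum errors with zero read and write errors) can be picked out of the status output mechanically. Below is a minimal Python sketch, not a definitive tool. It assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout of `zpool status`; the sample text, pool name, device names, and counts are all made up for illustration and are not from this system.

```python
import re

# Made-up sample in the general shape of `zpool status` output.
# Pool name, device names, and counts are hypothetical.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            ada0    DEGRADED     0     0  1.2K
            ada1    DEGRADED     0     0  1.1K
            ada2    ONLINE       0     0     0
            ada3    DEGRADED     0     0  1.3K

"""

SUFFIXES = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_count(text):
    """Convert a counter such as '0', '37' or '1.2K' to an approximate int."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMG]?)", text)
    if match is None:
        raise ValueError(f"unrecognised counter: {text!r}")
    number, suffix = match.groups()
    return int(round(float(number) * SUFFIXES.get(suffix, 1)))


def error_counters(status_text):
    """Map each name in the config table to (read, write, cksum) counts."""
    counters = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the config table
        if len(fields) < 5:
            continue  # section labels without counters
        name, _state, read, write, cksum = fields[:5]
        counters[name] = (parse_count(read), parse_count(write), parse_count(cksum))
    return counters


def checksum_only(counters):
    """Names that show checksum errors but no read or write errors."""
    return sorted(
        name
        for name, (read, write, cksum) in counters.items()
        if cksum > 0 and read == 0 and write == 0
    )


if __name__ == "__main__":
    print(checksum_only(error_counters(SAMPLE_STATUS)))
```

On a real system the text would come from running `zpool status -v` against the pool rather than from the sample string. Abbreviated counters such as `1.2K` are rounded by the tool, so the parsed values are approximate.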

I've read all the relevant illumos.org docs, and I have a feeling there is something sinister lurking behind all these checksum errors, but I can't put my finger on it. I've even pulled this box apart and re-seated all SATA cables, RAM, etc. FWIW I tried creating a bootable USB with memtest86 but haven't been able to get this box to boot from that stick...so I haven't been able to rule out RAM problems.

Although this pool is one of a few separate backups I have of various things, I'm loath to do anything radical that would require a weeks-long rsync from authoritative sources...so I'm consulting with the community before doing anything drastic. If anyone has any suggestions I would appreciate it!

thx
PP
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Cabling is usually at fault in a case like the one you are presenting here, where no read or write errors are shown but checksum errors are.

ppmax said: "I've even pulled this box apart and re-seated all SATA cables"
That's the right direction, but perhaps you need to investigate the cables themselves.
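One way to gather evidence about the cables before swapping them is to look at the interface CRC counter that many SATA drives report through SMART (commonly attribute 199, `UDMA_CRC_Error_Count`). The Python sketch below reads that counter from text shaped like the attribute table printed by `smartctl -A`. It is only a sketch: the sample table and its values are made up, the attribute name varies by vendor, and the helper names are hypothetical.

```python
# Made-up excerpt shaped like the attribute table from `smartctl -A`.
# All values are hypothetical.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       412
"""


def raw_values(smart_text):
    """Map attribute names to integer raw values from a smartctl -A table."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def crc_errors(values):
    """Return the interface CRC error count, or None if it is not reported."""
    for name, raw in values.items():
        if "CRC_Error_Count" in name:
            return raw
    return None


if __name__ == "__main__":
    count = crc_errors(raw_values(SAMPLE_SMART))
    if count is None:
        print("No CRC error attribute reported by this drive")
    else:
        print(f"Interface CRC errors recorded: {count}")
```

The raw count is cumulative over the drive's lifetime and does not reset when a cable is replaced, so the useful signal is whether it keeps climbing between checks, not its absolute value.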
 

ppmax

Thank you for your reply sretalla--much appreciated. I'll power down the box and try some new cables.

Out of curiosity:
Can file corruption and all these checksum errors be caused by network-related or power-related issues?