Disk Checksum Errors - Is this normal?

UDSGuy

Cadet
Joined
Sep 26, 2023
Messages
4
Howdy folks -

I'm relatively new to the TrueNAS world, although I've worked in the NAS industry for 10+ years. A few months ago I built a TrueNAS Scale machine to replace the now-unsupported Western Digital EX2 device I've had here at home for the past 7 years. I enjoyed setting it up and love being able to use it to host VMs that I spin up and down for various testing projects for my work.

A week or so ago I started getting "Critical Alerts" notifications, which came as a surprise because the TrueNAS build had been flawless up to that point. I hadn't had to touch the build since I set it up, so I searched the forum and found a few threads, including a great one that demonstrated how to scrub the pool and clear the errors. I also reseated the cables as recommended. That initially instilled some confidence, but the errors returned. I then replaced the SATA cables; the number of errors seemed to drop a little, to 1-2 per drive, but they didn't go away. Next I installed a SATA adapter card I had purchased because my original intent was to use 5 SSDs, which made no difference. The errors go away after I scrub and clear, but 1 or 2 sporadically reappear on 1-3 of the drives. The PNY drives don't support S.M.A.R.T., so I haven't been able to run the diagnostics.

So my question is - Is this normal? Should I be concerned or just be happy that ZFS is doing its thing and protecting the data?

Below is my build, followed by a few glamour shots of the system. Since the photos were taken, the red cables have been swapped out and a 6-port SATA adapter has been installed.

thanks!

TrueNAS-SCALE-22.12.3.3
Gigabyte A520i AC
AMD Ryzen 7 5700G
64GB RAM
(4) 2TB PNY CS900 - RAIDZ2
(1) 2TB Crucial P3 NVMe (OS)
BEYIMEI PCIe 4X SATA 6 Port ASM1166/SATA 3.0


Attachments: image0 (1).jpeg, image1 (1).jpeg, image2.jpeg, image3.jpeg
 

UDSGuy

Update:

I haven't received any responses here, but I thought I would post an update anyway for anyone who reads this in the future. I found a thread on another forum that mentioned testing the memory, so on a whim I removed one of the 32GB memory modules to see what would happen. I booted and ran two scrubs, and the errors are gone. Previously I was getting 0-3 errors per SSD. I'll scrub a few more times over the next few days for further testing.
 

UDSGuy

I did some more testing. I pulled the installed DIMM and replaced it with the DIMM I thought was bad: no errors. I then added the second DIMM back: still no errors.

The only other thing I've changed was turning on Auto TRIM immediately after I discovered the errors.

Currently, no errors. *sigh*
 

Belphegor

Dabbler
Joined
Mar 21, 2020
Messages
11
UDSGuy said: "So my question is - Is this normal? Should I be concerned or just be happy that ZFS is doing its thing and protecting the data?"
To answer your original question, this behaviour is not normal and you should definitely investigate why these errors happen.

At this stage I would recommend backing up your data (if not already done) and then checking two things:
  1. Perform RAM stability tests under load, since the motherboard and CPU do not support ECC memory. To do this, run a memory test such as PassMark MemTest86 for at least 24 hours with both memory sticks in use. If errors are reported, you might want to retry using only one memory stick. It could be a defective stick, or the memory controller might not be able to operate reliably with both sticks at the current timings.
  2. The inability to retrieve SMART values from the disks could be because the drives are accessed through a SATA expansion card. I would recommend connecting the drives directly to the SATA ports on the motherboard to see whether this helps in retrieving the diagnostic data.
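Since the errors come and go between scrubs, it may also help to record the per-device counters after each change (one stick, both sticks, motherboard ports) instead of relying on the alerts. Below is a minimal Python sketch of that idea, not a tested tool: it assumes the usual `NAME STATE READ WRITE CKSUM` table layout of `zpool status`, and the pool name `tank`, the device names, and the counts in `SAMPLE` are made up for illustration (on TrueNAS SCALE the devices will probably show up under different identifiers).

```python
import subprocess

# Hypothetical sample output, for illustration only. It is not taken from
# the system discussed in this thread.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     2
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     1
            sdd     ONLINE       0     0     0

errors: No known data errors
"""


def parse_error_counts(status_text):
    """Return {name: (read, write, cksum)} for each row of the device table.

    Counters are kept as strings rather than converted to int, on the
    assumption that large values may be printed in abbreviated form.
    """
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True          # device rows start after this header
            continue
        if in_table:
            if not fields:           # a blank line ends the device table
                in_table = False
            elif len(fields) >= 5:   # skip section labels and short rows
                counts[fields[0]] = tuple(fields[2:5])
    return counts


def devices_with_errors(counts):
    """Keep only the rows where at least one counter is not "0"."""
    return {name: c for name, c in counts.items() if any(v != "0" for v in c)}


def read_live_status():
    """Fetch the real output; needs the zpool binary and enough privileges."""
    result = subprocess.run(
        ["zpool", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `devices_with_errors(parse_error_counts(read_live_status()))` after each scrub and saving the result with a note about the hardware configuration gives you a simple log to compare. With more than one pool, vdev names such as `raidz2-0` can collide in the dictionary, so treat this as a starting point only.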
 