Possible memory corruption despite ECC?

bh1 · Jan 22, 2018

A routine scrub last night generated the alert "One or more devices has experienced an error resulting in data corruption. Applications may be affected."

Drives in the pool are all online and smartctl shows no errors on any of the drives. Output of "zpool status -v" is below, and as you can see there are several checksum errors but no read or write errors.

A few questions:

1. This seems like it has to be a memory corruption error, right? But I'm using ECC RAM ... is that even possible? I did run memtest when the server was installed, but that was a few years ago.
2. Can I trust zpool when it says that the data corruption is confined to "volume1/testzvol"? It is a 30TB array, and I'd really prefer not to restore the whole thing from backup if I don't have to.
3. What the heck do I do now?

Running FreeNAS-11.0-U1 (aa82cc58d) on a SuperMicro SC846E16-R1200B with BPN-SAS2-846EL1 backplane, X8DTE-F motherboard, 48GB RAM (12x 4GB PC3-10600R) and 2x Intel Xeon E5640 Quad Core.

Code:

  pool: volume1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 148K in 29h21m with 2 errors on Mon Jan 22 05:21:35 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         ONLINE       0     0     2
          raidz3-0                                      ONLINE       0     0     4
            gptid/ef32b9b5-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/eff4c195-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f0b1e9d0-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     3
            gptid/f1765d4c-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     7
            gptid/f23b41b9-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f2fd6619-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f3c44866-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     6
            gptid/f4861cf1-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     8
            gptid/f55a468b-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f614933a-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f6d98ade-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     5
            gptid/f79ba1a6-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        volume1/testzvol:<0x1>

rs225 · Jan 22, 2018

That is an unusual pattern. I will make a wild guess that it is caused by transient read corruption. The data is fine on disk, but it gets to memory with problems. A second scrub will probably turn up a different set of results, which would confirm you have inconsistent reads.

Have your past scrubs been turning up anything?
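
To compare two scrubs as suggested here, one option is to pull the CKSUM column out of each "zpool status" output and list the devices whose counts changed. This is only a rough Python sketch under a few assumptions: the helper names cksum_counts and changed_devices are made up for illustration, the numbers in SECOND are hypothetical, and counters are assumed to be plain integers (rows where zpool abbreviates a large count, such as 1.2K, are skipped). Keep in mind that the counters accumulate until they are reset with "zpool clear", so the comparison is only meaningful if you account for that.

```python
import re

# One config row: device name, state, then READ / WRITE / CKSUM counters.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def cksum_counts(status_text):
    """Map each pool/vdev/disk name to its CKSUM counter."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            counts[m.group(1)] = int(m.group(5))
    return counts

def changed_devices(before, after):
    """Return {name: (old, new)} for every device whose CKSUM count differs."""
    names = sorted(set(before) | set(after))
    return {
        n: (before.get(n, 0), after.get(n, 0))
        for n in names
        if before.get(n, 0) != after.get(n, 0)
    }

# Excerpt from the first scrub (values taken from the output in this thread).
FIRST = """
  raidz3-0                                      ONLINE  0 0 4
    gptid/f0b1e9d0-9173-11e5-8ee5-0025906b9b9a  ONLINE  0 0 3
    gptid/f1765d4c-9173-11e5-8ee5-0025906b9b9a  ONLINE  0 0 7
"""

# Hypothetical second scrub, invented purely to show the comparison.
SECOND = """
  raidz3-0                                      ONLINE  0 0 4
    gptid/f0b1e9d0-9173-11e5-8ee5-0025906b9b9a  ONLINE  0 0 0
    gptid/f1765d4c-9173-11e5-8ee5-0025906b9b9a  ONLINE  0 0 9
"""

if __name__ == "__main__":
    diff = changed_devices(cksum_counts(FIRST), cksum_counts(SECOND))
    for name, (old, new) in diff.items():
        print(f"{name}: {old} -> {new}")
```

If the set of affected disks moves around between scrubs, that points toward inconsistent reads rather than bad data on specific disks.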

Bidule0hm · Jan 22, 2018

Looks more like an HBA problem or a PSU problem.

If it were a RAM error, ECC would have corrected it if it could, or halted the system otherwise.

bh1 · Jan 22, 2018

rs225 said:
That is an unusual pattern. I will make a wild guess that it is caused by transient read corruption. The data is fine on disk, but it gets to memory with problems. A second scrub will probably turn up a different set of results, which would confirm you have inconsistent reads.

Have your past scrubs been turning up anything?

Thank you very much for the quick reply!

No, this is the first scrub that has ever generated any errors or warnings.

That theory makes sense ... I guess it is also consistent with Bidule0hm's suggestion that it might be an HBA problem?

If that is what is going on, should I disable scrubs for the time being? I worry that it will find what it thinks is an error, and create more problems by trying to fix it.

bh1 · Jan 22, 2018

Bidule0hm said:
Looks more like an HBA problem or a PSU problem.

If it were a RAM error, ECC would have corrected it if it could, or halted the system otherwise.

Thank you very much for the quick reply!

I should have mentioned that the system IPMI log shows no errors. Not sure if it would have picked up on a PSU problem, but mentioning in case it is relevant.

HBA is an IBM 1015 in IT mode. If it is an HBA issue, are there any diagnostics that I can do or logfiles I should check without bringing FreeNAS down? Unfortunately I am not in the same physical location as the server.

rs225 · Jan 22, 2018

bh1 said:
I guess also consistent with Bidule0hm's suggestion that it might be an HBA problem?

Yes.

bh1 said:
If that is what is going on, should I disable scrubs for the time being? I worry that it will find what it thinks is an error, and create more problems by trying to fix it.

You could set them on an infrequent basis, but there really isn't much risk there.

bh1 · Jan 22, 2018

rs225 said:
Yes.

You could set them on an infrequent basis, but there really isn't much risk there.

OK, great, thank you again. So basically I should just replace the HBA ASAP, but there is nothing to do until then?

Hope you don't mind if I ask one more question -- do I need to do anything special to replace the HBA? Or just shut down, swap cards and restart? Sorry, I know it's sort of a dumb question, but I googled it and didn't see an answer and I don't want to mess this up...

Bidule0hm · Jan 22, 2018

Well, unless you have a spare HBA or you can connect the disks (even some of them) directly to the MB, I don't know how to rule out the HBA.

FreeNAS doesn't care if you swap the HBA or change the disk order as long as it sees all the disks ;)
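
After a swap, a quick sanity check is to confirm that every gptid member from the old "zpool status" output is still listed and ONLINE. Below is a minimal Python sketch of that check, not an official FreeNAS tool: the function names member_states and problems are hypothetical, and the AFTER_SWAP text is invented sample output for illustration only.

```python
import re

# A pool member row: a gptid label followed by its state.
MEMBER = re.compile(r"^\s*(gptid/[0-9a-f-]+)\s+(\S+)")

def member_states(status_text):
    """Map each gptid member found in the status text to its state."""
    states = {}
    for line in status_text.splitlines():
        m = MEMBER.match(line)
        if m:
            states[m.group(1)] = m.group(2)
    return states

def problems(status_text, expected):
    """Return {gptid: state} for expected members that are missing or not ONLINE."""
    states = member_states(status_text)
    issues = {}
    for name in expected:
        state = states.get(name, "MISSING")
        if state != "ONLINE":
            issues[name] = state
    return issues

# Three of the member labels from the pool in this thread.
EXPECTED = [
    "gptid/ef32b9b5-9173-11e5-8ee5-0025906b9b9a",
    "gptid/eff4c195-9173-11e5-8ee5-0025906b9b9a",
    "gptid/f0b1e9d0-9173-11e5-8ee5-0025906b9b9a",
]

# Invented post-swap output: one disk degraded, one not detected at all.
AFTER_SWAP = """
    gptid/ef32b9b5-9173-11e5-8ee5-0025906b9b9a  ONLINE    0 0 0
    gptid/eff4c195-9173-11e5-8ee5-0025906b9b9a  DEGRADED  0 0 0
"""

if __name__ == "__main__":
    for name, state in problems(AFTER_SWAP, EXPECTED).items():
        print(f"{name}: {state}")
```

An empty result means every expected disk was found and is ONLINE, regardless of which port it ended up on.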
