Possible memory corruption despite ECC?

Status
Not open for further replies.

bh1

Cadet
Joined
Nov 28, 2015
Messages
4
A routine scrub last night generated the warning "One or more devices has experienced an error resulting in data corruption. Applications may be affected."

Drives in the pool are online and smartctl shows no errors on any of the drives. Output of "zpool status -v" is below, and as you can see there are several checksum errors but no read or write errors.

A few questions:
  • This seems like it has to be a memory corruption error, right? But I'm using ECC RAM ... is that even possible? I did run memtest when the server was installed, but that was a few years ago.
  • Can I trust zpool when it says that the data corruption is confined to "volume1/testzvol"? It is a 30TB array, and I'd really prefer not to restore the whole thing from backup if I don't have to.
  • What the heck do I do now?

Running FreeNAS-11.0-U1 (aa82cc58d) on a SuperMicro SC846E16-R1200B chassis with a BPN-SAS2-846EL1 backplane, an X8DTE-F motherboard, 48GB of RAM (12x 4GB PC3-10600R), and 2x Intel Xeon E5640 quad-core CPUs.

Code:
  pool: volume1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 148K in 29h21m with 2 errors on Mon Jan 22 05:21:35 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         ONLINE       0     0     2
          raidz3-0                                      ONLINE       0     0     4
            gptid/ef32b9b5-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/eff4c195-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f0b1e9d0-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     3
            gptid/f1765d4c-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     7
            gptid/f23b41b9-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f2fd6619-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f3c44866-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     6
            gptid/f4861cf1-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     8
            gptid/f55a468b-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f614933a-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     0
            gptid/f6d98ade-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     5
            gptid/f79ba1a6-9173-11e5-8ee5-0025906b9b9a  ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        volume1/testzvol:<0x1>
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
That is an unusual pattern. I will make a wild guess that it is caused by transient read corruption. The data is fine on disk, but it gets to memory with problems. A second scrub will probably turn up a different set of results, which would confirm you have inconsistent reads.

Have your past scrubs been turning up anything?
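
If it helps, the comparison between two scrubs can be scripted rather than eyeballed. Below is a rough Python sketch, not a tested tool: it assumes you saved the "zpool status -v" text after each scrub, that the error counters were reset between the two runs (for example with "zpool clear"), and that the counts are printed as plain integers rather than abbreviated values. The function names are made up for illustration.

```python
import re

# One config row: NAME STATE READ WRITE CKSUM.
# Assumption: counts are plain integers (abbreviated counts such as "1.2K" are not handled).
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def cksum_counts(status_text):
    """Return {device name: CKSUM count} parsed from saved zpool status text."""
    counts = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(1)] = int(match.group(5))
    return counts


def changed_devices(first_scrub, second_scrub):
    """Return {device: (count in first run, count in second run)} where they differ."""
    first = cksum_counts(first_scrub)
    second = cksum_counts(second_scrub)
    changed = {}
    for name in sorted(set(first) | set(second)):
        pair = (first.get(name, 0), second.get(name, 0))
        if pair[0] != pair[1]:
            changed[name] = pair
    return changed
```

If the errors land on a different set of disks each time, that points toward inconsistent reads rather than bad data on specific disks.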
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Looks more like an HBA problem or PSU problem.

If it were a RAM error, ECC would either have corrected it or, if it couldn't, halted the system.
 

bh1

Cadet
Joined
Nov 28, 2015
Messages
4
rs225 said:
"That is an unusual pattern. I will make a wild guess that it is caused by transient read corruption. The data is fine on disk, but it gets to memory with problems. A second scrub will probably turn up a different set of results, which would confirm you have inconsistent reads.

Have your past scrubs been turning up anything?"

Thank you very much for the quick reply!

No, this is the first scrub that has ever generated any errors or warnings.

That theory makes sense ... I guess also consistent with Bidule0hm's suggestion that it might be an HBA problem?

If that is what is going on, should I disable scrubs for the time being? I worry that it will find what it thinks is an error, and create more problems by trying to fix it.
 

bh1

Cadet
Joined
Nov 28, 2015
Messages
4
Bidule0hm said:
"Looks more like an HBA problem or PSU problem.

If it were a RAM error, ECC would either have corrected it or, if it couldn't, halted the system."

Thank you very much for the quick reply!

I should have mentioned that the system IPMI log shows no errors. Not sure if it would have picked up on a PSU problem, but mentioning in case it is relevant.

HBA is an IBM 1015 in IT mode. If it is an HBA issue, are there any diagnostics that I can do or logfiles I should check without bringing FreeNAS down? Unfortunately I am not in the same physical location as the server.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
bh1 said:
"I guess also consistent with Bidule0hm's suggestion that it might be an HBA problem?"

Yes.

bh1 said:
"If that is what is going on, should I disable scrubs for the time being? I worry that it will find what it thinks is an error, and create more problems by trying to fix it."

You could set them on an infrequent basis, but there really isn't much risk there.
 

bh1

Cadet
Joined
Nov 28, 2015
Messages
4
rs225 said:
"Yes.

You could set them on an infrequent basis, but there really isn't much risk there."

OK, great, thank you again. So basically I should just replace the HBA ASAP, but there is nothing to do until then?

Hope you don't mind if I ask one more question -- do I need to do anything special to replace the HBA? Or just shut down, swap cards and restart? Sorry, I know it's sort of a dumb question, but I googled it and didn't see an answer and I don't want to mess this up...
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Well, unless you have a spare HBA or you can connect the disks (even some of them) directly to the MB, I don't know how to rule out the HBA.

FreeNAS doesn't care if you swap the HBA or change the disk order as long as it sees all the disks ;)
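
For a remote swap, one way to double-check that every disk came back is to compare the pool members before and after. This is only a sketch under assumptions: it expects saved "zpool status" text from before and after the swap, it only looks at members named by gptid as in the output earlier in this thread, and the function names are invented for the example.

```python
import re

# A pool member row as shown in this thread: "gptid/<uuid>  STATE  ...".
MEMBER = re.compile(r"^\s*(gptid/\S+)\s+(\S+)")


def member_states(status_text):
    """Return {gptid name: state} for every gptid member row in the text."""
    states = {}
    for line in status_text.splitlines():
        match = MEMBER.match(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states


def missing_or_not_online(before_swap, after_swap):
    """List members seen before the swap that are absent or not ONLINE afterwards."""
    old = member_states(before_swap)
    new = member_states(after_swap)
    return sorted(name for name in old if new.get(name) != "ONLINE")
```

An empty list means every member that was there before the swap is back and ONLINE; anything listed is worth checking cabling or the backplane for.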
 