ECC vs non-ECC RAM and ZFS

Status
Not open for further replies.

TXAG26

Patron
Joined
Sep 20, 2013
Messages
310
I actually JUST posted the part below in another thread before reading the last couple of pages of this one and felt it needed repeating since some folks just don't get the importance of running the proper hardware, especially ECC ram. This is real-world how stuff could have gone south real fast with my ZFS RaidZ2 pool had it not been for the hard-knock lessons learned by others. Don't even bother running ZFS if you aren't willing to run ECC ram.

http://forums.freenas.org/index.php...lesector-offline-uncorrectable-sectors.22131/

------------------------
Wow, talk about a rough week for hardware failures! After swapping out the HDD that was throwing all the errors and resilvering the RaidZ2 pool, I shut the SM X10SL7-F down and did a cold reboot. Wouldn't you know, the darn thing wouldn't come back up! The screen would stay blank and a steady 4 bios beeps would sound. One each second, for four seconds. Found out this was a fatal memory error as not even the bios screen would come up.

Luckily, I had a spare 16GB kit that I was about to install in my Supermicro X10SAE workstation. Got it installed and the server booted right back up! I tried the defective ram on a different X10 board and had the same 4 beeps, so I'm pretty sure the ram has failed. I've never had ram fail before, and it was just dumb luck on the timing to have a spare set handy!

Once the server was back online, I fired up IPMI View and noticed some Correctable ECC events in the IPMI System Event Log! Scarily, these dates/times correlate exactly to when the monthly zpool scrub fires off! Yikes! I have since completed a new scrub, so would it be safe to say I likely dodged any possible data corruption? Thank goodness for ECC ram. Anyone who runs ZFS without it is asking for serious trouble!

Code:
202,System Event,06/24/2014 06:45:22 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
203,System Event,06/24/2014 06:48:19 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
204,System Event,06/24/2014 07:16:10 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
205,System Event,06/24/2014 07:16:45 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
206,System Event,06/24/2014 07:24:50 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
207,System Event,06/24/2014 07:25:39 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
208,System Event,06/24/2014 07:31:36 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
209,System Event,06/24/2014 07:33:58 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
210,System Event,06/24/2014 07:33:58 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
211,System Event,06/24/2014 07:34:43 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
212,System Event,06/24/2014 07:34:58 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
213,System Event,06/24/2014 07:49:10 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
214,System Event,06/24/2014 08:19:54 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
215,System Event,06/24/2014 08:22:32 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
216,System Event,06/24/2014 09:32:00 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
217,System Event,06/24/2014 09:50:07 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
218,System Event,06/24/2014 10:09:11 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
219,System Event,06/24/2014 18:19:01 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
220,System Event,06/24/2014 18:19:02 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
221,System Event,06/25/2014 00:25:37 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
222,System Event,06/25/2014 00:25:37 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
223,System Event,06/25/2014 01:14:29 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
224,System Event,06/25/2014 05:15:03 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
225,System Event,06/25/2014 05:15:04 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
226,System Event,06/25/2014 07:01:47 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
227,System Event,06/25/2014 08:06:15 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
228,System Event,06/25/2014 18:46:34 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
229,System Event,06/26/2014 14:42:42 Thu,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
230,System Event,06/26/2014 14:42:43 Thu,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
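
For anyone who wants to tally an exported log like the one above instead of eyeballing it, here is a minimal Python sketch. It assumes the comma-separated layout shown in this post (event ID, type, timestamp with weekday, sensor, an empty field, description); other IPMI tools may export a different format. The names `SEL_SAMPLE`, `parse_sel_line`, and `summarize` are made up for illustration and are not part of IPMI View.

```python
import csv
from collections import Counter
from datetime import datetime

# A few rows copied from the log above; paste the full export here instead.
SEL_SAMPLE = """\
202,System Event,06/24/2014 06:45:22 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
203,System Event,06/24/2014 06:48:19 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
221,System Event,06/25/2014 00:25:37 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
229,System Event,06/26/2014 14:42:42 Thu,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
"""


def parse_sel_line(line):
    """Split one exported row into (event_id, timestamp, description)."""
    fields = next(csv.reader([line]))
    event_id = int(fields[0])
    # The timestamp looks like '06/24/2014 06:45:22 Tue'; drop the weekday.
    stamp = datetime.strptime(fields[2].rsplit(" ", 1)[0], "%m/%d/%Y %H:%M:%S")
    return event_id, stamp, fields[5]


def summarize(text):
    """Count correctable ECC events per calendar day and per DIMM slot."""
    per_day = Counter()
    per_dimm = Counter()
    for line in text.splitlines():
        # Match on '= Correctable ECC' so other event types are skipped.
        if "= Correctable ECC" not in line:
            continue
        _, stamp, description = parse_sel_line(line)
        per_day[stamp.date().isoformat()] += 1
        # The text after '@' names the slot, e.g. 'DIMMB2(CPU1)'.
        per_dimm[description.split("@", 1)[1]] += 1
    return per_day, per_dimm


if __name__ == "__main__":
    days, dimms = summarize(SEL_SAMPLE)
    for day, count in sorted(days.items()):
        print(day, count)
    for slot, count in dimms.items():
        print(slot, count)
```

Fed the full 29-entry log above, the per-day tally comes out to 19 events on 06/24, 8 on 06/25, and 2 on 06/26, all on DIMMB2.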
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
Indeed your ass was saved... Since they're all correctable, they were corrected. The system is supposed to halt (dunno at what level - if it's a panic or if the CPU is just halted) at uncorrectable errors, to protect data.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
My understanding is that the memory controller issues a halt command to the CPU when the error is uncorrectable. But as I can't find anyone that can answer these kinds of questions I really can't validate this to 100% certainty. :(
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
If I ever get access to some spare ECC hardware and a high-energy gamma-ray source, I'll run an experiment and tell you how it goes. Sounds like the craziest and fastest way to generate ECC errors.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Oh, I've done some things like that. But what actually initiates the halt command is what's in question (I think anyway).
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
Well, the question was at what level. If it's the CPU/memory controller, the system should just hang, if it's the OS, it should panic.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The OS definitely doesn't panic. The halt is definitely a hardware thing. But whether it's the CPU, the memory controller, or some BIOS feature that must be enabled is what I was curious about.
 

TXAG26

Patron
Joined
Sep 20, 2013
Messages
310
There is an option in most BIOSes called "Halt on Error", and most vendors have it enabled by default. Over the years, that is something I've always manually changed to "disabled", especially with remote systems. However, I never really understood what it actually did, so I want to look into that feature in more depth and might re-enable it, at least on FreeNAS systems.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, that should *always* be enabled, even more so in production systems. It's far better to halt the machine than trash your OS, your applications, files, and almost always your file system. The box will certainly eventually be unviable, so you can either crash it at the first sign of trouble or wait until it's unusable and then find out your last 2 months of backups are nonviable too.

Most stuff I've seen lately doesn't give the option anymore. They halt and there's nothing you can do about it.
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
So I've noticed something since switching from non-ECC to ECC - I see the physical memory utilization climb up to near full capacity when doing tasks such as extracting a large (10GB) rar file. But once the task is complete, the memory utilization remains at its last peak and doesn't seem to flush until a reboot. I didn't see this in non-ECC RAM.
  1. Could someone explain what is happening, please?
  2. Is it okay for the utilization to sit at near capacity?
  3. I would think that would be cause for concern for future processes.
  4. Is this normal?
  5. Does it eventually flush on its own?
 

panz

Guru
Joined
May 24, 2013
Messages
556
I have 32 GB of ECC RAM and it's always at max utilization :)
 

krikboh

Patron
Joined
Sep 21, 2013
Messages
209
Did you increase the amount of RAM when you switched to ECC? FreeNAS does not perform its optimal caching to RAM when short on memory. You may be seeing proper behavior now if you increased the memory.
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
Nope, went from 16GB non-ECC to 16GB ECC
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I can guarantee you that switching to ECC RAM has no bearing on the way ZFS works. In fact, there's no sure-fire way to prove whether you are running ECC or not! Each Intel chip does things slightly differently, and AMD is a f'n mess to prove (if you even can).

So yeah, unless you can provide video proof I'm gonna say you are missing a piece of the pie. ;)
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
Huh? I'm running a G3420 on a X10 mobo.
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
So I've noticed something since switching from non-ECC to ECC - I see the physical memory utilization climb up to near full capacity when doing tasks such as extracting a large (10GB) rar file. But once the task is complete, the memory utilization remains at its last peak and doesn't seem to flush until a reboot. I didn't see this in non-ECC RAM.
  1. Could someone explain what is happening, please?
  2. Is it okay for the utilization to sit at near capacity?
  3. I would think that would be cause for concern for future processes.
  4. Is this normal?
  5. Does it eventually flush on its own?

Yes, that is normal. FreeNAS is designed to use as much RAM as you give it.
 

fracai

Guru
Joined
Aug 22, 2012
Messages
1,212
If you really want to flush RAM, you can dump a bunch from /dev/urandom to a file and then delete the file. There's no purpose for this, but it should free some memory.

Snapshots may or may not matter here.
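
A rough sketch of that idea in Python, for anyone curious. It writes random bytes to a scratch file and then deletes it. `churn_cache` is a made-up helper name, `os.urandom` stands in for reading /dev/urandom directly, and whether this actually frees any memory depends on how the cache behaves on your system, so treat it as an experiment rather than a fix.

```python
import os
import tempfile


def churn_cache(directory, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes of random data to a scratch file, then delete it.

    Returns the number of bytes written. The scratch file is removed even
    if the write fails partway through.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix="cache-churn-")
    written = 0
    try:
        with os.fdopen(fd, "wb") as scratch:
            while written < total_bytes:
                size = min(chunk_bytes, total_bytes - written)
                # os.urandom draws from the OS random source.
                scratch.write(os.urandom(size))
                written += size
    finally:
        os.remove(path)
    return written


if __name__ == "__main__":
    # Hypothetical target: the current directory and 64 MiB of data.
    print(churn_cache(".", 64 * (1 << 20)))
```

Point `directory` at a dataset on the pool in question and pick a size that fits the free space there.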
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
Yeah, but like fracai said, that would be pointless. After taking up what it needs to run, it just uses the rest as cache space, and if that memory is eventually needed, it gets released on a longest-since-used-first basis to clear space for system processes.
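
To make the "longest since used goes first" idea concrete, here is a toy sketch in Python. It is not how ZFS actually does it: the ARC is an adaptive replacement cache that weighs both recency and frequency, and this sketch only tracks recency. `LRUCache`, `read`, and `load` are made-up names purely for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache: fills up to capacity, then evicts the entry unused longest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def read(self, key, load):
        """Return the cached value for key, calling load(key) on a miss."""
        if key in self.blocks:
            # Cache hit: mark this block as the most recently used.
            self.blocks.move_to_end(key)
            return self.blocks[key]
        value = load(key)
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            # Cache is full: drop the block that has gone unused the longest.
            self.blocks.popitem(last=False)
        return value


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    for name in ["a", "b", "a", "c"]:
        cache.read(name, load=str.upper)
    print(list(cache.blocks))
```

The point of the sketch is that a full cache is not a problem: nothing is evicted until something else needs the room, which is why utilization sits near the top after a big job.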
 

IonutZ

Contributor
Joined
Aug 17, 2014
Messages
108
Noob question, so in reading the op I've come to the conclusion that in order to run ZFS, you need to have ECC Ram. I have an i5 on an ASUS mobo with 16GB non ecc ddr3@1600. I'm guessing even if I bought ECC ram for it, the ECC function would be unusable. So then, I would literally have to replace mobo,cpu,ram in order to support the "Data Integrity" function of my fileserver... and there isn't any way around it. Am I correct?
 

Pharfar

Dabbler
Joined
Jan 6, 2013
Messages
46
IonutZ, you are correct. You need a CPU and a mobo that supports ECC.
 