ECC vs non-ECC RAM and ZFS

Status
Not open for further replies.

nullfork

Cadet
Joined
Feb 25, 2014
Messages
5
Fair enough guys, all good points. I think my major failing was not realising that ZFS will detect a hash mismatch on a READ and then WRITE to fix it (or at least think it's fixed it). Taking that into account I can see how my proposal is flawed.

To be honest the end game with my suggestion was twofold: firstly to get a warning that you need to replace your RAM, and secondly to identify which particular file(s) are broken. I suppose if there was a way to just READ a file from disk without having ZFS do a hash check/fix-up, then at least you could actually REALLY compare some hashes without ZFS stomping over the file system. It's also a shame that doing a complete RAM test is such an invasive procedure. If it could be done in userland alongside a running NFS then that would be beneficial also.

FYI I'm not bothered about saving cash, I was just thinking about a way to give the thousands of non-ECC users out there an early warning. Personally I will be investing in ECC as soon as I've decided on the motherboard.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Fair enough guys, all good points. I think my major failing was not realising that ZFS will detect a hash mismatch on a READ and then WRITE to fix it (or at least think it's fixed it). Taking that into account I can see how my proposal is flawed.
Good. Too many people really just don't "get" how ECC RAM is better. And the price is a bit higher, so if they can't perceive the difference, then they won't spend the extra money.

To be honest the end game with my suggestion was twofold: firstly to get a warning that you need to replace your RAM, and secondly to identify which particular file(s) are broken. I suppose if there was a way to just READ a file from disk without having ZFS do a hash check/fix-up, then at least you could actually REALLY compare some hashes without ZFS stomping over the file system. It's also a shame that doing a complete RAM test is such an invasive procedure. If it could be done in userland alongside a running NFS then that would be beneficial also.
In fact, even if you mount a pool as read-only, ZFS will still fix problems it finds, because that's considered part of its self-healing and not a condition where you are adding more transactions to the pool. I actually learned this first-hand. ;)
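To make that read-triggers-write behaviour concrete, here's a toy Python model of a checksummed read from a two-way mirror. This is my own sketch, not ZFS's actual code; `read_block` is a hypothetical helper, and real ZFS repair logic is far more involved.

```python
import hashlib

def read_block(copies):
    """Toy model of a checksummed read from a two-way mirror.

    `copies` is a mutable list of (data, stored_checksum) pairs, one
    per mirror side. Illustration only, not how ZFS is implemented.
    """
    good = None
    for data, stored in copies:
        if hashlib.sha256(data).hexdigest() == stored:
            good = data
            break
    if good is None:
        raise IOError("unrecoverable: no copy matches its checksum")

    # "Self-healing": any copy that failed its checksum gets rewritten
    # from the good copy, even though the caller only asked to READ.
    # With bad RAM, "good" may itself have been corrupted in memory,
    # and the repair then stomps over data that was fine on disk.
    healed = []
    for i, (data, stored) in enumerate(copies):
        if hashlib.sha256(data).hexdigest() != stored:
            copies[i] = (good, stored)
            healed.append(i)
    return good, healed

payload = b"important data"
ck = hashlib.sha256(payload).hexdigest()
copies = [(b"important dat4", ck), (payload, ck)]  # side 0 has bitrot
data, healed = read_block(copies)
# data == payload; side 0 was rewritten (healed == [0]) by a mere read
```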


FYI I'm not bothered about saving cash, I was just thinking about a way to give the thousands of non-ECC users out there an early warning. Personally I will be investing in ECC as soon as I've decided on the motherboard.
I think that one of the reasons why it is so hard to argue for the extra money is that too many people go with non-ECC RAM in their desktops, and when things go bad they blame Windows. Everyone blames Windows for everything. On more than one occasion I've proved to friends (some of whom have degrees or certifications in this stuff) that the real culprit was the hardware, not Windows.

Every time I've used a machine with bad RAM, the OS ended up trashed in the process. Why you'd expect less than that with ZFS is somewhat beyond me. Still others have never had a stick of RAM fail. I personally saw my first failed stick in 2005, then a second one in 2008, and then 2 more sticks in 2012. Up until 2005 I had a low tolerance for people claiming they'd seen large numbers (sometimes double digits) of failed RAM sticks. In 2005 I worked in the military at a command that had over 1000 computers. We had 1 stick of RAM fail in those 1000 machines in the 3 years I was there. Finding the first stick of RAM that I could prove was *actually* bad was like finding a pot of gold. For the previous 10 years I had never found one.

Even recently a friend mailed me a stick of bad RAM so I could do some ZFS experiments with it. Well, it got here and I wanted to see how bad the RAM was. I did almost 20 passes with memtest and never had an error. Just from my experience, I think that failed RAM is far, far less frequent than many people realize. RAM tests can fail for other reasons too, and I think too many people don't realize this very simple fact. I really have to wonder how many sticks of RAM that are RMA'd really are bad. If I had to guess based on what I've seen, I'd guess less than 25%, and probably less than 10%.
 

nullfork

Cadet
Joined
Feb 25, 2014
Messages
5
Even recently a friend mailed me a stick of bad RAM so I could do some ZFS experiments with it. Well, it got here and I wanted to see how bad the RAM was. I did almost 20 passes with memtest and never had an error. Just from my experience, I think that failed RAM is far, far less frequent than many people realize. RAM tests can fail for other reasons too, and I think too many people don't realize this very simple fact. I really have to wonder how many sticks of RAM that are RMA'd really are bad. If I had to guess based on what I've seen, I'd guess less than 25%, and probably less than 10%.


Well, I've been running a vanilla FreeBSD fileserver with mirrored disks (geom) for 10 years, and there has never been any issue with memory. In fact the OS never panicked or anything like that, and the data hasn't corrupted (as far as I'm aware ;-)). So I agree with you when you say that RAM failure isn't a frequent occurrence. I'm now moving to ZFS (primarily for data integrity), so it seems ironic that I'm moving to a safer software solution but at the same time that requires a safer hardware solution! But when you consider how much data ZFS is piping in and out of RAM, you realise that you need to get "better" RAM. It seems to me that ZFS is concentrating hard on protecting your data, while you should concentrate hard on protecting your memory (so ZFS can protect your data) :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Yeah, RAM failure is a small percentage, but with ZFS you just don't wanna be the percentage, heh.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, I've been running a vanilla FreeBSD fileserver with mirrored disks (geom) for 10 years, and there has never been any issue with memory. In fact the OS never panicked or anything like that, and the data hasn't corrupted (as far as I'm aware ;-)). So I agree with you when you say that RAM failure isn't a frequent occurrence. I'm now moving to ZFS (primarily for data integrity), so it seems ironic that I'm moving to a safer software solution but at the same time that requires a safer hardware solution! But when you consider how much data ZFS is piping in and out of RAM, you realise that you need to get "better" RAM. It seems to me that ZFS is concentrating hard on protecting your data, while you should concentrate hard on protecting your memory (so ZFS can protect your data) :)

That is an excellent piece of wisdom. You see the whole picture. :)
 
Joined
Feb 25, 2014
Messages
1
Lots of great info in this thread, and a little scary for anyone who's set up ZFS without ECC. Thanks cyberjock for the writeup.

One clarification about backups that was confusing for me as I was reading: as long as you have multiple rsync or other offsite backups (such as --link-dest with rsync), your backups will be OK up to the point where your memory fails. So let's say you start getting scrub errors and you detect the failure 2 weeks after the fact; you would at most lose 2 weeks of data. This is a great example of why you want to snapshot backups as well as your live data.
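The snapshot idea behind --link-dest can be sketched in a few lines of Python. `snapshot_backup` below is a hypothetical toy mimicking what rsync's --link-dest does (hardlink unchanged files into the new snapshot, copy changed ones), not a real backup tool; it handles a single flat directory and no deletions or metadata.

```python
import filecmp
import os
import shutil

def snapshot_backup(src, dest_root, name, link_dest=None):
    """Toy hardlink snapshot in the spirit of `rsync --link-dest`.

    Files unchanged since the `link_dest` snapshot are hardlinked
    (costing no extra space); changed or new files are copied fresh.
    """
    dest = os.path.join(dest_root, name)
    os.makedirs(dest)
    for fname in os.listdir(src):
        s = os.path.join(src, fname)
        d = os.path.join(dest, fname)
        prev = os.path.join(link_dest, fname) if link_dest else None
        if prev and os.path.exists(prev) and filecmp.cmp(s, prev, shallow=False):
            os.link(prev, d)    # unchanged: share the old snapshot's inode
        else:
            shutil.copy2(s, d)  # changed/new: store a fresh copy
    return dest
```

Because each dated snapshot only spends space on what changed, corruption that starts 2 weeks ago leaves every earlier snapshot's copies untouched.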

A question though: I'm currently running ZFS without ECC. Luckily I've had no scrub errors at all, so I know my data is currently OK. The thing is, I really like the ZFS toolset (and no RAID5 write hole), and migrating filesystems would be a big task. I saw that there's the option to set checksum=off, which applies to newly written data.
  1. If I turn off checksums and scrubs, is ZFS equivalent to any other non-checksumming filesystem as far as data integrity goes? Or are checksums assumed to be in place, so that I'd be breaking other parts of ZFS in the process?
  2. Will turning off checksums stop any attempts to fix existing data that has checksums on disk?
  3. Can RAIDZ still work and handle disk failures without checksums?
Of course, since there's no fsck for ZFS, I'd be at risk if there were unexpected restarts, but I'm thinking of what I can do this instant as a best mitigation.

As an amusing aside, my system actually supports ECC, but it's an old 2005 era DDR2 motherboard, so ECC RAM is stupidly expensive to get now. I guess it's a good excuse to build something new!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
First, just because a scrub finds no corruption is NOT an indicator that "everything is okay". That's one reason why this is so disastrous.

Checksums ARE how ZFS finds corruption. You turn that off and you lose all corruption detection and repair features. Basically, if you are going to do this, you should just get rid of ZFS right now because you just neutered it. I'm not even sure I can think up all of the potential risks with doing this because it is just silly to even consider this idea.

There is no cheap mitigation for this. That's why I've made this thread such a big deal and done the long writeup. Unless you have backups from before the pool corruption existed that haven't been corrupted themselves, you are screwed. Unfortunately, this is like saying you couldn't spend the money for ECC RAM but have no problem spending $5000 for an amazing backup routine. If you had the money for the necessary backup routine you'd go with ECC RAM to begin with. ;)
 

R.G.

Explorer
Joined
Sep 11, 2011
Messages
96
First, just because a scrub finds no corruption is NOT an indicator that "everything is okay". That's one reason why this is so disastrous.
Let me add my vote to that. This is really simple.

Non-ECC memory allows random bit errors in memory. If this corrupted data is written out to disk, all the disk/filesystem work in the world will only perfectly preserve the corrupted data that is written to it. There is really no other way.
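A few lines of Python illustrate why. This is my own sketch (the `flip_bit` helper is hypothetical): the checksum is computed over bytes that were already flipped in memory, so every later integrity check passes and the storage stack faithfully preserves the garbage it was handed.

```python
import hashlib

def flip_bit(data, bit):
    """Simulate a single-bit memory error at the given bit offset."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

original = b"the quick brown fox"
in_ram = flip_bit(original, 10)  # corrupted in RAM *before* write-out

# What gets written to disk: the corrupt bytes plus a checksum
# computed over those same corrupt bytes.
stored_data = in_ram
stored_checksum = hashlib.sha256(in_ram).hexdigest()

# A later read-time integrity check passes even though the data is
# wrong: the filesystem perfectly preserved corrupted input.
assert hashlib.sha256(stored_data).hexdigest() == stored_checksum
assert stored_data != original
```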

The easy availability and low marginal cost of ECC memory today means there is really no excuse for not using it for data you want to protect.

The marginal cost of using a different processor and motherboard pales against the subtle loss of your data.
 

Skar78

Dabbler
Joined
Mar 18, 2014
Messages
15
So, after completely trusting this thread and aiming for an ECC board in ITX format, while already sitting on an i3-4130T, I went absolutely nuts trying to find a source to buy either an Asus P9D-I or a Gigabyte GA-6LISL, or, as an emergency solution, the ASRock E3C226D2I. (Apparently most of them are EoL already.)

µATX seems to be the way to go; unfortunately I kind of want to go with a Silverstone DS380 or a Lian Li Q35B, which claim to be targeted at NAS builds. But wtf, it's so difficult to buy a board.

The Supermicro X9 series I cannot consider as I have the i3 already... kinda frustrating.
 

DJABE

Contributor
Joined
Jan 28, 2014
Messages
154
Welcome to the club Skar78!
I wanted to order an ASRock E3C224D2L, but that's near impossible anywhere in Europe.

It's a shame, but you really can't find a 'small form factor' server board!
 

Skar78

Dabbler
Joined
Mar 18, 2014
Messages
15
Welcome to the club Skar78!
I wanted to order an ASRock E3C224D2L, but that's near impossible anywhere in Europe.

It's a shame, but you really can't find a 'small form factor' server board!

I think the ASRock I could manage to get (I know people who work there). It is, however, not my first choice, as I still associate "value" rather than "reliability" with the brand.
 

DJABE

Contributor
Joined
Jan 28, 2014
Messages
154
Those are "server-grade" mainboards. OFC shit happens, so if you are not lucky you might get a problematic MB. But I would get the ASRock mobo if I could...
 

Skar78

Dabbler
Joined
Mar 18, 2014
Messages
15
Those are "server-grade" mainboards. OFC shit happens, so if you are not lucky you might get a problematic MB. But I would get the ASRock mobo if I could...

I went with the Asus P9D-I; it will only take forever to travel around the globe to reach me. Most of those MBs seem EoL.
 

panz

Guru
Joined
May 24, 2013
Messages
556
I went with the Asus P9D-I; it will only take forever to travel around the globe to reach me. Most of those MBs seem EoL.

Only my opinion, but I would not buy a mobo that "will only take forever to travel around the globe to reach me", just because, in the event of a motherboard failure, you'll have to wait a long time to get a replacement.
 

Skar78

Dabbler
Joined
Mar 18, 2014
Messages
15
Only my opinion, but I would not buy a mobo that "will only take forever to travel around the globe to reach me", just because, in the event of a motherboard failure, you'll have to wait a long time to get a replacement.

Totally agree, tried to avoid that, but in the end still wanted ECC and ITX.
 

Bruno Salvador

Dabbler
Joined
Mar 27, 2014
Messages
25
Very good article; unfortunately I read it too late, as I had already bought the parts.

The main usage is home storage, plus some important data on mirrored HDDs.

CPU: Intel I3-4130T 2.90 3 LGA 1150 Processor BX80646I34130T
MEM: Kingston HyperX Blu 8GB (1x8 GB Module) 1600MHz 240-pin DDR3 Non-ECC CL10 Desktop Memory KHX1600C10D3B1/8G
MB: ASUS H87I-PLUS LGA 1150 Intel H87 Mini ITX Motherboard
Case: fractal design Node 304

Only CPU has ECC capability.

Today I have the system running on ZFS. I will change it to UFS as soon as possible to avoid ZFS problems with non-ECC hardware.

I don't see a need for ZFS for now, so I'll keep this hardware.
If I have the opportunity to travel to the USA, then I can buy good hardware, but here in my country it's too expensive.

I'm loving FreeNAS, especially the jails, where I can install anything (webserver, Air Video streaming, ownCloud, etc.)
 

Joshua Parker Ruehlig

Hall of Famer
Joined
Dec 5, 2011
Messages
5,949
I'd still stick with ZFS; it still remediates bitrot and has datasets/snapshots. No need to go to UFS unless you don't have enough RAM, IMHO.
 

Joshua Parker Ruehlig

Hall of Famer
Joined
Dec 5, 2011
Messages
5,949
I have 8GB; the system usage is constant at ~6GB in ZFS.
That's ZFS using any leftover RAM (up to a percentage) for caching, and it's a good thing. Going UFS won't really benefit you, and it doesn't mean you're saved from issues from non-ECC RAM.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Going UFS won't really benefit you, and it doesn't mean you're saved from issues from non-ECC RAM.

UFS will not prevent bitrot, but it also will not cause further damage just from the RAM going bad and ZFS trying to "correct" nonexistent errors.
 