Non-ECC and ZFS Scrub?

ZFS Noob · Dec 7, 2013

How are you supposed to do backups when your file data will be read into RAM, trashed by your bad RAM, then handled by your given app(rsync, cp, ESXi tool, whatever)? The WHOLE problem is that ZFS will crap all over your data before any application can use it. Garbage in = garbage out.

I'll agree bad memory will screw you. But I'm going to stick with my assertion that an aggressive backup policy is a good one. At the very least it should mean that you can go back and recover your intact files as they existed before your RAM issue popped up.

I will also contend that multiple backup systems that overlap is a good one. My local backup server died a few days ago, but I'm still pulling regular backups off-site via another backup server, so I have a bit more flexibility in when I travel to the datacenter to do hardware work. If a machine needs to be restored it will be hours to get it back up rather than < 20 minutes, but the data should still be safe (knock on wood).

Of course, you're welcome to disagree. :)

cyberjock · Dec 7, 2013

I disagree only in the fact that most people keep only 1 "full" separate backup. For example, your FN server backs up to another FN server via snapshots.

Most people don't have a 3rd machine that sits off for 1 month, then the 2 backup servers switch. If you were to be in that situation, and as long as your RAM hadn't been bad since you last swapped backup servers, you'd be okay.

But by far and large 99%+ of the users that do backups have a live backup setup that they sync with their FN server. That means if the FN server goes bad, the live will inevitably be trashed too.

survive · Dec 8, 2013

Hi Tingo & ZFS Noob,

Take a look at this: http://cache-www.intel.com/cd/00/00/46/78/467819_467819.pdf

pages 6 & 7.

I assume we can all agree that these guys might now something about the subject.

-Will

jgreco · Dec 8, 2013

Knowltey said:
Now in the event of a ZFS scrub it is reading every bit of data on the entire pool. performing said checksum on every single bit of data on the entire pool, and as such "correcting" every bit of data on said pool. Voila! Now ever single bit of data on your entire pool is complete garbage.

And just for the sake of correctness I will point out that the idea here is correct, but you are much more likely to see a single bit failure corrupting a page of memory, This has more complex effects; in particular, since ZFS uses lots of system memory, only a small percentage of ZFS blocks might be corrupted. But more disturbingly, a block corrupted and written to pool remains corrupted even if your system memory is then fixed or even replaced with ECC: ZFS has no good tool to fix this sort of corruption (lack of fsck, etc). If you are LUCKY, it will be file data destroyed, but if it is metadata, it could doom your pool.

ZFS Noob · Dec 8, 2013

I assume we can all agree that these guys might now something about the subject.

Wow.

That's a whole lot more frequent than I expected...

jgreco · Dec 8, 2013

I am skeptical of those claims. If that was accurate, those of us with ECC would be logging events constantly. I suspect that they've taken overall statistics and restated the results in an ECC-promoting marketing move by a manufacturer looking to promote Xeon.

If they were to say that there's a 10% annual chance of a module developing faults leading to system errors, that would be more plausible.

Local experience is that RAM typically exhibits no errors once it passes DOA, burn-in and QA. But it can develop failures later. That's where ECC saves you. But even there, the ideal goal is to use ECC to correct and detect, alerting the admin to a failing module in need of replacement.

It isn't that different from how we treat hard drives. Most of the people here don't suggest deploying RAID0 stripes. Even RAIDZ1 is frowned upon. We understand the fallibility of drives but some are willing to endlessly debate whether or not ECC is necessary ...

ZFS Noob · Dec 8, 2013

Here's another look at memory error rates:

http://news.cnet.com/8301-30685_3-10370026-264.html

Previous research, such as some data from a 300-computer cluster, showed that memory modules had correctable error rates of 200 to 5,000 failures per billion hours of operation. Google, though, found the rate much higher: 25,000 to 75,000 failures per billion hours.

And another, from the Guild Wars folks:

He wrote a module (“OsStress”) which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second.

On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!

From here: http://www.codeofhonor.com/blog/whose-bug-is-this-anyway

survive · Dec 8, 2013

jgreco said:
I am skeptical of those claims. If that was accurate, those of us with ECC would be logging events constantly. I suspect that they've taken overall statistics and restated the results in an ECC-promoting marketing move by a manufacturer looking to promote Xeon.

I'm sure Intel cherry-picked the data they used in the presentation, but they do cite 3rd-party research into how often you get memory errors.

-Will

jgreco · Dec 9, 2013

Yes, but if 1 in 10 people has cancer and 1 in 10000 cells in their body is cancerous, it is correct to say approximately 1 in 100,000 human body cells are cancerous ... but misleading to use this to then claim that because your body has 100 trillion cells, you must therefore have a billion cancerous cells in your body.

I am perfectly fine with factual statements that are not misleading. This document, however, appears to be more of a marketing gimmick than a coldly factual discussion. For example, I am perfectly fine conceding that well-tested and burned-in name brand non-ECC memory is not that likely to cause problems, especially between months 2 and 24. A maybe 5% failure rate might be acceptable to some people. Me, I won't be doing it because I discard valueless data, and would like to retain the rest, and spend money to that end. Bad memory can ruin pools. ECC is a good (but not 100%) prophylactic. I'm happy to pay it, and to spend some time educating others as to the actual risks.

kzrussian · Dec 9, 2013

Question for cyberjock:
In one of your other (very informative) threads you made this comment

FreeNAS has options for those that want to get involved without necessarily requiring ECC RAM. UFS and a hardware RAID or even UFS' own raid functions are probably sufficient for most people. You can also choose to disable ZFS' checksums and parity and effectively strip out the resilience if you so desire.
http://forums.freenas.org/threads/ecc-vs-non-ecc-ram-and-zfs.15449/page-3

English is not my native language, so could you please explain what the underlined part of your quote means?
I was wondering if disabling checksums on ZFS and not performing any scrubs - will prevent completely distroying the pool?
I do realize that with a bad non-ecc RAM - I will be corrupting the newly written data untill I realize I have bad ram, but i'm thinking at least all my old data will be good.

I'm searching for NAS software that will allow me to configure software RAID10. I tried FreeNAS with UFS, but like I mentioned in my post above - UFS is not compatible with softRAID10. Now i'm trying to figure out if i should try ZFS and disable checksums. If that's not a solution, i guess i'll have to search for a linux based NAS and use ext4.

cyberjock · Dec 9, 2013

kzrussian said:
I was wondering if disabling checksums on ZFS and not performing any scrubs - will prevent completely distroying the pool?
I do realize that with a bad non-ecc RAM - I will be corrupting the newly written data untill I realize I have bad ram, but i'm thinking at least all my old data will be good.

That's what I was trying to say. But, there's other possible dangers still. If your zpool metadata gets trashed your pool might be unmountable anyway. So even though 90% of your data might be okay, bad metadata can ruin the pool enough to make none of the data accessible.

jgreco · Dec 9, 2013

It will just as happily corrupt data that you READ ... correcting it and helpfully writing it back to disk. Avoiding all writes AND READS (including scrubs) is the only truly safe option.

kzrussian · Dec 9, 2013

cyberjock, here is another quote from you:

Personally, I'd rank things in this order:

1. ZFS with ECC RAM
2. Other file system with hardware RAID with non-ECC RAM
3. ZFS with non-ECC.

Where in this ranking would you place "ZFS with non-ECC, disabled checksums, disabled scrub"?

cyberjock · Dec 9, 2013

Honestly, probably just a smidge above ZFS with non-ECC.

tingo · Dec 29, 2013

survive said:
Hi Tingo & ZFS Noob,

Take a look at this: http://cache-www.intel.com/cd/00/00/46/78/467819_467819.pdf

pages 6 & 7.

I assume we can all agree that these guys might now something about the subject.

And you are sure that they are not trying to sell us ECC RAM?

tingo · Dec 29, 2013

jgreco said:
No. But we've seen people who have lost pools, and who were told to test their memory, and found it to be faulty.

So connect the dots. Why are we discussing this, again?

I was just asking for evidence - it would have been nice to have some.
As some of you keep telling us; if this is happening in some numbers, there should be someone around with evidence to back up the claims.

Knowltey · Dec 29, 2013

tingo said:
And you are sure that they are not trying to sell us ECC RAM?

Intel doesn't make RAM, they have no interest in trying to sell it to us.

tingo said:
As some of you keep telling us; if this is happening in some numbers, there should be someone around with evidence to back up the claims.

In the sticky about ECC RAM there are a few links to threads made by individuals that lost their pools due to not using ECC RAM.

cyberjock · Dec 29, 2013

tingo said:
I was just asking for evidence - it would have been nice to have some.
As some of you keep telling us; if this is happening in some numbers, there should be someone around with evidence to back up the claims.

No, that's where you are completely wrong. That is also why this RAM argument will never go away. There is no way to definitively prove that if something goes wrong that it was RAM. You've got gigabytes of data per second moving around to all sorts of different devices in your computer. Can you definitely tell us how you'd prove where that bit got lost? Exactly.

So you either drink the kool-aid or you don't. You either take the risks or you don't. The choice is yours, and I give 0 sympathy points for people that use non-ECC when this is a well known problem.

jgreco · Dec 29, 2013

tingo said:
I was just asking for evidence - it would have been nice to have some.
As some of you keep telling us; if this is happening in some numbers, there should be someone around with evidence to back up the claims.

It isn't my job to collect evidence. I am not going to prove to you that it is best practice to wear your seat belt when driving in your car. I have no reason to do that work. I'm polite enough to tell you that it is best practice, but if you want to not believe it, that's fine, and I'll even cross my fingers that you never find yourself flying out through your windshield into the arse of the truck you accidentally rear-end some fine day.

You can use the forum search features and do your own research.

DrKK · Dec 29, 2013

The experts on FreeNAS (as published on this forum and elsewhere), ZFS (as published commercially), and BSD (as published on their forums), *unanimously* agree that ECC memory is necessary for this application if data integrity is to meet the standard ZFS presupposes. Period. And that's it. That is the alpha, and the omega.

An expert is not on the hook to establish his expertise on demand to a non-paying customer. Instead, his track record will speak (or not) to his expertise. Alpha. Omega.

I will join with jgreco in wishing the nay-sayers the best of luck, however.

Important Announcement for the TrueNAS Community.

Non-ECC and ZFS Scrub?

Contributor

Inactive Account

Behold the Wumpus

Resident Grinch

Contributor

Resident Grinch

Contributor

Behold the Wumpus

Resident Grinch

Cadet

Inactive Account

Resident Grinch

Cadet

Inactive Account

Contributor

Contributor

Patron

Inactive Account

Resident Grinch

FreeNAS Generalissimo

Similar threads