ECC vs non-ECC RAM and ZFS

ss4johnny · Dec 10, 2013

After switching to ECC RAM, I have been error free in copying over about 5TB worth of data. I've put the old components into an HTPC. Since I was in kind of a hurry in putting together the NAS and less of a hurry for the HTPC, I ran memtest86+ on the non-ECC RAM. It turns out that I got all kinds of errors on one of the sticks (duh, in retrospect). Returned that and new RAM has no errors. Anyway, the moral of the story is: don't be in a hurry putting together NAS with ZFS. Make damn sure you run memtest before bothering to set anything up, especially if you have non-ECC RAM. You'll save yourself time in the long run.

jyavenard · Dec 10, 2013

snicker said:
On the screen: "Fatal Trap 12".... rebooted and watched it happen after it tried to mount the filesystems.

you should raise a bug with FreeBSD... No matter the state of the pool, it should never cause a kernel panic when trying to mount it now that you've fixed the RAM.

I'm sure they will be very interested in tracking that one down for you.

If you could get an actual dump of the crash that would obviously be better, but even a photo of the screen will do.

Hopefully, you'll find a way to recover your data...

Richman · Dec 25, 2013

cyberjock said:
What you need to get across is that I'm far less concerned about random cosmic radiation flipping a bit once and bad RAM leading to stuck bits and/or random bit flips due to the breakdown of insulation and/or conductors in your RAM.

I think the sentence structure is off in that sentence above.

Otherwise, Not detailed enough for me. But then I always have to know the exact data paths and whats being done at every millisecond. :D I guess I understood this point just as well from all the other threads it was mentioned in with the exception of a couple additional things pointed out here. Thanks for writing it.

no_connection · Dec 27, 2013

So. Why not implement measures to minimize damage if a non ECC memory decides to go very bad, or if ECC does not halt system?
It won't guarantee data integrity but I see no reason why it could not detect high rate of errors(or a few) from more then one disk and then stop doing things like scrub or write/serve data.
That way you would not corrupt the entire pool or do huge amount of damage. Just get a bit of corrupt data.

Bit flips and other uncommon data errors would still result in corruption on a smaller scale so ECC is still a requirement.
But a system should not try to damage your data when ECC memory is not used.

cyberjock · Dec 27, 2013

no_connection said:
So. Why not implement measures to minimize damage if a non ECC memory decides to go very bad, or if ECC does not halt system?
It won't guarantee data integrity but I see no reason why it could not detect high rate of errors(or a few) from more then one disk and then stop doing things like scrub or write/serve data.
That way you would not corrupt the entire pool or do huge amount of damage. Just get a bit of corrupt data.

Bit flips and other uncommon data errors would still result in corruption on a smaller scale so ECC is still a requirement.
But a system should not try to damage your data when ECC memory is not used.

There's a couple of things I want to remind you of:

There is no way to identify bad RAM except to do a RAM test. This is like asking an alzheimer's patient what they forgot. They can't tell you because they don't know! There is also no good way to identify or determine if a system does or doesn't have ECC RAM installed. It's handled in hardware and there is no "good" way to determine your RAM type.

fracai · Dec 27, 2013

I think what no_connection is suggesting doesn't require checking for ECC. FreeNAS would just watch for errors reported by zfs and stop all pool activity if some threshold is crossed while working on the pool. Basically, if some operation (including a scrub) hits 100 errors in some short time frame (consecutive read / write operations? time based would be dependant on drive / CPU speed) stop all pool IO and post an alert.

It seems like this would be useful for non-ECC users as well as those with ECC.

cyberjock · Dec 27, 2013

Well, he said "But a system should not try to damage your data when ECC memory is not used." which means you must be able to identify when ECC memory is not used.

The problem is that it literally takes 1 error to make a pool unmountable. Why you'd wait for 100 is beyond me as it is quite possibly too late to save your data at that point.

ZFS does have some built-in limits where if enough errors are racked up the disk is removed from the pool. The problem is there isn't a surefire way to prove what was corrupted in memory and what was corrupted everywhere else. To add to the mess, you really can't do a whole lot to a pool without it being mounted. But if its being unmounted on errors you're in chicken and the egg.

I just don't see how this would:

1. Be useful to save data.
2. Be programmatically feasible. We can't just stop at 100 errors as many people will never be able to complete a scrub.
3. Why we should even care that much about people that want to use non-ECC RAM. It's one of the design considerations for ZFS. Why someone would then start changing those considerations is beyond me. Especially for a project as small as FreeNAS. We don't have that kind of developer resources.

fracai · Dec 27, 2013

Capping at 100 (there's probably a better number statistically) is meant to catch the case where some operation starts triggering a bunch of errors and it's more likely that these errors are corrupting data instead of correcting it. You're right that ideally you'd stop after one error, but as far as I'm aware, there isn't a way to determine if the error was successfully corrected or if it was "corrected" with garbage. Stopping at one means you'd be more likely to be stopping even when ZFS legitimately corrects a block.

The errors could be caused because: a disk was damaged and needs to be replaced, a cosmic ray flipped a bit on disk, non-ECC RAM has a stuck bit, ECC RAM has multiple stuck bits. The basic idea is that it's an alert to the user. If it's due to disk failure, the user can replace the disk and ignore the error, the pool should be fine after a resilver. If it's bad RAM, the user can run a test on the RAM and replace it, and then determine if their pool is still useable (is there a way to list which blocks had "corrected" errors and associate those blocks to files or pool data?).

To address your issues:
1) it might keep a pool from being destroyed if all corrupted blocks (before the limit is triggered) are file blocks rather than pool data.
2) include a way to disable or raise the limit if you've just run a test and know the RAM is not the source of those errors.
3) this isn't just for non-ECC users. ECC just raises the number of errors in RAM that can be survived.

cyberjock · Dec 27, 2013

fracai said:
The errors could be caused because: a disk was damaged and needs to be replaced, a cosmic ray flipped a bit on disk, non-ECC RAM has a stuck bit, ECC RAM has multiple stuck bits. The basic idea is that it's an alert to the user. If it's due to disk failure, the user can replace the disk and ignore the error, the pool should be fine after a resilver. If it's bad RAM, the user can run a test on the RAM and replace it, and then determine if their pool is still useable (is there a way to list which blocks had "corrected" errors and associate those blocks to files or pool data?).

There is no way because ZFS considers any "correction" to not actually be writing new data. In fact, if you mount a zpool readonly ZFS will STILL correct any "errors" it finds. The error should be fixed because it is assume you aren't actually changing the data on the pool. When the situation has to be chosen between fix it now using parity or hope you can fix it in the future with parity, guess which choice is better? Parity shouldn't be changing and not fixing it later is a less intelligent decision.

fracai said:
To address your issues:
1) it might keep a pool from being destroyed if all corrupted blocks (before the limit is triggered) are file blocks rather than pool data.

I dont think that's correct. Look at how few people have had only file blocks corrupted and not metadata. 0% from what I've seen.

fracai said:
2) include a way to disable or raise the limit if you've just run a test and know the RAM is not the source of those errors.

See above where I said there's no way to know the source or setting a limit. It's not programmatically possible. ZFS works on the assumtion that there is no corruption it can't repair, or you lose the pool. There is no in between.

fracai said:
3) this isn't just for non-ECC users. ECC just raises the number of errors in RAM that can be survived.

I don't think that's accurate.

ECC decreases the likelihood of bit flip errors to something unlikely to occur in your lifetime.

ECC halts systems that have RAM errors that it cannot correct for. So it either provides that cushion or the system halts. There is no in between.

Richman · Dec 27, 2013

fracai said:
Capping at 100 (there's probably a better number statistically) is meant to catch the case where some operation starts triggering a bunch of errors and it's more likely that these errors are corrupting data instead of correcting it. You're right that ideally you'd stop after one error, but as far as I'm aware, there isn't a way to determine if the error was successfully corrected or if it was "corrected" with garbage. Stopping at one means you'd be more likely to be stopping even when ZFS legitimately corrects a block.

fracai said:
To address your issues:
1) it might keep a pool from being destroyed if all corrupted blocks (before the limit is triggered) are file blocks rather than pool data.

No guarantee. Which is probably why nobody would bother to write such code.

fracai said:
2) include a way to disable or raise the limit if you've just run a test and know the RAM is not the source of those errors.

Your talking about ZFS which is the files system level. While I am not some hard core coder or programmer and only really play one on TV, I am almost certain that writing code for the FS level is quite a bit different than at the kernal level or the program level. Would almost Probably have to be some code in an in-between layer between ZFS and FreeNAS I would think. But it would probably have to look for numerical patterns of errors and not just a certain number in a certain time frame to have any measure of effectiveness and the patterns would have to be complex and target patterns that don't normally happen. What patterns don't normally happen unless RAM is bad? There would have to be a lot of test case scenarios I would imagine and even then there are no guarantees. I think the answer is best described in cyber's response number 3 here:

cyberjock said:
3. Why we should even care that much about people that want to use non-ECC RAM. It's one of the design considerations for ZFS. Why someone would then start changing those considerations is beyond me. Especially for a project as small as FreeNAS. We don't have that kind of developer resources.

In other words, why would they spend time on it when the development was targeted toward enterprise equipment (the ones that use ECC) and not a consumer driven appliance. Why would they spread themselves thin. I am sure the market is there and money to be made just as Synology did with their line but maybe more developers is needed or maybe the dev management is not interested. I am sure it would be possible but then the mathematical patterns needed to do it effectively would have to be researched and tested and since I don't think errors happen every day, I can see it being a long drawn out project taking a very long time and need a lot of patience.
Hey fracia, have at it. If you manage to accomplish this I am sure all of those wanting to use Non-ECC will either kiss your boots or send you a tin of cookies.

From the additional things about ZFS I have learn lately, the key points are, and correct me if I am wrong:
1. It was designed to work with ECC
2. Without ECC you don't realize (if not most) maybe half of the benefits of ZFS.
3. If an pool corruption happens for any reason making it un-mountable there are NO recovery softwares to help you recover data that work with ZFS
4. Data corruption could happen easier, I think mainly memory errors, (and I am still a little foggy on this point as to how) or more likely to happen using ZFS without ECC
5. With No. 3 and 4 in mind, your better off going with some other FS, unless maybe you are planning on upgrading hardware to implement ZFS along with ECC in the very near future, like a couple months.

DISCUSSION:
Number 4 above brings to mind a foggy questions like:
1. If you are reading data from a FreenAS box that has ECC to a desktop or laptop or other device that does not use ECC, can not data corruption or flipped bits occur on these system and sent back wrong or damaged data to be written to the NAS? Your data is corrupted anyway and eventually you will have some corrupt data from some source, ECC or not. ZFS does not protect against this type of data corruption coming back from networked systems.

ARGUMENT
Stated here that it only costs about $250-300 to implement a server MB with ECC, but that may be a lot for some geeks.

Even though ECC make perfect sense
Are the ones so voiced at not using NON-ECC at any cost under ANY circumstances:
1. Just tired of fielding post from some that lost their pools. Maybe but possibly not that simple.
FIX: Don't answer those posts or just direct them to the HW req. wiki and/or sticky that list the pitfalls and problems in a concise manner, which I have not seen anywhere yet.
2. Are they more geared to enterprise where it is just imperative to use ECC for business proposes and use the 'your stupid if you ever use anything but ECC' as a blanket, catch-all statement, disclaimer and liability waiver?

Who knows for sure but it still makes the most sense. I did find this article at the bottom helpful where I snipped a clip

QUESTIONS:
1. are there any stats kept as to how many FreeNAS Mini's were sold and how many implemented ZFS with non-ECC and had problems?
2. Any stats at all on the percentage of problems in a time frame. ( I would be interested in reviewing this info)
I haven't researched what other appliances implement ZFS and how they mitigate or handle issues and pool problems from memory issues. Maybe I will shortly.

I think one of the MOST COMPELLING REASONS to use ECC for myself is that, even though I would keep critical data backed up and could rebuild a pool, at some point I may not want to spend time rebuilding a pool that I didn't have to and replace data that I didn't have backed up which would mean downloading from some source. Maybe nothing more than a pain in the rear, but one I really may not want to deal with or spend time on. My time is valuable. I think after reading what I have that I am more afraid of frozen bits than flipped-bits. Makes me not want to go without ECC for even a couple months.

RHETORICAL QUESTIONS that I may research again, unless someone knows answers:
1. How often do memory errors happen outside of memory going bad?
2. How often does memory go bad, or is this just memory that has become error prone creating error more frequent or has a frozen bit?
3. How often do frozen bits happen on modules?

From what I could uncover, mem errors happen on a heavily used system like a server maybe a couple times a month while on a lightly used system like a PC, once every 6-12 months. I am assuming from what I have read that in some large data centers, google has reported 1 error every hour per GB of memory deployed but I think this may be in part due to voltage fluctuations, increased EMF from surrounding equipment, and other drawbacks from large arrays namely heavy use. It is said that corrupt data is usually manifest in the message, 'file unreadable' or 'file missing'. I have seen that a few times in the last ten years. Maybe once every 6-12 months.

Now that I have stated that my time is valuable, I probably spent more money in terms of time researching the ins and outs of ECC vs, Non ECC that I could have paid for 16GB of ECC RAM for two FreeNAS boxes. I wish I could get paid for his or wish every geek that asks this question would do their own instead of getting us to do it.

Little tipdbit from another site I found helpfull:
https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/
Conclusion

ZFS was built from the ground up with parity, mirroring, checksums and other mechanisms to protect your data. If a checksum fails, ZFS can make attempts at loading good data based on redundancy in the pool, and fix the corrupted bit. But ZFS is assuming that a correct checksum means the bits were correct before the checksum was applied. This is where ECC RAM is so critical. ECC RAM can greatly reduce the risk that your bits are not correct before they get stored into the pool.
So, some lessons you should take away from this article:

ZFS checksums assume the data is correct from RAM.
Regular ZFS scrubs will greatly reduce the risk of corrupted bits, but can be your worst enemy with non-ECC RAM hardware failures.
Backups are only as good as the data they store. If the backup is corrupted, it's not a backup.
ZFS parity data, checksums, and physical data all need to match. When they don't, repairs start taking place. If it is corrupted out the gate, due to non-ECC RAM, why are you using ZFS again?

Last Edited 12:25 12/28/2013

cyberjock · Dec 27, 2013

Richman said:
In other words, why would they spend time on it when the development was targeted toward enterprise equipment (the ones that use ECC) and not a consumer driven appliance. Why would they spread themselves thin. I am sure the market is there and money to be made just as Synology did with their line but maybe more developers is needed or maybe the dev management is not interested. I am sure it would be possible but then the mathematical patterns needed to do it effectively would have to be researched and tested and since I don't think errors happen every day, I can see it being a long drawn out project taking a very long time and need a lot of patience.

You are missing out on some key facts. Before ZFS existed when Sun was looking at creating the "ultimate" file system(aka the future ZFS) they had a choice between modifying a current file system and volume manager design or creating a whole new one they made the decision to create a whole new one because it was simpler than trying to modify any existing one. Even ext has limits to how much stuff you can "bolt on" despite it having considerations for future expansion.

Sun had one thing in mind.. to make money providing this file system. For that reason, they were able to easily make assumptions like what kind of disk controller you'd be using, what kind of RAM you'd be using, what kind of processing power you have, etc. They did not give a crap about anyone wanting to build a home server. You were not their target market... at all. They wanted big businesses to drop 5+ figure dollar amounts for contracts for server with Sun. PERIOD. Anyone with any delusions that ZFS is "designed for home servers" is delusional. PERIOD. Don't like it, too bad. Not everyone is trying to sell their product to the whole world all of the time.

Contrary to popular belief and what people might think, I think people building small servers with a single disk or 2 and non-redundant pools shouldn't even be using ZFS. Especially considering how many people are clueless with how to properly admin FreeNAS and ZFS. People with no FreeBSD experience are literally taking bigger risks that sticking with their tried-and-true OS of choice.

Richman said:
From the additional things about ZFS I have learn lately, the key points are, and correct me if I am wrong:
5. With No. 3 and 4 in mind, your better off going with some other FS, unless maybe you are planning on upgrading hardware to implement ZFS along with ECC in the very near future, like a couple months.

That whole "very near future" is a bit naive in my opinion. You have to keep in mind any corruption written to the pool is forever. It's already blown by using non-ECC RAM if bit-flips end up in the file system. And lets face it, its just stupid to use non-ECC for "a while" with the expectation you'll upgrade later. Just what you need is a ticking time bomb of corruption somewhere that bites you months or years later without warning. My argument is basically "do it right or not at all".

Richman said:
DISCUSSION:
Number 4 above brings to mind a foggy questions like:
1. Why are we so enthusiastic about discouraging Non-ECC use if we do popper back-ups since, if you are reading data from a FreenAS box that has ECC to a desktop or laptop or other device that does not use ECC, can not data corruption or flipped bits occur on these system and sent back wrong or damaged data to be written to the NAS? Your data is corrupted anyway and eventually you will have some corrupt data from some source, ECC or not. ZFS does not protect against this type of data corruption coming back from networked systems.

Sure, the files can be corrupted. But the zpool metadata can never be corrupted, which is what really matters. Just check out the thread of the guy copying pictures from his SD card to his test zpool on a FreeNAS machine he had just built. They were randomly corrupting and he couldn't figure out what was wrong. Turns out his desktop had bad RAM and was corrupting the jpegs in-memory before they were sent over CIFS.

Richman said:
I still think it is possible that the ones so voiced at not using NON-ECC at any cost under ANY circumstances are:
1. Maybe just tired of fielding post from some that lost their pools.
FIX: Don't answer those posts or just direct them to the HW req. wiki and/or sticky that list the pitfalls and problems in a concise manner, which I have not seen anywhere yet.
2. They are more geared to enterprise where it is just imperative to use ECC for business proposes and use the 'your stupid if you ever use anything but ECC' as a blanket, catch-all statement, disclaimer and liability waiver.

There's a certain amount of responsibility expected when you are a moderator. You should be taking care not to point people in the wrong direction. As such, I created THIS thread as words of warning.

Richman said:
I haven't researched what other appliances implement ZFS and how they mitigate or handle issues and pool problems from memory issues. Maybe I will shortly.

Most do ECC RAM. A few don't, and they've been flogged appropriately on Amazon and such. Some companies are more interested in making money than protecting your data. And it's your job to make that informed distinction and decision. If you can't make that informed distinction and decision when necessary maybe IT isn't for you.

Richman said:
I think one of the compelling reasons to use ECC for myself is that, even though I would keep critical data backed up and could rebuild a pool, at some point I may not want to spend time rebuilding a pool that I didn't have to and replace data that I didn't have backed up which would mean downloading from some source. Maybe nothing more than a pain in the rear, but one I really may not want to deal with or spend time on. My time is valuable.

And here's where I know for 100% certainty you did NOT read this thread before posting. If you did you'd know that backups won't save you from RAM corruption. /smh in serious disgust.

Richman said:
I am under the understanding that a memory error in a normal desktop (Non-ECC) would most definitely be preceded by the system wanting to do a scan-disk or check-disk routine.

ZFS is one of the only file systems able to check for corruption properly.

Richman · Dec 28, 2013

cyberjock said:
You are missing out on some key facts. Before ZFS existed when Sun was looking at creating the "ultimate" file system(aka the future ZFS) they had a choice between modifying a current file system and volume manager design or creating a whole new one they made the decision to create a whole new one because it was simpler than trying to modify any existing one. Even ext has limits to how much stuff you can "bolt on" despite it having considerations for future expansion.

Sun had one thing in mind.. to make money providing this file system. For that reason, they were able to easily make assumptions like what kind of disk controller you'd be using, what kind of RAM you'd be using, what kind of processing power you have, etc. They did not give a crap about anyone wanting to build a home server. You were not their target market... at all. They wanted big businesses to drop 5+ figure dollar amounts for contracts for server with Sun. PERIOD. Anyone with any delusions that ZFS is "designed for home servers" is delusional. PERIOD. Don't like it, too bad. Not everyone is trying to sell their product to the whole world all of the time.

Contrary to popular belief and what people might think, I think people building small servers with a single disk or 2 and non-redundant pools shouldn't even be using ZFS. Especially considering how many people are clueless with how to properly admin FreeNAS and ZFS. People with no FreeBSD experience are literally taking bigger risks that sticking with their tried-and-true OS of choice.

I think I didn't miss as much of that as you thought I did. Not sure what this matters now since iXsystems nor FreeNAS benefits from Sun or Oracle code now or any development code they implemented since they made it closed source. Wasn't the last open version moved to OpenZFS organization? So they can develop it as they see fit not according to some Sun Corp's original masterminded plan.

cyberjock said:
Yeah, except that whole "very near future" is naive. You have to keep in mind any corruption written to the pool is forever. It's already blown by using non-ECC RAM if bit-flips end up in the file system. And lets face it, its just stupid to use non-ECC for "a while" with the expectation you'll upgrade later. Just what you need is a ticking time bomb of corruption somewhere that bites you months or years later without warning. We keep saying "do it right or not at all". Guess what? You should "do it right or not at all". There is no 1/2 a**ing it. You do it right or you take the risk. Don't like it, too bad. Life sucks. Deal with it. I really am sick and tired of hearing people complain because someone engineered something that has requirements that they don't like. Guess what? Sun didn't give a crap about you. They cared about making money, and that meant NOT from you!

Which is why it scares me to even think the thought. Anyone ever tell you that you type like an angry old man?

cyberjock said:
Sure, the files can be corrupted. But the zpool metadata can never be corrupted, which is what really matters. Just check out the thread of the guy copying pictures from his SD card to his test zpool on a FreeNAS machine he had just built. They were randomly corrupting and he couldn't figure out what was wrong. Turns out his desktop had bad RAM and was corrupting the jpegs in-memory before they were sent over CIFS.

But if it was a copy then he still had the original pictures uncorrupted so he didn't loose anything in that case.
I don't see how the zPool Meta data is all knowing when you get a file from the pool and change the file on a machine somewhere and then save it back to the pool. If you have a bit bad on a memory mod on your PC or laptop when you save it back, it is corrupted period. Its like saving an original file. The zPool Meta data doesn't know what that file was supposed to be that came from your PC or laptop.

cyberjock said:
You and I are gonna have a problem here:

Not if you see where I made edits just before you posted.

cyberjock said:
For starters, there's a certain amount of responsibility expected when you have to be a moderator. You should be taking care not to point people in the wrong direction. As such, I created THIS thread as words of warning.

As I know you did ......

cyberjock said:
For second, you said "just direct them to the HW req. wiki and/or sticky that list the pitfalls and problems in a concise manner, which I have not seen anywhere yet." WTF thread do you think you are posting to? Did you even read this thread before posting here? I recommend you go back and read this thread before you and I have a serious problem. I spent hours writing up the post that started this thread. And I did it as a volunteer. What you just said could be amounted to a "f*ck you" directed at the OP... me.

I reworded it since I meant to write it as a means to analyze the hypothesis or theory for anyone still thinking that it is still bull to use non-ECC or that they think its a conspiracy. I realized after reading more information that it was almost irrelevant what others may still be thinking and at most worded it the wrong way so as to give the wrong impression. You obviously read it the wrong way and made the wrong impression from it. It wasn't meant as an f.u to anyone and one thing I am not trying to be is a PRick, a stick in the mud, pain in the *ss. I think I tried to get that across in other posts. Just trying to help and make the forums informative for others with easy to find info.

cyberjock said:
As for number one, I doubt anyone did. And FYI... iXsystems sold those with non-ECC because there was no alternative. PERIOD. There was no board at any cost that used ECC RAM. Guess what? Now that they do exist iXsystems is testing several right now as we speak!

And we all expressed interest in that subject and are on pins and needles.

cyberjock said:
Most do ECC RAM. A few don't, and they've been flogged appropriately on Amazon and such. Some companies are more interested in making money than protecting your data. And it's your job to make that informed distinction and decision. If you can't make that informed distinction and decision when necessary maybe IT isn't for you.

Even though I don't like to be a assumptive person, I assumed as much but wanted to make sure. I guess I know more than I realize.

cyberjock said:
And here's where I know for 100% certainty you did NOT read this thread before posting. If you did you'd know that backups won't save you from RAM corruption. /smh in serious disgust.

HERE is where you would be 100% WRONG in assuming with any certainty that I didn't read anything. I read all 5 pages and all of the stickies. I am actually not very lazy when it come to reading info.or forum threads. I was talking about original files as backups that a lot of people would have before copying them to the pool. There is no way for the original file to get corrupted if they are not touched by the NAS box. I guess I need to clarify myself when thinking with my fingers on the keyboard.

cyberjock said:
Why would you even think this? You do realize that ZFS is the only file system capable of detecting corruption, right?

Right, and I didn't think that. I failed to explain I was talking about error that corrupt the FS. CHkdsk does do this. Then I realized that it was probably a small number than ones that corrupt data and was mute and removed it.

cyberjock said:
I'm shocked that you'd even think another OS would

You must get shocked often as I was not talking about any specific error correction on data corruption. Was only talking about file system errors and other OS's do do this.

cyberjock said:
You realize EOS memory was a 1990s technology. It was for SIMMs and wasn't appropriate for DIMMs, hence the technology died 15 years ago. Some of those technological breakthroughs were necessary for ECC RAM to exist in the forum it does today.

I don't care about this anymore since I removed it before hand. Why are yo so quick at jumping all over a post? I don't think I had that up barely 5 minutes before I removed this one point or question ands you jumped all over it.

cyberjock said:
Yeah, totally wrong on those numbers. But... nobody has good numbers they've been able to provide. We've got Intel/Samsung/Corsair propaganda supporting error rates in excess of 5000 per 24 hours powered on, and others claiming just a few per power-on year. So your guess is as good as mine. If you had read this thread I'm pretty sure one of the forum admins has commented that they are disappointed in how little solid real-world testing has been performed to determine error rates and it's mostly left to companies with a vested interest in overestimating error rates for profits(which goes back to that stuff I said above about informed distinction and decision.

Yep, that is correct and exactly what I found from all the tons of reading I did. I am glad you took the time to type it so I didn't have to. It depends on sooooo many factors which is one reason there is no good data that is reliable. :D:p

cyberjock said:
Now go back and read this thread before I have to go kill a baby seal over your comments that make it obvious you are talking about what you have not read.

I am envisioning huge piles of dead baby seals all around your house with you standing over them with a wooden Flintstone's style club, a corn cob pipe and straw hat.

Richman · Dec 29, 2013

cyberjock said:
ECC doesn't work anything like whatever you are trying to explain. There is no "ECC module" at all. In fact, I have no idea where you get the idea that there is an "ECC module" on memory sticks at all.

I understood where jyavenard was coming from and what he meant. He was refering to 'Module' as synonimous with 'Stick. 'Stick' of RAM. Module of RAM. Ram is generally referred to as a 'Modules. ie: 'I installed two additional RAM modules in my system today'. He didn't mean there was some ECC mudule chip on the MODULE or STICK. Geeeez man, its pretty simple really to understand where someone is coming from but some have a hard time not with understanding a few but with understanding most.

cyberjock · Dec 29, 2013

Yeah, except nobody I've ever talked to, and none of the admins here have ever heard of someone say "I installed 2 RAM modules in the server". It doesn't matter, that whole misunderstanding is water under the bridge.

Richman · Dec 29, 2013

cyberjock said:
Yeah, except nobody I've ever talked to, and none of the admins here have ever heard of someone say "I installed 2 RAM modules in the server". It doesn't matter, that whole misunderstanding is water under the bridge.

Your right, it should be 'water under the bridge'. Why not delete that posts as I was thinking it doesn't lend itself to help the thread any. Not sure why I posted it except that I may have been in a weird mood. But FYI to anyone, the PCB that the memory is attached to are called RAM modules. They are not technically referred to as 'Sticks' or anything else. I personally have seen in many forums , them being referred to as 'Modules'.

EDIT
Just as jyavenard pointed out that the last M in DIMM means ....... Dual In-Line Memory MODULE. I think i have heard them referred to in forums as RAM Module or secondly DIMM a lot more than stick or anything else.

HTTP://lmgtfy.com/?q=DIMM

cyberjock · Dec 29, 2013

I've thought about it, just never got around to it. There's been talk of just locking all of the stickies because we're not really trying to have discussions about stickies. Usually it's pretty sound advice and we'll proven in the forums if you want to do some searching.

fracai · Dec 29, 2013

cyberjock said:
ECC decreases the likelihood of bit flip errors to something unlikely to occur in your lifetime.
ECC halts systems that have RAM errors that it cannot correct for. So it either provides that cushion or the system halts. There is no in between.

This basically comes down to a misunderstanding on my part regarding ECC. When I read "can correct single bit errors" and "detect, but not correct two-bit errors", I internally understood it as "cannot detect if more than two bits are in error". So, yeah, the likelihood of even two errors is really low and the odds of getting enough errors, in the right place to lead to a checksum that looks good, but isn't, probably exceeds the age of the universe.

I stand corrected.

DrKK · Dec 29, 2013

I'm going to call them modules from now on. Just as a matter of principle.

Richman · Dec 29, 2013

Technoid said:
If you are converting old hardware to a franken-nas and are coming from "a bunch of disks on a windows xp computer" ™, your data is still safer on a Freenas running ZFS in raidz on the franken nas than it was on the windows pc even without ECC...

Is it really? I was thinking it was the other way around, that data could get corrupted easier or faster on a system with ZFS and non-ECC ram than any other FS on any other OS. Anyone want to chime in their ideas or thoughts?

As a note, I have about 10 machines here at my house, none of them servers. I am going to boot them all up and do a memtest. Currious how many have bad RAM.
I wish that there was a way to know how many FreeNAS users build with commodity desktop equipment and had issues or didn't have issue and how long they have been running. Not that I am planning on trying non -ECC or trying to prove an argument for it. I am just really curious of the numbers.

Technoid said:
And really We shouldn't have to have this discussion since ECC should be the norm. (Funny enough almost all AMD cpu's support it and has done it since Athlon 64's, but try and find a motherboard that supports it....

Are AMD MB that support it really that rare? There are not many AMD server board that support it? I'm going to check and edit when I find out. Oh, your probably talking about there being NO desktop boards that support it? That makes sense as there are not many Intel desktop boards that support ECC either or that would solve everyone's pocketbook woes and would make this a non discussion.

panz said:
But after reading the _very good_ post of CJ I'm SCARED!

CJ scares all of us. He is a very scary guy(guarddog) that lives in a very scary doghouse, probably in a very scary end of town. I'll put that to music later.

cyberjock · Dec 30, 2013

One test I'm thinking about running...

The very first FreeNAS box I built had non-ECC RAM. It had some weird behavior. Every scrub it would find a handful of checksum errors on every disk. Mind you, ever disk passed every test I ever threw at them. But every scrub, every disk(all were 1.5TB disks I believe) would have 5-30 CHKSUM errors if you did a zpool status. Usually 200-500Kb were "repaired" and that was that. Nobody really understands what was going on or why it does it.

Today I noticed something. A friend is using ZFS on Linux with a non-ECC system(don't ask...) and I noticed when I do a zpool status on his Linux box with ZFS you get a "scrub repaired 432k". I save his zpool status every night for historical purposes. Well, I went back and looked and every scrub he's run since his desktop was built 8 months ago has 150-500kb that were "repaired". I'm wondering if there is a link between these "repairs" and non-ECC. Could this be a way to definitively prove that non-ECC is more destructive than we think?

So here's my plan. I have a system, my old FreeNAS box. It has a Westmere socket 1366 Xeon and can use ECC or non-ECC RAM. I'm thinking I might pull out a few spare working drives and make a pool, fill that sucker up with fake data and do some tests to see what is going on. Since all of the hardware will stay the same except the RAM it might be worthwhile to see what the result is. I'm fortunate to have this board and CPU as I can and have used both ECC and non-ECC RAM and I have confirmed that the ECC does work when I use ECC RAM. Hmm.... how devoted am I to this?

Important Announcement for the TrueNAS Community.

ECC vs non-ECC RAM and ZFS

Explorer

Patron

Patron

Patron

Inactive Account

Guru

Inactive Account

Guru

Inactive Account

Patron

Inactive Account

Patron

Patron

Inactive Account

Patron

Inactive Account

Guru

FreeNAS Generalissimo

Patron

Inactive Account

Similar threads