This makes absolutely no sense to me... and
Not sure why. It's actually pretty logical.
Say you have a 4-disk RAIDZ1. Each stripe has 3 disks of data and 1 disk of parity. Imagine you have 100% reliable RAM. If ZFS already knows that one of those four parts is bad, and knows which part, why would you NOT want to fix it? It knows something is wrong. If it willingly chose not to fix it, then you'd really have no redundancy for that stripe, since you're already missing 1 of the 4 pieces. So if you were to lose a disk at that moment you'd be in trouble: you'd start resilvering, and when you got to that stripe... oops. There's no protection for missing 2 of 4 pieces. Instant resilver failure. So either you acknowledge the error and fix it while you still have the ability to, or you ignore it and potentially have problems with the pool later. Unfortunately, the catch in "ZFS knows" is that ZFS is assuming the data isn't bad because of RAM, and that the fault lies somewhere else in the storage path, such as the cabling or the hard disk platters.
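The "oops" above is just arithmetic. Here's a toy sketch of single-parity redundancy using byte-wise XOR (real RAIDZ is more involved on disk, but the redundancy math is the same idea): with all four pieces intact you can lose any one piece and rebuild it, but once one piece is already bad, losing a disk leaves two unknowns and only one parity equation.

```python
# Toy model of a 4-disk RAIDZ1 stripe: 3 data blocks + 1 XOR parity block.
# This illustrates the redundancy math only, not the real on-disk format.

def xor_parity(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def rebuild(known_pieces):
    """Recover ONE missing piece from the other three (data + parity).

    XOR is its own inverse, so XORing the three surviving pieces
    yields exactly the missing one.
    """
    return xor_parity(known_pieces)

data = [b"AAA", b"BBB", b"CCC"]   # three data "disks"
parity = xor_parity(data)         # one parity "disk"

# Lose any single piece and the stripe survives:
recovered = rebuild([data[1], data[2], parity])
assert recovered == data[0]

# Lose TWO pieces (a known-bad block plus a failed disk during resilver)
# and the surviving two pieces no longer determine the missing data:
# one XOR equation cannot solve for two unknowns.
```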
It's no different than calling up the bank the instant you realize there's a bank error. You don't want to wait until later, when checks start bouncing because of an unauthorized withdrawal. You want it taken care of right now. You know something is wrong, so why wouldn't you fix it? ZFS just tries to proactively fix any errors as it goes. It makes perfect sense to me.
The only place that things go horribly wrong is if you aren't using ECC RAM. Then instead of a bad disk it's bad RAM locations, and ZFS does the wrong thing. Since ZFS was supposed to be the most reliable file system ever built, its designers had to make a choice: assume the RAM is good and trust that the disk is bad, or vice versa. Guess which one happens far, far more frequently, and even more so if you use ECC RAM. So they made the conscious choice to assume the RAM is good and that the disk is bad (or at least that something in the data path from the disk is causing corruption). If the corruption was a fluke from cabling or something, no harm done, since ZFS rewrote exactly what was already on the platters. But if the disk was bad, then the error gets corrected. The better way to verify all of this is to do regular scrubs, which are already recommended.
This is very different than any hardware RAID I've ever used. It doesn't write data back to the drive unless something has changed.
Oh, I promise you, if a disk has a read error the array WILL write that stripe again. It's not that anything changed; it's that you had a read error, so the array attempts to fix it by writing the stripe again. The hard drive remaps the bad sector to a spare sector, and the data recalculated from parity is written there. Now you have full redundancy. If it didn't do that, it would be just like my example above: you've technically lost some amount of redundancy for that stripe of data. So again, why would you willingly NOT fix the error when you can fix it right now, instead of waiting until later, when it might cause data loss for your array? I know what I'd do if I wanted to sell you my RAID controllers and didn't want customers complaining about corrupted data and failed array rebuilds from using my controllers.
I'm not sure how you could prove the reads and writes on hardware RAID without deliberately causing read errors. But with ZFS it's easy: zpool iostat 1 and/or gstat will show you, second by second, what the reads and writes are. If you were to write some trash data to one disk, then as you read your data back you'd see it being rebuilt on the disk you trashed. It would take some care, since you don't want to trash the partition table or the ZFS identifiers, but it's totally doable. I had a bad disk, so I got to see this first hand with my blazingly fast 5 MB/sec 6-disk RAIDZ2 pool.
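What that read-triggered repair looks like can be sketched in miniature. This is a toy model, not ZFS internals: assume a two-way mirror with a checksum recorded at write time, and a read path that rewrites any copy failing its checksum from a copy that passes (every name here is made up for illustration).

```python
# Toy sketch of self-healing on read: two mirrored "disks" plus a
# checksum recorded at write time. Not a real ZFS API.
import hashlib

def checksum(block):
    return hashlib.sha256(block).digest()

# Two copies of the same block; then simulate trashing one "disk".
good = bytearray(b"important data")
trashed = bytearray(b"GARBAGEGARBAGE")
recorded_sum = checksum(good)

def self_healing_read(primary, secondary, expected_sum):
    """Read primary; if its checksum fails, repair it in place from
    a copy that still matches the recorded checksum."""
    if checksum(primary) != expected_sum:
        if checksum(secondary) == expected_sum:
            primary[:] = secondary  # rewrite the bad copy on "disk"
        else:
            raise IOError("both copies fail checksum; no redundancy left")
    return bytes(primary)

result = self_healing_read(trashed, good, recorded_sum)
assert result == b"important data"
assert bytes(trashed) == b"important data"  # the trashed disk was repaired
```

The same logic applies with parity instead of a mirror copy: any read that fails its checksum triggers a rewrite from the surviving redundancy, which is exactly the write activity you'd watch scroll by in zpool iostat.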
1. I can't believe you'd use Wikipedia as a reliable source for data. /hangsheadinshame
2. Another wiki... really? Do you not trust technical documentation from Oracle and/or Sun? Not to mention that, if I'm not mistaken, Solaris ran only on specific hardware sold by Sun, and all of it had ECC RAM. OpenSolaris was the generally available software, and that's not what that wiki is for.
3. That presentation has been linked many times here. There's nothing surprising about it, and it's all been discussed to death in other threads. One thing someone pointed out is that the presentation you linked covers Sun's ZFS implementation, not FreeBSD's; as such, results may differ. Some things work a little differently in RAM on FreeBSD than in Sun's ZFS implementation, but they still produce the same on-disk final product. Don't ask me what those differences are, as I never used Sun products, and documentation from Sun's website was removed years ago when Oracle bought them. It's just like how FreeBSD's code didn't simply port over to Linux for the ZFS on Linux project, and that project clearly has major obstacles to overcome in some areas despite being derived from FreeBSD's code base. You may get the same final product on disk, but how you get there is different. But hey, that presentation does say ZFS "fails to maintain data integrity in the face of memory corruption". That sure sounds like something I would want to avoid... such as with ECC RAM.