Non-ECC and ZFS Scrub?

Status
Not open for further replies.

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
Basically I've just been reading all over the forum today. I can't remember exactly where I read it, and perhaps I just misunderstood, but one post seemed to imply that a ZFS scrub can't even function without ECC RAM, since it relies on ECC to actually detect and correct errors. Is that correct?

If not, another post I was reading seemed to imply that running a scrub on non-ECC RAM with a bitflip would basically snowball and corrupt the entire pool. Is that correct?

If either of these is true, would it be safer and/or more processor-efficient to simply disable the scrub feature until I acquire ECC RAM?

And yeah, before you get into it: I've read the ECC topics and know all the risks, and I currently have an upgrade plan I'm saving for to get up to an ECC CPU/MB/RAM combo. For now I have to wait, so until then I'm just making multiple full backups manually.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You misunderstood. Scrub works fine without ECC RAM.

The problem is that if ZFS reads a block and sees an error, it will actively try to correct it. This happens during scrubs or any other reads from the pool. So anything that appears to be a disk error but is actually a memory bit error is very bad. ZFS also lacks anything like a "fsck" to repair damage to the pool; pools are expected to be consistent at all times. So if you get bad RAM, ... well, you can have a bad day real quick. ECC is the most reliable cure.
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
OK, cool, so the first part was just a misread. On the snowball part: if, say, a bitflip is occurring, what I was reading seemed to say that since the bitflip would cause the checksums to be wrong on everything, ZFS would basically start "fixing" everything, and just fixing it with garbage?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yup. That's the part that's terrifying. Scrubs are basically just a case of "everything on the pool gets read" and it could reasonably be argued that it is an excellent opportunity for things to go massively wrong.

A machine with bad memory that isn't reading anything from its pool is fine. A machine with bad memory that is reading stuff, for any reason, be it scrub, "read only" NFS mount, or anything in between, is in danger.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
> Yup. That's the part that's terrifying. Scrubs are basically just a case of "everything on the pool gets read" and it could reasonably be argued that it is an excellent opportunity for things to go massively wrong.
>
> A machine with bad memory that isn't reading anything from its pool is fine. A machine with bad memory that is reading stuff, for any reason, be it scrub, "read only" NFS mount, or anything in between, is in danger.
So would it be safest to disable Scrubs altogether if my RAM is non-ECC? I would have to replace motherboard and CPU as well as RAM to get ECC, and for financial reasons that's not going to happen for at least a few months.
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
> So would it be safest to disable Scrubs altogether if my RAM is non-ECC? I would have to replace motherboard and CPU as well as RAM to get ECC, and for financial reasons that's not going to happen for at least a few months.

Yes, safer to disable until you have ECC. That's basically what I was coming in here to check on; while my question wasn't as direct as yours, it's essentially the same for all intents and purposes. I was just asking it based more on the details of why. But yeah, basically if you don't have ECC RAM and you have a bitflip going on in your RAM, there is a possibility that a scrub will snowball and corrupt your entire pool. The scrub reads the data and verifies it, but if the verification fails (which it very likely will) it will attempt to fix the data, and since the bitflip is there it'll fix the data with garbage instead. So while a scrub is designed to protect the integrity of the data on your pool, if you have a bitflip in your RAM it'll do the exact opposite.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm really confused. You can't outsmart ZFS with non-ECC by choosing to not use scrubs. Scrubs are required to ensure the data integrity of your pool. Choosing to not do scrubs is like deciding you won't ever change your car's oil because you're scared you might get debris in the engine. It's something you HAVE to do whether you like it or not. You have to do regular scrubs if your data is important. And let's face the truth here: you didn't choose to go with ZFS because you don't care about your data.

In short, do it right or not at all. Anything less is risky and you are totally at your own peril. Keep in mind that if you have a disk fail and you need to resilver, you will literally be doing a scrub (just renamed to "resilvering"). So there's no getting out of doing scrubs. ZFS does its own scrubbing of the data as you access it. A "scrub" is nothing more than traversing your entire data structure in the pool to check for anomalies.

I can't tell you how many haven't done scrubs and lost data as a result, but I don't submit to any notion that if you are using non-ECC RAM the solution is to not do a scrub. For different reasons (but still conceived under the idea that your data is important) it's important that you use ECC RAM, just like it's important that you do regular scrubs.
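
To make the "a scrub is just a full traversal" point concrete, here is a minimal Python sketch. It is a toy model, not ZFS code: the `Block` class, `make_block`, and `scrub` are hypothetical names invented for illustration, and SHA-256 simply stands in for whatever checksum a pool happens to use.

```python
import hashlib
from dataclasses import dataclass, field


def checksum(data: bytes) -> str:
    """Stand-in checksum (assumption: SHA-256 chosen only for the example)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Block:
    """Hypothetical block: payload, the checksum recorded at write time, children."""
    name: str
    data: bytes
    expected: str
    children: list = field(default_factory=list)


def make_block(name, data, children=None):
    # Record the checksum at "write" time, the way the toy pool would.
    return Block(name, data, checksum(data), children or [])


def scrub(root):
    """Visit every block reachable from root and report checksum mismatches."""
    bad = []
    stack = [root]
    while stack:
        block = stack.pop()
        if checksum(block.data) != block.expected:
            bad.append(block.name)
        stack.extend(block.children)
    return bad


leaf_a = make_block("a", b"hello")
leaf_b = make_block("b", b"world")
root = make_block("root", b"metadata", [leaf_a, leaf_b])

clean_report = scrub(root)      # nothing has changed since "write" time
leaf_b.data = b"w0rld"          # simulate silent on-disk corruption
damaged_report = scrub(root)    # the traversal now flags block "b"
```

The toy only reports mismatches. What happens after a mismatch is found (repair from a redundant copy) is where the RAM discussion in this thread comes in.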
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
Yeah, I'm just saving up for the ECC upgrade at the moment, since my current build is mainly just a recycle build and, as I understand it, I'd have to upgrade CPU/mobo/RAM to get ECC support. I'm just figuring that preventing as much risk as possible until then is the best route to go. As I said above, I'm keeping multiple manual backups of all the data elsewhere that never even touch the ZFS system in the meantime, as a precaution until the ECC upgrade.

So anyhow, I take it from your post that you're saying keeping scrubs enabled is the best practice until I get the ECC? (And really, now that I think about it, since I have all the data independently backed up it would just be a minor nuisance if it happened before then.)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
> So anyhow, I take it from your post though that you're saying that keeping scrubs still enabled until then is the best practice until I get the ECC? (and really now that I think about it since I have all the data independently backed up it would just be a minor nuisance in the event it happened before then)

To be honest I'm not sure what I'm saying to you. You're asking to pick between 2 very bad options that are self inflicted and then asking for which is worse so you can avoid that situation. The reality is that the whole problem is self inflicted and should be resolved instead of asking to choose between the bad options. To me, it's like a compulsive gambler ending up broke and homeless and when you give him $100 so he can have a few warm meals he thinks "gee, if I go to the casino I could end up with more than $100, so I should definitely do this". The reality is that it's still just a dumb idea to continue chasing this pipe dream. You really are gambling by choosing to not do scrubs and/or by using non-ECC RAM. Sure, some people do win at the casino. And just like the casino there's plenty of users in this forum that still unwisely use non-ECC RAM and never have a problem.

If I were in your shoes I wouldn't even be using ZFS to store data with non-ECC RAM except to prove that I can manage FreeNAS properly (aka a test system with no real data). For the same reasons that scrubs can trash your data with bad non-ECC RAM, just using the pool for day-to-day activities can trash it too, since ZFS does its own self-repair as you use it. I won't build systems for friends that use non-ECC RAM. The reason is simple. As soon as you start using non-ECC RAM you are compromising one of the pillars of the design of ZFS. Why would you deliberately do this to yourself, then turn around and lie to yourself and say "but my data is safe"? Reality check: it's not.

I'm not really sure why you are even spending time trying to maintain a FreeNAS server with any data on it except for testing. And if you really truly are only testing it, then why do you even care about any of this? The reality of it is that testing is exactly that. It's testing. As in, "I plan to destroy the pool before I go to production anyway." So who gives a poop if you do scrubs or not, or if you plan to use non-ECC RAM or not. It's either for testing or it's not. And the fact that you are trying to figure out which is worse between non-ECC RAM and not scrubbing tells me you aren't using it "just for testing".

In short, I'm really not seeing how this is some kind of "meet me 1/2 way" negotiation with ZFS. Whether your pool dies a horrible death from using non-ECC RAM that fails or because you chose not to scrub, your data is still just as gone.

To me, you should stop and see if you are lying to yourself. Because from my perspective you are compulsively lying to yourself about it being "for testing" or "for production" and what those phrases entail. Either it's "for testing" and we shouldn't be discussing which is worse, or it's "for production", in which case you've made one grave mistake in hardware and are trying to figure out how to fix it with software options (hint: you can't).

No offense to anyone intended with this post. Just trying to be honest for your own sake, your own data's sake, and your own sanity.
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
> If I were in your shoes I wouldn't even be using ZFS to store data with non-ECC RAM except to prove that I can manage FreeNAS properly

Yeah, that's pretty much what I'm doing: just playing around with various configurations to see what works best for me before spending actual money, using actual data to simulate real-world use. But it's data that I can willingly nuke from the NAS at a moment's notice (and have), and that I have safely elsewhere.

I'm completely aware of all the risks of non-ECC and what-not which is why I said I don't intend to use non-ECC for much longer (basically until I can afford to get an ECC setup which should be maybe a month or two at most.)

In terms of the production/testing, it's sort of a half-half on that. It's testing in that I set up the machine to see how I like FreeNAS and what-not before spending money on a proper FreeNAS hardware setup and I have all of the data on 3 other independent machines as well (prior to said data ever touching FreeNAS). But it's production in the sense that I use the FreeNAS for actually accessing the data now (as a way to test performance etcetera) and that the data I'm using would be what I would continue using once I have a proper setup.

Only reason I care I guess would be the inconvenience factor of migrating data back in in the event of a corruption, but since all of the data is stored in multiple backups that are completely independent of the FreeNAS (ie data in said backups never even touches the NAS), losing the pool to corruption really would be just a minor inconvenience of a 2 hour re-migrate. (And yes I've already done a pool loss simulation the other day by just going in and nuking the drives. I'm back up in ~2 hours) So yeah I'm just really being obsessive over something that in these terms really isn't a concern since it's only about a month or two longer that I'll even be using non-ECC anyhow. (That's just my personality though to obsess over stuff I shouldn't). So yeah I guess this topic is irrelevant in the end.

So yeah, good perspective there from you, gets me to think about why I'm worrying and if I should. Thank you.

That said I may have a quick question about an ECC build in a second just gotta find that post with a build from iirc DrKK somewhere first.
 

kzrussian

Cadet
Joined
Dec 4, 2013
Messages
3
After reading a whole bunch, I had the exact same question about disabling SCRUB. Thank you for answering that. :)
Here is my response to "In short, do it right or not at all.":
I don't have ECC RAM and therefore wanted to set up a RAID10 using the UFS filesystem. The problem I ran into is that GEOM (gstripe and gmirror) is not compatible with GPT whole-disk partitions.

Now I'm trying to figure out if using ZFS without ECC would be worse for my data than using UFS.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That's not a question with a simple answer.

Personally, after you factor in the fact that most people aren't familiar with FreeBSD and the potential risks of ZFS with non-ECC, I'd think that using an OS you are familiar with and the file system that it uses natively would be safer.

Let's face it, if you had some pool problems next week do you think you could solve those problems on your own? Are you REALLY going to rely on total strangers to give you the correct answers to save your data? What if we all up and left tomorrow?

When I went with FreeNAS back in February 2012 I knew nothing about FreeBSD except that it was harder to build a system for and much harder to maintain than Windows. That was it. I wasn't about to use FreeNAS with my precious data until I was sure I could handle things on my own. I spent a month solid working on FreeNAS trying to ensure I knew what to do for every possible scenario. I don't want to rely on someone else to give me the answer (in case they don't) and I don't want them to accidentally give me the wrong answer either! Plenty of people here give bad advice at the wrong time, and before you know it you've damaged your pool in a way that can't be undone. Of course, the bad advice givers won't lose sleep over it, but I can bet you will.

So really stop and ask yourself what the goal is. If it isn't to "store my data as safely as possible" and that still makes you think ZFS with non-ECC over Windows, feel free to go with it. I know if I weren't using ZFS with ECC RAM I'd be using Windows. At least I know what I'm doing with that OS, even if it isn't the fastest or cheapest.

What you need to do is think about what you think is safest for YOU. That's the OS and file system you should be going with. Whether that's using Windows with NTFS, learning FreeNAS with ZFS, or paying someone to maintain your FreeNAS and ZFS server for you.

YOU need to do what is best for YOU.
 

tingo

Contributor
Joined
Nov 5, 2011
Messages
137
Does anybody have any real evidence of memory problems (preferably with non-ECC memory) crapping up a zfs pool?
I have been running machines for more than ten years now, never with ECC memory, and I have never had a problem with memory that caused any harm to filesystems. I have only been running zfs for a few years, so my experience with it is a lot shorter.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No. But we've seen people who have lost pools, and who were told to test their memory, and found it to be faulty.

So connect the dots. Why are we discussing this, again?
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
> Does anybody have any real evidence of memory problems (preferably with non-ECC memory) crapping up a zfs pool?
> I have been running machines for more than ten years now, never with ECC memory, and I have never had a problem with memory that caused any harm to filesystems. I have only been running zfs for few years, so my experience with it is a lot shorter.

No, but read up on how ZFS works and it's quite clear how that happens. When ZFS reads your data it performs a checksum on it. If you have bad RAM, that checksum will calculate to be incorrect. The machine will then attempt to repair this with "correct" data, but because of the incorrect bit in RAM the "correct" data is actually just corrupted garbage, so it writes that garbage to the disk assuming it's correct and gives it to you. Now, in the event of a ZFS scrub, it is reading every bit of data on the entire pool, performing said checksum on every single bit of data on the entire pool, and as such "correcting" every bit of data on said pool. Voila! Now every single bit of data on your entire pool is complete garbage.
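
The scenario described above can be played out in a small Python toy. It is a sketch under loud assumptions, not a model of ZFS internals: `ToyMirror`, `read_through_ram`, and `scrub_block` are made-up names, the "bad RAM" is a single bit that flips on every read, and the toy writes the replacement copy back without verifying it a second time. It only illustrates the worry being discussed in this thread.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in checksum (assumption: SHA-256 chosen only for the example)."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


class ToyMirror:
    """Hypothetical two-way mirror holding one block and its stored checksum."""

    def __init__(self, data: bytes):
        self.copies = [data, data]
        self.stored_checksum = checksum(data)


def read_through_ram(data: bytes, stuck_bit=None) -> bytes:
    """Model copying a block into RAM; a faulty cell flips one bit on the way."""
    return data if stuck_bit is None else flip_bit(data, stuck_bit)


def scrub_block(mirror: ToyMirror, stuck_bit=None) -> str:
    in_ram = read_through_ram(mirror.copies[0], stuck_bit)
    if checksum(in_ram) == mirror.stored_checksum:
        return "ok"
    # Mismatch: fetch the other copy, which passes through the same RAM.
    replacement = read_through_ram(mirror.copies[1], stuck_bit)
    # ASSUMPTION for the toy: the replacement is written back unverified.
    mirror.copies[0] = replacement
    return "rewritten"


original = b"important data"
mirror = ToyMirror(original)

healthy_result = scrub_block(mirror)              # good RAM: nothing is touched
faulty_result = scrub_block(mirror, stuck_bit=3)  # bad RAM: copy 0 is overwritten
copy_0_damaged = mirror.copies[0] != original
```

In this toy, the on-disk data was fine until the faulty read made it look bad, and the "repair" is what actually damaged copy 0. That is the "fixing it with garbage" concern in miniature.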
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It's just as jgreco said. We've had lots of people who just up and had pool problems suddenly, without warning. Eventually the only thing found "wrong" was bad RAM, and their pool (and backups, if they had them) suffered horribly. If you can't put 2 and 2 together, feel free to use non-ECC. But I'll close my browser tab when you come here begging for help, as I already know it's an exercise in futility. There is no fixing that. We've tried. We've failed every time.

And why are we discussing this AGAIN? Makes no sense to me.
 

ZFS Noob

Contributor
Joined
Nov 27, 2013
Messages
129
The main site I'm running now was running on a single server a decade ago. That server had an error light that would come on if it encountered a checksum error in memory, and that happened once in the 2-3 years I used that server. Call it "cosmic radiation" if you like, but "randomness" works, and memory chips are physically smaller now so I don't know if that increases or decreases the odds of a single bit getting flipped randomly. It's not a common thing, but it happens, and when it happens it has consequences, which is why people talk about it.

I've seen dozens of computers with memory that just "went bad" one day and started causing odd problems. When nothing else made sense, I'd change the memory with stuff I knew was good and a surprising percentage of the time that would resolve the strangeness a client was seeing.

ECC RAM isn't nearly as expensive as it used to be. I just bought a used server to test with: $800 for a dual-Xeon Dell R710 with 72G of RAM and dual power supplies. The SAS 6/i cost an additional $29, and I needed to provide my own hard drives and trays (the trays are ~ $14 each on Amazon for new knock-offs, by the way). That's not terribly expensive for a NAS/SAN platform. Do you value your data more than $800? If so, spend the money. If not, don't worry about it - maybe you'll be fine. Or maybe you'll buy a 4TB USB drive and remember to back up all your data to it every month over the network...

(This isn't an argument for the 710 by the way. The SAS 6/i works after flashing the BIOS, but I couldn't get it to recognize the SSD I'm planning to use for a SLOG, so I've got a PCI card that a SSD mounts to coming in the mail instead...)

I think the argument here is this: Sun designed ZFS with some assumptions about the hardware that would run it (quality Sun-based stuff), and where it would live (clean, constant power, or a UPS). Those design decisions meant a whole lot of stuff could be done away with that other filesystems require. Like fsck ability. I can't speak to the resilient nature of ZFS (check my username), but my plan if a Zpool goes down on me is to do a quick restore of my VMs to get up and running ASAP, either on that machine or another one. That's why I take hourly backups. If most ZFS users think like that, then developers can work on cool hacks like compressing the ARC data to increase the size of the ARC beyond the size of physical hardware, rather than building tools to recover from conditions that shouldn't happen in the first place.

I kind of like it that way, but I've got more resources than I used to so maybe I'm elitist.

(By the way, what are you really giving up by not using ZFS in your deployment? Do you need high IOPS, or is maxing out your gigabit connection while streaming movies enough?)
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
> That's why I take hourly backups.

Except if it takes you longer than an hour to notice the issue you're fucked, as the backup will also be bad. And if you're talking about ZFS snapshots as "backups", then those will be fucked up even if they were taken well before the corruption in RAM occurred.
 

ZFS Noob

Contributor
Joined
Nov 27, 2013
Messages
129
>> That's why I take hourly backups.
>
> Except if it takes you longer than an hour to notice the issue you're fucked as the backup will also be bad. And if you're talking about ZFS Snapshots as "backups" than those will be fucked up even if they were taken well before the corruption in RAM occurred.

If I could only go back one backup, you'd be right. Luckily for me my backup software is a bit more flexible than that...

Regardless, backups are important. And a good backup can be a life-saver. If you design your system well you can choose any hourly backup in the last two weeks, any daily backup for the prior two months, and so on and so forth back an arbitrary distance in time.
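
As a rough illustration of the tiered retention described above (every hourly backup for the last two weeks, then one per day for the two months before that), here is a minimal Python sketch. The function name `backups_to_keep` and the window constants are assumptions made for the example, "two months" is taken as 60 days, and this is not the behaviour of any particular backup product.

```python
from datetime import datetime, timedelta

HOURLY_WINDOW = timedelta(days=14)   # keep every backup this recent
DAILY_WINDOW = timedelta(days=60)    # assumption: "two months" taken as 60 days


def backups_to_keep(timestamps, now):
    """Return the set of backup timestamps a tiered policy would retain."""
    keep = set()
    days_covered = set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        age = now - ts
        if age < timedelta(0):
            continue  # ignore timestamps from the future
        if age <= HOURLY_WINDOW:
            keep.add(ts)
        elif age <= HOURLY_WINDOW + DAILY_WINDOW:
            # Keep only the newest backup of each calendar day in this tier.
            if ts.date() not in days_covered:
                days_covered.add(ts.date())
                keep.add(ts)
    return keep


now = datetime(2014, 1, 1)
recent = now - timedelta(hours=5)
older_late = now - timedelta(days=20, hours=1)   # newest backup of its day
older_early = now - timedelta(days=20, hours=3)  # same day, so thinned out
ancient = now - timedelta(days=200)              # outside both windows

kept = backups_to_keep([recent, older_late, older_early, ancient], now)
```

With a scheme like this, noticing corruption a few days late still leaves older restore points to fall back on, provided those older backups were themselves written from good data.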

Re: Snapshots as backups. I like the concept - if your server gets hacked you can quickly restore pre-hack. It's not really a backup in my mind though, because loss of your primary storage = loss of the backup as well, which defeats the purpose. I like snapshots on the primary SAN, another SAN either as failover or as an idle device ready to step in when necessary, a local backup server, and an off-site backup server.

But I've been accused of being paranoid about data before, too...
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Assuming we are still talking about non-ECC going bad with ZFS:

How are you supposed to do backups when your file data will be read into RAM, trashed by your bad RAM, then handled by your given app (rsync, cp, ESXi tool, whatever)? The WHOLE problem is that ZFS will crap all over your data before any application can use it. Garbage in = garbage out.

Rsync won't work in your situation since you have live files updating constantly, but rsync will blow its top and trash your entire backup system because every file you read from the source machine will be trashed in RAM, then pushed over to your backup server where it will be "updated" with your trashed file. Thank you rsync! LOL. That is precisely how backups get fscked up while you're trying to be smart and have good reliable nightly backups like any good IT person would.

ZFS snapshots end up with corruption closely in line with how rsync does it.

But surely you aren't using rsync, because VMs are constantly being updated.

Snapshots + Replication is an excellent backup. That's WHY it was created with ZFS. Can't be building a yottabyte file server if you can't build a good way to back it up, right?
 