Turning checksum off?

Status
Not open for further replies.

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
So, I have a question and I'm sure I'll get a variety of "you're an idiot" responses, but it's where I'm at.

I have a FreeNAS box built with the critical mistake of starting on non-ECC RAM, so I know where this conversation will head. I can build the replacement that's sitting in my Amazon cart in around 30-45 days. Obviously (which I somehow never really read about pre-installation), the scrubbing seems to occasionally be corrupting data due to memory or drive errors. The drives really shouldn't have been used in this thing, ever, but surprisingly they at least seem pretty stable. The data is inconsequential, some video that can easily be rebuilt, but I'd prefer to keep it and not have to do that, so let's consider the data critical. Plus, if I had to set it up again I'd do it differently anyway, so rebuilding wouldn't be the worst.

aaanyway.

I have turned off scrubbing and run zfs set checksum=off on my entire pool. Now, I know this is kind of dumb, but is my pool basically functioning as a non-ZFS drive, with no data verification at all and no way to really scan it? I know it's pretty unstable, but if it's sort of okay, is it any more unstable than, say, your average Windows computer that never gets scanned?
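For the curious, this is roughly what I did ("tank" stands in for my actual pool name; scrubs themselves I just disabled in the FreeNAS GUI scrub schedule). One thing I've since read, so take it as my understanding rather than gospel: checksum=off only applies to newly written blocks, so data already on the pool keeps the checksums it was written with and is still verified on read.

```shell
# "tank" is a stand-in for the real pool name.
zfs set checksum=off tank    # stop checksumming newly written blocks
zfs get checksum tank        # confirm the property took
```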

For anyone interested, new build is

WD RE 4TB SATA3 x 2
Intel Xeon E3-1230 (quad core)
Fractal Design Arc Mini R2
Supermicro MBD-X9SCM-F-O

32 GB Kingston ValueRAM ECC DDR3 (4x8 GB, 1333 MHz)

I think I was thinking a little better with that one, haha. Thanks for any help, guys. Just trying to possibly get it through until I move and stuff.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I won't even try to entertain the question "is it any more unstable than say your average windows computer unscanned?" because it's apples to oranges. There is no scanning that is anything remotely like ZFS's scrub. There's no such thing on *any* other file system anywhere.

Scrubbing is not and never will be responsible for corruption. If you have corruption that is occurring during scrubbing, your hardware is at fault... not the scrub. It's like saying I put a band-aid on a wound and because I put a band-aid on it I bled to death. If only I had not put a band-aid on it....

To be honest, and you probably know I'm about to say this, I'm really not sure why you are using ZFS at all. You chose to go with ZFS for data protection, but the errors are inconvenient so you've turned off the feature that warns you about problems. This is literally like saying "my check engine light comes on when I drive my vehicle, and that sh*t is annoying, so I unplugged the cable so I don't have to see it anymore". Does that make *any* rational sense at all? No, but that's exactly what you did. Why you'd then say in the same post that "the data is critical" while doing what you did is laughable at best and flat out stupid at worst.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Oh, forgot to add this...

So what you can expect is that at some point you'll have ZFS metadata corruption. It will probably cause ZFS to crash or the kernel to panic. You'll reboot the server and the zpool will be gone. When I say "gone" I mean no data is accessible as in 100% data loss. You can't use recovery tools on ZFS, so you are just S.O.L. So for "critical" data you sure aren't doing anything remotely aligned with that philosophy.
 

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
Yes, I know that's what I did, and I know ZFS wasn't at fault; it's semantics, man. See, I believe memory errors were confusing the scrub and causing it to "fix" data that didn't need fixing. The reasons I did what I did: since the data is really whatever, I figured I'd do whatever I could to prolong it, given that I have some weird hardware errors, which I no doubt do because files were corrupting.

A) ZFS checks files on every read against a checksum. If it's off, it attempts to correct it.
B) My hardware was faulty, causing it to "fix" things that probably weren't wrong.
C) Scrubs check ALL the crap, and therefore probably fixed more things that weren't wrong, or those errors were unrecoverable and the data was lost anyway. So if I'm losing it anyway, why scrub it? I'll just figure it out after I change the hardware.
D) The files are still being checked against a checksum on every read, so I turned that off too, to minimize the number of memory and checking interactions going on, which only access the files more, which only increases the potential for an error in some minute way. So yes, I realize the thing is unstable, but it's not as simple in logic as you claim either.
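To illustrate what I mean by A and B with ordinary shell tools (file names made up; this is just the idea, not actual ZFS internals):

```shell
# Write-time: data is stored along with its checksum.
printf 'movie data' > block.bin
stored_sum=$(sha256sum block.bin | cut -d' ' -f1)

# Read-time: the checksum is recomputed and compared.
# If a bit flipped on disk -- or flips in bad RAM while checking --
# the comparison fails and ZFS tries to "repair" the block.
printf 'movie dAta' > block.bin
read_sum=$(sha256sum block.bin | cut -d' ' -f1)

if [ "$read_sum" != "$stored_sum" ]; then
    echo "mismatch: repair attempted"
fi
```

With faulty RAM, the recomputed sum itself can be wrong, so a perfectly good block can look bad and get "repaired" into garbage. That's the interaction I'm trying to minimize.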
 

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
Oh, forgot to add this...

So what you can expect is that at some point you'll have ZFS metadata corruption. It will probably cause ZFS to crash or the kernel to panic. You'll reboot the server and the zpool will be gone. When I say "gone" I mean no data is accessible as in 100% data loss. You can't use recovery tools on ZFS, so you are just S.O.L. So for "critical" data you sure aren't doing anything remotely aligned with that philosophy.


I understand the pool may, and probably will die. What I'm asking, is if there is any less of a chance of that happening on faulty hardware with scrubbing and checksum off than with it on?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Simple answer is no. Your data will always be at risk with faulty hardware no matter what system you are running.
 

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
Simple answer is no. Your data will always be at risk with faulty hardware no matter what system you are running.

I think what I'm asking is if it's happening more slowly now. I view the thing as something that is essentially just rapidly corrupting data (relatively), and I'm trying to minimize the rate of that, understanding that it's essentially going down.

for the record I'll try to chronicle what happens here, and if I lose or survive. Maybe I'll run some tests on the hardware after I switch it up to see how bad it actually was. Also for the record, nobody do what I did. It's a waste of time and money :)
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The point is, if you have a faulty system you could be writing data incorrectly and reading data incorrectly. This has nothing to do with ZFS at all. If you were running UFS or NTFS or FAT the results would be the same.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
for the record I'll try to chronicle what happens here, and if I lose or survive. Maybe I'll run some tests on the hardware after I switch it up to see how bad it actually was. Also for the record, nobody do what I did. It's a waste of time and money :)

You'll never know how bad it was.. the very feature that tells you how bad it is.. checksums.. has been disabled!
 

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
You'll never know how bad it was.. the very feature that tells you how bad it is.. checksums.. has been disabled!

I mean the hardware itself, not the data. The data on the new box will have checksums and scrubs back on. I know that won't protect against past data loss, but it can prevent future loss. Or maybe I will make a brand-new pool and copy the files. Either way I'll have to copy the files, and if the pool metadata is corrupted, at least I can guarantee the new pool's metadata won't have any corruption, just possibly the data. And if I encounter a bad file every once in a while, I'll replace it with a fresh copy.
 

joelmusicman

Patron
Joined
Feb 20, 2014
Messages
249
I think the point others are making is that the data itself will be trashed, and the new system will generate checksums based on the bad data. Even once the data is migrated to a good system it will still be bad, and there's ABSOLUTELY NO WAY to tell until you try to open the file.
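To put it concretely, with ordinary shell tools (made-up file name): once the bytes are damaged before any checksum exists, checksumming can only preserve the damage, not detect it.

```shell
# The damage is already baked into the bytes before any checksum exists.
printf 'corrupted movie data' > block.bin

# The new pool dutifully checksums whatever bytes it receives...
new_sum=$(sha256sum block.bin | cut -d' ' -f1)

# ...so every later verification passes, garbage and all.
check_sum=$(sha256sum block.bin | cut -d' ' -f1)
[ "$check_sum" = "$new_sum" ] && echo "verifies clean despite being garbage"
```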
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Yeah.

Consider your pool trashed, now. That's the upshot.
 

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
I think the point others are making is that the data itself will be trashed, and the new system will generate checksums based on the bad data. Even once the data is migrated to a good system it will still be bad, and there's ABSOLUTELY NO WAY to tell until you try to open the file.


Yes, Joel, thanks. I mean, I've also stated that I understand that and have no problem replacing it. For some reason people fail to answer the simple yes-or-no question here: is it possible it's slowed down the actual rate of corruption, or not?

Instead, everyone is intent on saying "your pool is screwed". Which it very well might be, but that is not the question at all, so those responses are ignored, because they're stating a fact that was stated from the beginning.
 

joelmusicman

Patron
Joined
Feb 20, 2014
Messages
249
is it possible it's slowed down the actual rate of corruption, or not?


I have no idea how to actually accomplish this, but I think what we're having a hard time understanding is if you don't have a problem rebuilding the pool, why do you care about limping along and "slowing" the decay?

Anyway, as CyberJock mentioned, your hardware is causing the decay. Taking steps in ZFS only masks the symptoms.

What I would do in your case is to run memtest for 48 hours, remove/replace bad RAM, then run long SMART tests on all your drives. Basically, look for the underlying hardware issue and try to repair that before worrying about existing corruption. For all we know, it could just be a bad SATA cable...
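If it helps, the drive checks look something like this (device names are examples; on FreeBSD/FreeNAS disks are usually /dev/ada0, /dev/ada1, and so on, so adjust for your system):

```shell
# Start a long self-test; it runs inside the drive itself, in the background.
smartctl -t long /dev/ada0

# Afterwards: read the self-test log and attributes, paying attention to
# Reallocated_Sector_Ct and Current_Pending_Sector.
smartctl -a /dev/ada0
```

Memtest86+ is bootable, so that one runs from a USB stick, not from the OS.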
 

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
I was considering that. I guess I'll hit the memtest and see. Since I only have 4 GB, I can't really stand to lose a stick till I replace this box. I have run SMART tests, long and weekly, because I knew the drives are a ticking time bomb, and they seem to be fine; if you want I'll post my smartctl results, but it doesn't seem like it's an issue. I have had a strange suspicion that it's the RAM.

I hit it for 24 hours before I ever installed FreeNAS on the computer, but maybe I should try longer. It has been on for some time now.

edit: and I care about limping along and slowing the decay because I want to be able to watch movies for the next month or so when I have a few minutes. ;)

By mid-May I should be able to easily put out the money for the right thing and get this done with.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Yes joel, thanks. I mean I've also stated that I understand that and have no problem replacing it. for some reason people fail to answer the simple yes or no question here. is it possible it's slowed down the actual rate of corruption, or not?

Instead, everyone is intent on saying "your pool is screwed". which it very well might be, but that is not the question, at all, so those responses are ignored, because they're stating a fact that was stated from the beginning.

I thought I answered it pretty straightforwardly in my first post to this thread: the answer to your first question is "no". I didn't simply say "no" to your second question; I tried to explain with some detail, but apparently you only want a simple yes or no for an answer and care less about the rationale. So to answer your second question on the rate of corruption: yes, you probably slowed it down, because you are no longer running checksums.

But does that really matter if you can't trust any of your data?

So assuming you have nothing of value on the FreeNAS device, no big deal unless you have programs you could run. I wouldn't trust an .exe file or .bin, etc... If it's just video files, no big deal, as you said you can just rebuild them.

As for how to treat your old files/pool: my opinion is to toss all your old data. Again, if it's movies, you can transfer those and if they work, great; but programs/applications and certain data would be best thrown away, as running a corrupt application could harm the computer it's run on.

Just my opinion.
 

dkran

Dabbler
Joined
Feb 21, 2014
Messages
35
I thought I answered it pretty straightforwardly in my first post to this thread: the answer to your first question is "no". I didn't simply say "no" to your second question; I tried to explain with some detail, but apparently you only want a simple yes or no for an answer and care less about the rationale. So to answer your second question on the rate of corruption: yes, you probably slowed it down, because you are no longer running checksums.

But does that really matter if you can't trust any of your data?

So assuming you have nothing of value on the FreeNAS device, no big deal unless you have programs you could run. I wouldn't trust an .exe file or .bin, etc... If it's just video files, no big deal, as you said you can just rebuild them.

As for how to treat your old files/pool: my opinion is to toss all your old data. Again, if it's movies, you can transfer those and if they work, great; but programs/applications and certain data would be best thrown away, as running a corrupt application could harm the computer it's run on.

Just my opinion.


Okay, I thought it had slowed down. And no, I no longer intend to "save" the pool once I rebuild the server. I will make a new pool, simply copy the videos, and go from there. This is why I was looking for the information I was looking for. If this was at least slowing it down, that's fine for now. If I do lose the whole system, that's fine, but if I can salvage the movies by not corrupting my metadata, then I'll just remake all my jails / settings / etc. and have the videos. If a few don't play, no big deal. Thanks, I appreciate it.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Not sure how much data you have but a cheap 1TB external USB drive would hold a lot of movies, well unless they are all 50GB in size. You could start moving that data off. If you can locate the faulty RAM, Power Supply, MB, CPU, Video, or whatever is causing the problems, you could rebuild your current pool and make use of the system until you do get your new system. It depends on how much effort you want to put into it.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
Since it might be useful to others, I will mention a step that I consider better than checksum=off (it is not a solution, just a better step...). If you want to preserve your data, you should do this for all your filesystems (except for .system):
zfs set readonly=on Your_Filesystem
You cannot do that just once for the zpool. When you need write access, toggle it back to off, then set readonly=on again. If you have .system, you need to remove it first. In order not to lose system messages, please set up a loghost that would collect events. There are free editions of syslog servers for Windows.

In your current setup, doing
zfs set atime=off Your_Filesystem
will also reduce the number of writes, and each write is a definite possibility of corruption. You can do that to your filesystem(s) too, when temporarily enabling writes.
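The cycle looks like this ("tank/data" is a placeholder for your filesystem name):

```shell
# "tank/data" is a placeholder filesystem name.
zfs set readonly=on tank/data   # block writes (set per filesystem, not per pool)
zfs set atime=off tank/data     # stop access-time updates on reads

# When you genuinely need to write something:
zfs set readonly=off tank/data
# ...copy your files...
zfs set readonly=on tank/data
```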
 