ZFS - non-ECC bit flip risk

Status
Not open for further replies.

berd1921

Dabbler
Joined
May 13, 2013
Messages
11
So, I found some corrupt photo files on my FreeNAS RAID-Z system. After running SMART tests and finding nothing, I read about how ZFS is susceptible to errors in RAM. I'm going to run memtest overnight to see if I have failing RAM, but now I'm worried about my non-ECC system in general.

So, my question is this: is my system only susceptible to bit flipping in non-ECC RAM during writes? For example, if I copy some data to my system and then verify that it is clean, will it be safe "forever" if I don't write over it? Or is it possible that during some ZFS maintenance or other reading of the data, RAM errors could be introduced into my data?


I have multiple backups of everything, but now I'm worried about corrupted files getting copied to all my backups and not being detected before it's too late.
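One way to catch a corrupted file before it propagates to every backup is to record a checksum for each file while it is known to be good, then re-check those checksums before each backup run. The following is only a minimal sketch of that idea, not a FreeNAS or ZFS feature; the function names `sha256_of`, `build_manifest` and `verify_manifest` are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large raw photos need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

Build the manifest right after checking that the photos open correctly, store it somewhere outside the photo folder, and only run a backup when `verify_manifest` returns an empty list. Note that this only detects change; it cannot tell corruption apart from an intentional edit, and the hashing itself still runs on whatever RAM the machine has.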
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There are 2 parts to this. If the data that is saved is valid, then it's valid... until there is an opportunity for it to be rewritten.

I believe that if you write data (i.e. another file is saved or updated), then you run the risk of corruption because the parity and checksums will be rewritten.

Additionally, if you do regular scrubs as recommended by ZFS's creators, then you run the risk of corrupting the entire pool: as everything is verified (and potentially updated if it looks incorrect), that bad RAM could make everything on the system bad because of that 1 flipped bit.

So it's really hard to say when your data is safe and when it's not. If you are that concerned about it... time to upgrade to ECC RAM.

I've had picture files that appeared corrupt because one program couldn't display them properly, but another could. So you may not have any corruption at all; it could be your application. It could also be that your file was corrupted before it was moved to the server. The whole problem with corruption since computers were invented has been identifying exactly "where" corruption is coming from.

And your fear of corrupting backups is completely normal and logical. The catch.. if you really want to be safe.. use ECC RAM and stop worrying about it. Several people have lost everything from bad RAM. I'd never build a FreeNAS server without ECC ever again.
 

berd1921

Dabbler
Joined
May 13, 2013
Messages
11
cyberjock said:
The whole problem with corruption since computers were invented has been identifying exactly "where" corruption is coming from.

I ran memtest for 13 hours with 0 errors, so I don't think my non-ECC RAM is responsible for corrupting three photos out of my last shoot of 3000. I believe they were corrupted somewhere after leaving the memory cards, because going back to the cards and recopying them to my PC fixed the issue.

I'm still sold on moving to ECC, but that is not my immediate problem (I don't think). My workflow is importing my Canon raw .cr2 photo files via a USB 3.0 card reader using Lightroom onto my FreeNAS system. If I've run memtest on my FreeNAS system, SMART tests on FreeNAS and chkflash on my memory cards, and everything is coming back healthy, any suggestions on what other hardware I should suspect?
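Since recopying from the cards fixed the three files, one way to catch this kind of problem at import time is to hash each file on the card, copy it, and hash the copy. This is a minimal sketch of that check and not part of Lightroom or FreeNAS; `file_digest` and `copy_and_verify` are hypothetical helper names, and the retry count is an arbitrary choice.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination, retries=2):
    """Copy source to destination and confirm both hash to the same value.

    Returns the digest on success; raises IOError if every attempt mismatches.
    """
    source, destination = Path(source), Path(destination)
    expected = file_digest(source)
    for _ in range(retries + 1):
        shutil.copyfile(source, destination)
        if file_digest(destination) == expected:
            return expected
    raise IOError(
        f"{destination} never matched {source} after {retries + 1} attempts"
    )
```

Two caveats: reading a file back immediately after writing it may be served from the operating system's cache rather than from the disk or the network share, and the hashes are computed in the same RAM that might be faulty. A mismatch is therefore a strong hint that something is wrong, while a match is weaker evidence that everything is fine.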
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
berd1921 said:
If I've run memtest on my FreeNAS system, SMART tests on FreeNAS and chkflash on my memory cards, and everything is coming back healthy, any suggestions on what other hardware I should suspect?

As I've said, it's darn near impossible to determine the actual location without testing everything you possibly can until you find the culprit. Remember, there are about 20 steps involved in moving data from an external USB device on a different machine to safe storage in your zpool. Any one of those steps could have been the culprit, and any one of them could have had a problem due to solar flares, EM interference, dirty power, hot temperatures that day, etc. For all we know the corruption is from your desktop and not the server, since the data did have to pass through your desktop to get to the FreeNAS server.

I'm not sure how many passes of memtest you got in 13 hours, but typically 3 is considered "good". It is documented that solar flares can cause bit flips in RAM. This is one of the selling points for ECC. With ECC, you typically don't have to worry about such things unless it's so bad that it's causing multiple bit flips (in which case I've seen the system crash with a BIOS message that the RAM had multiple bit errors), so it's obvious what is wrong.
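To illustrate why a single flipped bit is harmless with ECC while multiple flips are only detected, here is a toy single-error-correcting, double-error-detecting (SECDED) code. It is a sketch only: it protects 4 data bits with an extended Hamming code, whereas real ECC memory typically protects much wider words in hardware. The names `encode` and `decode` are illustrative.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) into an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; 1, 2, 4: Hamming parity bits
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the flipped position after a single flip in bits 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as one flip and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of `encode(n)` still decodes to `n` with status `'corrected'`, while flipping any two bits yields `'uncorrectable'`, which is the software analogue of the system halting with a multi-bit error message instead of silently passing bad data along.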
 

berd1921

Dabbler
Joined
May 13, 2013
Messages
11
Turns out it was the desktop RAM. It was Corsair, only four months old and not OC. It threw 20 errors in the first 30 seconds. It's nice to find the culprit!
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
I was in the process of composing a detailed message to ask about your DAM (digital asset management) workflow, thinking the problem might be in that process. I'm glad to see that you found the source of the problem.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
berd1921 said:
Turns out it was the desktop RAM. It was Corsair, only four months old and not OC. It threw 20 errors in the first 30 seconds. It's nice to find the culprit!

Wow. I guessed that one just as an example that the issue doesn't have to be your server. Impressive. Glad you fixed it though.
 