Raidz2 recovery?

Status
Not open for further replies.

Xavier

Cadet
Joined
Jan 31, 2014
Messages
4
I am a little puzzled, as I realize I don't fully understand ZFS. I have a RAID-Z2 pool made of 6 disks, and it just encountered some silent corruption. Only one disk has a few errors (2 files affected), so I assumed that FreeNAS would easily be able to correct these corruptions using the checksums and the other 5 disks. Where am I wrong?

If that helps, here is the output of zpool status -v

[root@freenas ~]# zpool status -v
  pool: volume1
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 32K in 16h8m with 2 errors on Sun Jan 12 16:08:17 2014
config:

NAME                                                STATE   READ WRITE CKSUM
volume1                                             ONLINE     0     0     2
  raidz2-0                                          ONLINE     0     0     4
    gptid/31ee726b-980b-11e2-8ff4-f46d0492b60e.eli  ONLINE     0     0     0
    gptid/3270619b-980b-11e2-8ff4-f46d0492b60e.eli  ONLINE     0     0     0
    gptid/32f07a47-980b-11e2-8ff4-f46d0492b60e.eli  ONLINE     0     0     0
    gptid/33714a62-980b-11e2-8ff4-f46d0492b60e.eli  ONLINE     0     0     0
    gptid/88682726-67a3-11e3-8b71-f46d0492b60e.eli  ONLINE     0     0     1
    gptid/6db4e21d-654e-11e3-ac49-f46d0492b60e.eli  ONLINE     0     0     0

errors: Permanent errors have been detected in the following files:

    volume1/multimedia@auto-20131216.0015-2y:/Movies/Somefile
    volume1/multimedia@auto-20131216.0015-2y:/Movies/Some other file

The particular corruptions don't seem to matter much (it's the backup that is reported as corrupted, right?), but I want to understand why FreeNAS can't simply correct the issue.

As a separate note, I am using non-ECC RAM (I built my server around a then-recommended Asus E35M-1I). How bad is that really, given that I have 2 disks of redundancy, run weekly snapshots and monthly scrubs, and have SMART monitoring turned on?
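
For readers who want to pull the non-zero CKSUM counters out of a status listing like the one in this post, here is a minimal Python sketch. It is only an illustration: the function name `checksum_errors` and the regular expression are assumptions made for this example, not part of any ZFS tooling, and it only handles plain integer counters (real output can abbreviate large counts, e.g. with a K suffix).

```python
import re

# A few rows copied from the `zpool status -v` output in this post.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
volume1 ONLINE 0 0 2
raidz2-0 ONLINE 0 0 4
gptid/31ee726b-980b-11e2-8ff4-f46d0492b60e.eli ONLINE 0 0 0
gptid/88682726-67a3-11e3-8b71-f46d0492b60e.eli ONLINE 0 0 1
"""

# name, state, then three integer counters (READ, WRITE, CKSUM).
# The header row does not match because its last columns are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def checksum_errors(status_text):
    """Return {name: cksum_count} for every row with a non-zero CKSUM column."""
    found = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # not a device row (header, blank line, prose)
        name, _state, _read, _write, cksum = match.groups()
        if int(cksum) > 0:
            found[name] = int(cksum)
    return found


if __name__ == "__main__":
    for name, count in checksum_errors(SAMPLE).items():
        print(f"{name}: {count} checksum error(s)")
```

Note that in the listing above the pool and the raidz2-0 vdev report checksum errors too, not just the single disk, which is worth keeping in mind when reading the rest of the thread.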
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Looks like some checksum errors.. Did you run a RAM test before building the system? Maybe the drives are getting flaky or something was weird on the NAS for a bit? Did you do something to the snapshots? Please post in code tags..

Code:
Permanent errors have been detected in the following files:

    volume1/multimedia@auto-20131216.0015-2y:/Movies/Somefile
    volume1/multimedia@auto-20131216.0015-2y:/Movies/Some other file


See ya,
 

Xavier

Cadet
Joined
Jan 31, 2014
Messages
4
I understand the error; I just don't understand why FreeNAS can't correct it. As far as I understand, only 1 disk has errors, and I have 2 disks of redundancy. Why doesn't FreeNAS use them to correct the issues? Or has it, and is it just reporting that it did?
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
Why doesn't FreeNAS use these to correct the issues?
Probably because there is something wrong:
Corrupted file, hardware failure...
You have to troubleshoot.
Check your file
Check your drive
Check your RAM
Check SATA cable
...
 

tio

Contributor
Joined
Oct 30, 2013
Messages
119
As a separate note, I am using non-ECC RAM (I built my server around a then-recommended Asus E35M-1I). How bad is that really, given that I have 2 disks of redundancy, run weekly snapshots and monthly scrubs, and have SMART monitoring turned on?

There's your issue.

FreeNAS needs to rely on the data it gets from RAM as 100% safe. ECC provides this; non-ECC does not, and you've now got corrupted files and a corrupted checksum, because the RAM has told FreeNAS the data and checksum are OK even though it's corrupted.
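
The failure mode described in this post can be illustrated with a toy model. The Python sketch below is not how ZFS is implemented: it uses a single XOR parity block instead of RAID-Z2's double parity and SHA-256 as a stand-in checksum, and the function names (`write_stripe`, `read_stripe`) and the `flip_in_ram` flag are hypothetical. It only shows the general idea that a block damaged on disk can be rebuilt from parity, whereas data altered in memory after the checksum was computed but before parity was generated is written out consistently wrong, so no combination of disks reproduces the expected checksum. It does not establish that this is what happened to the pool in this thread.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (toy single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(blocks):
    """Stand-in checksum over the whole stripe (SHA-256 chosen for the toy)."""
    return hashlib.sha256(b"".join(blocks)).hexdigest()


def write_stripe(blocks, flip_in_ram=False):
    """Return (on_disk_blocks, parity, stored_checksum)."""
    stored = checksum(blocks)  # checksum taken from the good data
    blocks = list(blocks)
    if flip_in_ram:
        # Hypothetical memory error: one bit flips AFTER the checksum
        # was computed but BEFORE parity is generated and written.
        blocks[0] = bytes([blocks[0][0] ^ 0x01]) + blocks[0][1:]
    parity = xor_blocks(blocks)
    return blocks, parity, stored


def read_stripe(disk, parity, stored):
    """Return 'ok', 'repaired', or 'permanent error'."""
    if checksum(disk) == stored:
        return "ok"
    # Assume each block in turn is the bad one and rebuild it from the
    # remaining blocks plus parity; accept the first rebuild that verifies.
    for i in range(len(disk)):
        others = disk[:i] + disk[i + 1:]
        rebuilt = xor_blocks(others + [parity])
        candidate = disk[:i] + [rebuilt] + disk[i + 1:]
        if checksum(candidate) == stored:
            return "repaired"
    return "permanent error"


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]

    disk, parity, stored = write_stripe(data)
    disk[1] = b"BxBB"  # corruption that happens on the disk itself
    print("on-disk corruption:", read_stripe(disk, parity, stored))

    disk, parity, stored = write_stripe(data, flip_in_ram=True)
    print("in-memory corruption:", read_stripe(disk, parity, stored))
```

In the second case every block on "disk" agrees with the parity, so there is nothing for the redundancy to vote against; the only thing that disagrees is the checksum.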
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
There's your issue.

FreeNAS needs to rely on the data it gets from RAM as 100% safe. ECC provides this; non-ECC does not, and you've now got corrupted files and a corrupted checksum, because the RAM has told FreeNAS the data and checksum are OK even though it's corrupted.
This is an assumption. But it's possible.
 

tio

Contributor
Joined
Oct 30, 2013
Messages
119
That's a fact: it can and does happen with non-ECC RAM. It's not an assumption.
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
This is what I would do: shut down FreeNAS. Check the memory (run memtest86+ for a long, long, long time) and replace it if necessary (actually, I would replace it anyway). Verify that you have a good backup.

When you are 100% sure that the RAM is good, run the SMART tests.

If the disks and memory check out, I would consider scrubbing the pool and afterwards restoring all known good files from backup.
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
That's a fact: it can and does happen with non-ECC RAM. It's not an assumption.
Xavier hasn't provided any result that clearly shows faulty RAM. That's why I say you are making assumptions.
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
Actually, Xavier is not providing any information about the status of his RAM, only that it is not ECC.

So I will naturally assume that it is very likely that some memory has gone bad. One thing I have learned is that all non-ECC memory must be considered broken until proven "not broken".
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
One thing I have learned is that all non-ECC memory must be considered broken until proven "not broken".
Agree with that.
So we need facts (= results), and not only on his RAM; SMART output would be helpful too.
 

tio

Contributor
Joined
Oct 30, 2013
Messages
119
Xavier hasn't provided any result that clearly shows faulty RAM. That's why I say you are making assumptions.
A single bad bit in RAM can cause an unrecoverable file. That doesn't mean his RAM is bad, only that it has no protection in the event of a bad bit. Even if one of his HDs were failing, ZFS would be able to recover as long as the checksum was valid. In this case it's not, and the only really plausible way that happens is with non-ECC hardware.

There's a checksum error; if you can explain how else it could have gotten there, I'm happy to read it.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Bad cables / bad RAM etc.. Pretty much summed it up.. I've had lots of checksum errors building non-ECC systems / testing incompatible hardware.. It's an absolute must to test non-ECC RAM, I'd say for at least 10 passes.. It might take a long time but it will hit every bit on that RAM stick a few times.. I would also use Memtest86+ (the new version 5.0.x) if possible.. Please confirm if snapshots were deleted etc..
 

tio

Contributor
Joined
Oct 30, 2013
Messages
119
Microserver N54L with a bracket kit like this to go in the optical drive bay. It also takes 16GB of DDR3 ECC Unbuffered RAM via 2 modules and won't waste power.

http://www.amazon.co.uk/NoiseBlocker-Anti-Vibrations-HDD-Mounting-NB-X-Swing/sim/B000S8B8J6/2

Flash the modified BIOS for the N54L to enable AHCI on all ports as well and wire them up.

I have the exact same setup. The CPU never goes beyond 80% even when it's at full throttle running god knows what. The exception is transcoding, which my box doesn't do.

Also, you don't need SATA 3 for spinning HDs; they never go beyond the bandwidth of SATA 2.
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
Don't worry about SATA3 with spinning rust. A SuperMicro is always a safe choice. :)
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
The one I've found is http://www.supermicro.com/products/motherboard/Xeon/QM77/X9SPV-M4.cfm, but only 4 of the 6 SATA are SATA3.
The X9SPVs (any of them) are super expensive. Yes, the price includes the CPU, but I would not pay ~700 USD for a mobo + CPU. The only two mITX boards that make sense for FreeNAS are probably:
http://www.asrock.com/server/overview.asp?Model=E3C224D2I
http://www.asrock.com/server/overview.asp?Model=E3C226D2I
I think some forum members are using them (or at least did experiment with them) so search the forum to find out more.
 

Xavier

Cadet
Joined
Jan 31, 2014
Messages
4
Thanks a ton. I indeed choked a bit when I found out about the SuperMicro price, which is basically more than my whole system (excl. drives)...
The E3C226D2I looks perfect!
 