Hi all,
I have a huge problem with replicating ZFS datasets to a remote server. The intention of this (long) post is twofold: to warn all ZFS replication users AND (hopefully) to get some help with what I could do.
Here are the specs first:
Two servers, both with AMD dual-core 64-bit CPUs
Both have 8GB RAM (non-ECC)
Both have 6x 2TB drives installed and running (RAIDZ2, resulting in around 6TB of usable disk space)
Both are located in different offices, connected via VPN.
The configuration is as follows:
On each server, there is a local dataset which is used locally as the normal working set (server 1: 4TB; server 2: 2TB). Each dataset has a snapshot task. The resulting snapshots are dumped incrementally to files, which are then transferred to the remote server. (For this, the autorepl.py script has been modified; the commands stayed the same, with the only difference that zfs send does not send directly over SSH but to a file first, and zfs receive does not read from an SSH stream but from the corresponding file.)
Code:
# zfs send writes the snapshot to a local file instead of piping it straight over SSH:
replcmd = '(/sbin/zfs send -V %s%s%s > /mnt/ZFSVol/replDS/local/snap_temp/%s && echo Succeeded.) > %s 2>&1' % (Rflag, snapname, limit, snapname, templog)

# ...and later, zfs receive reads from the transferred file instead of the SSH stream:
replcmd = '(%s -p %d %s "/sbin/zfs receive -F -d %s < /mnt/ZFSVol/replDS/remote/%s && echo Succeeded.") > %s 2>&1' % (sshcmd, remote_port, remote, remotefs, snapname, templog)
The files are transferred (depending on their size) via WAN and rsync, or via an external drive. This setup was chosen so that a broken transfer can be resumed if the WAN connection drops in between - but that capability has not been needed so far. Every file so far was transferred in one go (via SSH tunnel or with a transfer drive).
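To make the round trip concrete, here is a minimal sketch of how one cycle looks from the shell (dataset, snapshot, and host names are placeholders, and the checksum step is my own addition, not part of the modified autorepl.py):
Code:
# Sender: dump the incremental snapshot into a file instead of piping it over SSH
zfs send -i ZFSVol/localDS@snap-old ZFSVol/localDS@snap-new > /mnt/ZFSVol/replDS/local/snap_temp/snap-new

# Record a checksum so the file can be verified after the transfer
sha256 /mnt/ZFSVol/replDS/local/snap_temp/snap-new > snap-new.sha256

# Transfer with rsync; --partial keeps an interrupted file so the transfer can be resumed
rsync -av --partial -e "ssh -p 22" /mnt/ZFSVol/replDS/local/snap_temp/snap-new server2:/mnt/ZFSVol/replDS/remote/

# Receiver: verify the checksum against snap-new.sha256, then replay the stream
sha256 /mnt/ZFSVol/replDS/remote/snap-new
zfs receive -F -d ZFSVol/remoteDS < /mnt/ZFSVol/replDS/remote/snap-new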
The Problem:
At first glance, the replication works perfectly on both sides. ZFS does not report any errors, and the datasets and snapshots show up as expected.
BUT:
Taking a closer look at the replicated file system, one can see that the metadata is corrupted on some more or less top-level folders (deeper in the tree, there are no known errors so far).
Code:
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# ll
total 405
ls: ./.: No such file or directory
drwxrwx---  10 admin  admin_grp   10 Dec 31 00:41 ./
drwxrwx---+  4 admin  admin_grp    4 Dec 29 21:28 ../
drwxrwx---+  2 admin  admin_grp    7 Dec 30 00:27 BACKUP/
drwxrwx---+  4 admin  admin_grp    4 Dec 30 13:52 DVD Birthday/
drwxrwx---+ 40 admin  admin_grp   40 Dec 30 14:00 Private/
ls: ./Photos: No such file or directory
drwxrwx---  20 root   admin_grp   20 Dec 31 01:16 Photos/
drwxrwx---+  2 admin  admin_grp   11 Mar 23  2012 usw/
drwxrwx---+  6 admin  admin_grp   16 Sep 22  2012 maps/
drwxrwx---+  2 admin  admin_grp  116 Dec 24  2008 email/
drwxrwx---+  4 admin  admin_grp    4 Dec 31 00:31 safe/
What's this? The folder that ls is currently listing does not exist (ls: ./.: No such file or directory)?
If I try to change the owner of these folders (only two out of a couple of thousand), the behaviour is quite strange:
Code:
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# chown root Photos/
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# ll
total 405
ls: ./.: No such file or directory
drwxrwx---  10 admin  admin_grp   10 Dec 31 00:41 ./
drwxrwx---+  4 admin  admin_grp    4 Dec 29 21:28 ../
drwxrwx---+  2 admin  admin_grp    7 Dec 30 00:27 BACKUP/
drwxrwx---+  4 admin  admin_grp    4 Dec 30 13:52 DVD Birthday/
drwxrwx---+ 40 admin  admin_grp   40 Dec 30 14:00 Private/
ls: ./Photos: No such file or directory
drwxrwx---  20 root   admin_grp   20 Dec 31 01:16 Photos/
drwxrwx---+  2 admin  admin_grp   11 Mar 23  2012 usw/
drwxrwx---+  6 admin  admin_grp   16 Sep 22  2012 maps/
drwxrwx---+  2 admin  admin_grp  116 Dec 24  2008 email/
drwxrwx---+  4 admin  admin_grp    4 Dec 31 00:31 safe/
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# chown admin Photos/
chown: Photos/: No such file or directory
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# ll
total 405
ls: ./.: No such file or directory
drwxrwx---  10 admin  admin_grp   10 Dec 31 00:41 ./
drwxrwx---+  4 admin  admin_grp    4 Dec 29 21:28 ../
drwxrwx---+  2 admin  admin_grp    7 Dec 30 00:27 BACKUP/
drwxrwx---+  4 admin  admin_grp    4 Dec 30 13:52 DVD Birthday/
drwxrwx---+ 40 admin  admin_grp   40 Dec 30 14:00 Private/
ls: ./Photos: No such file or directory
drwxrwx---  20 admin  admin_grp   20 Dec 31 01:16 Photos/
drwxrwx---+  2 admin  admin_grp   11 Mar 23  2012 usw/
drwxrwx---+  6 admin  admin_grp   16 Sep 22  2012 maps/
drwxrwx---+  2 admin  admin_grp  116 Dec 24  2008 email/
drwxrwx---+  4 admin  admin_grp    4 Dec 31 00:31 safe/
Hmm, OK. The first chown worked without a problem; the second one complains, but works as well.
Likewise, accessing all the data in the folders is no problem:
Code:
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# cd Photos/
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry/Photos# ll
total 714
ls: ./.: No such file or directory
drwxrwx---  20 admin  admin_grp  20 Dec 31 01:16 ./
ls: ./..: No such file or directory
drwxrwx---  10 admin  admin_grp  10 Dec 31 00:41 ../
drwxrwx---+  2 admin  admin_grp  50 Mar 30  2012 2009-07-04 Birthday/
(.......about 50 folders, all ok....)
OK, I can cd into this folder even though it supposedly does not exist - and all of the subdirectories and data are there.
So, everything would more or less be OK for the users (with the drawback that these folders cannot be accessed by any non-root user, since the permission information is missing). And the backup is still there and can be used in case of emergency.
BUT (and here comes a warning to all ZFS-replication-users):
This ZFS volume is corrupted!! Although ZFS does NOT notice it (neither a zfs scrub nor SMART reports any errors), working with this data can lead to a COMPLETE loss of the volume!! Last week, I tried to delete one of these corrupted folders, which led to an immediate reboot of the server (tons of dump information on the screen, then reboot). From then on, the server was not bootable anymore (every time FreeNAS tried to mount the ZFS volume, it crashed). Only detaching the volume made FreeNAS come up again, but when I tried to import the ZFS volume again, FreeNAS crashed immediately. So I had 100% data loss without any prior warning from FreeNAS or ZFS!
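If this ever happens to me again, the next thing I plan to try (untested on my side, so take it only as a sketch; ZFSVol is my pool name) is importing the pool read-only from a shell, so that nothing destructive can be replayed while salvaging the data:
Code:
# Import the damaged pool read-only under an alternate root; a read-only
# import avoids any writes to the pool while data is copied off
zpool import -o readonly=on -R /mnt ZFSVol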
OK, I can already hear you saying: this is for sure a RAM issue.
But let me state this:
- This behaviour occurred twice, on two different machines
- Both machines were tested with memtest86 for more than 24 hours (about 10 passes) without any errors
- As far as I understand ZFS, a RAM problem would cause quite some damage to the stored data. Now consider that I copied almost 3TB of data to the machines without a SINGLE bit error (verified after the copy with Beyond Compare, binary compare). So the chance that a RAM issue affects not just a single bit out of 24 trillion, but only and ONLY the metadata of two folders - and that on two different machines independently - is, in my understanding, low to zero.
- ZFS has never reported any error. As far as I understand, a scrub would try to repair all the data and report errors if wrong data had been written from RAM - this has never happened, and the data was still there (until FreeNAS crashed completely). See the commands right after this list for what I mean by these checks.
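For completeness, these are the checks I ran (pool and device names are my own; yours may differ):
Code:
# Trigger a scrub, then look for checksum or read/write errors on the pool
zpool scrub ZFSVol
zpool status -v ZFSVol

# Check the SMART state of each member disk (ada0 .. ada5 on my systems)
smartctl -a /dev/ada0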
So, do you think this could still be a RAM issue?
So again:
My first intention is to warn anyone who is using replication: regardless of where this error comes from, it can destroy all your data in a single moment, which is quite hard to take when you put your trust in ZFS.
My second intention is: finding the error! Does anyone have an idea where this error could come from? Could it be that "zfs receive" corrupts a dataset when it is interrupted during replication (because of a broken stream, or because of an "invalid stream" error caused by an incomplete source file)? Why does ZFS NOT detect these obvious file system errors during a scrub? Are there more ways to check the integrity of a file system (and to repair it)?
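One safeguard I am considering (just a sketch; the file name is a placeholder, and I have not yet verified that this catches every kind of truncated stream) is to run each stream file through zstreamdump before handing it to zfs receive:
Code:
# zstreamdump parses the send stream and verifies its record checksums, so a
# truncated or damaged file should fail here instead of reaching the pool
zstreamdump < /mnt/ZFSVol/replDS/remote/snap-new > /dev/null && \
    zfs receive -F -d ZFSVol/remoteDS < /mnt/ZFSVol/replDS/remote/snap-new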
Phew. I hope someone will read this (at least the important parts) and can help me in any way, despite the length of this post. I just tried to make the situation as clear as possible.
Best regards,
Oliver