Hi all,
I have a huge problem with replicating ZFS datasets to a remote server. The intention of this (long) post is twofold: to warn all ZFS replication users AND (hopefully) to get some help with what I could do.
Here are the specs first:
Two servers, both with AMD dual-core 64-bit CPUs
Both have 8GB RAM (non-ECC)
Both have 6x 2TB drives installed and running (RAIDZ2, resulting in around 6TB of usable disk space)
Both are located in different offices, connected via VPN.
The configuration is as follows:
On each server, there is a local dataset which is used locally as the normal working set (server 1: 4TB; server 2: 2TB). Each dataset has a snapshot task. The resulting snapshots are dumped incrementally to files, which are then transferred to the remote server. (For this, the autorepl.py script has been modified; the commands stayed the same, with the only difference that zfs send does not send directly over SSH but to a file first, and zfs receive does not read from an SSH stream but from the corresponding file.)
Code:
# zfs send writes the snapshot to a local file instead of piping it straight over SSH:
replcmd = '(/sbin/zfs send -V %s%s%s > /mnt/ZFSVol/replDS/local/snap_temp/%s && echo Succeeded.) > %s 2>&1' % (Rflag, snapname, limit, snapname, templog)

# ...and later, zfs receive reads from the transferred file instead of the SSH stream:
replcmd = '(%s -p %d %s "/sbin/zfs receive -F -d %s < /mnt/ZFSVol/replDS/remote/%s && echo Succeeded.") > %s 2>&1' % (sshcmd, remote_port, remote, remotefs, snapname, templog)
The files are transferred (depending on their size) via WAN and rsync, or via an external drive. This setup was chosen so that a broken transfer can be resumed if the WAN connection drops in between - but that capability has not been needed so far. Every file so far was transferred in one go (via SSH tunnel or with a transfer drive).
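To make the round trip concrete, here is a minimal sketch of how one cycle looks from the shell (dataset, snapshot, and host names are placeholders, and the checksum step is my own addition, not part of the modified autorepl.py):
Code:
# Sender: dump the incremental snapshot into a file instead of piping it over SSH
zfs send -i ZFSVol/localDS@snap-old ZFSVol/localDS@snap-new > /mnt/ZFSVol/replDS/local/snap_temp/snap-new

# Record a checksum so the file can be verified after the transfer
sha256 /mnt/ZFSVol/replDS/local/snap_temp/snap-new > snap-new.sha256

# Transfer with rsync; --partial keeps an interrupted file so the transfer can be resumed
rsync -av --partial -e "ssh -p 22" /mnt/ZFSVol/replDS/local/snap_temp/snap-new server2:/mnt/ZFSVol/replDS/remote/

# Receiver: verify the checksum against snap-new.sha256, then replay the stream
sha256 /mnt/ZFSVol/replDS/remote/snap-new
zfs receive -F -d ZFSVol/remoteDS < /mnt/ZFSVol/replDS/remote/snap-new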
The Problem:
At first glance, the replication works perfectly on both sides. ZFS does not report any errors, and the datasets and snapshots show up as expected.
BUT:
Taking a closer look at the replicated file system, one can see that the metadata is corrupted on some more or less top-level folders (deeper in the tree, there are no known errors so far).
Code:
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# ll
total 405
ls: ./.: No such file or directory
drwxrwx---  10 admin  admin_grp   10 Dec 31 00:41 ./
drwxrwx---+  4 admin  admin_grp    4 Dec 29 21:28 ../
drwxrwx---+  2 admin  admin_grp    7 Dec 30 00:27 BACKUP/
drwxrwx---+  4 admin  admin_grp    4 Dec 30 13:52 DVD Birthday/
drwxrwx---+ 40 admin  admin_grp   40 Dec 30 14:00 Private/
ls: ./Photos: No such file or directory
drwxrwx---  20 root   admin_grp   20 Dec 31 01:16 Photos/
drwxrwx---+  2 admin  admin_grp   11 Mar 23  2012 usw/
drwxrwx---+  6 admin  admin_grp   16 Sep 22  2012 maps/
drwxrwx---+  2 admin  admin_grp  116 Dec 24  2008 email/
drwxrwx---+  4 admin  admin_grp    4 Dec 31 00:31 safe/
What's this? The folder that ls is currently listing does not exist (ls: ./.: No such file or directory)?
If I try to change the owner of these folders (only two out of a couple of thousand), the behaviour is quite strange:
Code:
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# chown root Photos/
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# ll
total 405
ls: ./.: No such file or directory
drwxrwx---  10 admin  admin_grp   10 Dec 31 00:41 ./
drwxrwx---+  4 admin  admin_grp    4 Dec 29 21:28 ../
drwxrwx---+  2 admin  admin_grp    7 Dec 30 00:27 BACKUP/
drwxrwx---+  4 admin  admin_grp    4 Dec 30 13:52 DVD Birthday/
drwxrwx---+ 40 admin  admin_grp   40 Dec 30 14:00 Private/
ls: ./Photos: No such file or directory
drwxrwx---  20 root   admin_grp   20 Dec 31 01:16 Photos/
drwxrwx---+  2 admin  admin_grp   11 Mar 23  2012 usw/
drwxrwx---+  6 admin  admin_grp   16 Sep 22  2012 maps/
drwxrwx---+  2 admin  admin_grp  116 Dec 24  2008 email/
drwxrwx---+  4 admin  admin_grp    4 Dec 31 00:31 safe/
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# chown admin Photos/
chown: Photos/: No such file or directory
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# ll
total 405
ls: ./.: No such file or directory
drwxrwx---  10 admin  admin_grp   10 Dec 31 00:41 ./
drwxrwx---+  4 admin  admin_grp    4 Dec 29 21:28 ../
drwxrwx---+  2 admin  admin_grp    7 Dec 30 00:27 BACKUP/
drwxrwx---+  4 admin  admin_grp    4 Dec 30 13:52 DVD Birthday/
drwxrwx---+ 40 admin  admin_grp   40 Dec 30 14:00 Private/
ls: ./Photos: No such file or directory
drwxrwx---  20 admin  admin_grp   20 Dec 31 01:16 Photos/
drwxrwx---+  2 admin  admin_grp   11 Mar 23  2012 usw/
drwxrwx---+  6 admin  admin_grp   16 Sep 22  2012 maps/
drwxrwx---+  2 admin  admin_grp  116 Dec 24  2008 email/
drwxrwx---+  4 admin  admin_grp    4 Dec 31 00:31 safe/
Hmm, OK. The first chown worked without a problem; the second one complains, but works as well.
Likewise, accessing all the data in the folders is no problem:
Code:
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry# cd Photos/
[root@server1] /mnt/ZFSVol/remoteDS/localDS/ultrasafeDS/Harry/Photos# ll
total 714
ls: ./.: No such file or directory
drwxrwx---  20 admin  admin_grp  20 Dec 31 01:16 ./
ls: ./..: No such file or directory
drwxrwx---  10 admin  admin_grp  10 Dec 31 00:41 ../
drwxrwx---+  2 admin  admin_grp  50 Mar 30  2012 2009-07-04 Birthday/
(.......about 50 folders, all ok....)
OK, I can cd into this folder even though it supposedly does not exist - and all of the subdirectories and data are there.
So, everything would more or less be OK for the users (with the drawback that these folders cannot be accessed by any non-root user, since the permission information is missing). And the backup is still there and can be used in case of emergency.
BUT (and here comes a warning to all ZFS-replication-users):
This ZFS volume is corrupted!! Although ZFS does NOT notice it (neither a zfs scrub nor SMART reports any errors), working with this data can lead to a COMPLETE loss of the volume!! Last week, I tried to delete one of these corrupted folders, which led to an immediate reboot of the server (tons of dump information on the screen, then reboot). From then on, the server was not bootable anymore (every time FreeNAS tried to mount the ZFS volume, it crashed). Only detaching the volume made FreeNAS come up again, but when I tried to import the ZFS volume again, FreeNAS crashed immediately. So I had 100% data loss without any prior warning from FreeNAS or ZFS!
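If this ever happens to me again, the next thing I plan to try (untested on my side, so take it only as a sketch; ZFSVol is my pool name) is importing the pool read-only from a shell, so that nothing destructive can be replayed while salvaging the data:
Code:
# Import the damaged pool read-only under an alternate root; a read-only
# import avoids any writes to the pool while data is copied off
zpool import -o readonly=on -R /mnt ZFSVol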
OK, I can already hear you saying: this is for sure a RAM issue.
But let me state this:
- This behaviour occurred twice, on two different machines
- Both machines were tested with memtest86 for more than 24 hours (about 10 passes) without any errors
- As far as I understand ZFS, a RAM problem would cause quite some damage to the stored data. Now consider that I copied almost 3TB of data to the machines without a SINGLE bit error (verified after the copy with Beyond Compare, binary compare). So the chance that a RAM issue affects not just a single bit out of 24 trillion, but only and ONLY the metadata of two folders - and that on two different machines independently - is, in my understanding, low to zero.
- ZFS has never reported any error. As far as I understand, a scrub would try to repair all the data and report errors if wrong data had been written from RAM - this has never happened, and the data was still there (until FreeNAS crashed completely). See the commands right after this list for what I mean by these checks.
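For completeness, these are the checks I ran (pool and device names are my own; yours may differ):
Code:
# Trigger a scrub, then look for checksum or read/write errors on the pool
zpool scrub ZFSVol
zpool status -v ZFSVol

# Check the SMART state of each member disk (ada0 .. ada5 on my systems)
smartctl -a /dev/ada0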
So, do you think this could still be a RAM issue?
So again:
My first intention is to warn anyone who is using replication: regardless of where this error comes from, it can destroy all your data in a single moment, which is quite hard to take when you put your trust in ZFS.
My second intention is: finding the error! Does anyone have an idea where this error could come from? Could it be that "zfs receive" corrupts a dataset when it is interrupted during replication (because of a broken stream, or because of an "invalid stream" error caused by an incomplete source file)? Why does ZFS NOT detect these obvious file system errors during a scrub? Are there more ways to check the integrity of a file system (and to repair it)?
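One safeguard I am considering (just a sketch; the file name is a placeholder, and I have not yet verified that this catches every kind of truncated stream) is to run each stream file through zstreamdump before handing it to zfs receive:
Code:
# zstreamdump parses the send stream and verifies its record checksums, so a
# truncated or damaged file should fail here instead of reaching the pool
zstreamdump < /mnt/ZFSVol/replDS/remote/snap-new > /dev/null && \
    zfs receive -F -d ZFSVol/remoteDS < /mnt/ZFSVol/replDS/remote/snap-new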
Phew. I hope someone will read this (at least the important parts) and can help me in any way, despite the length of this post. I just tried to make the situation as clear as possible.
Best regards,
Oliver