strange replication issues between two FreeNAS-9.3-STABLE-201602031011 servers

Status: Not open for further replies.

merit
Cadet
Joined: Mar 3, 2016
Messages: 2
Hi all,

For a little while now we have been seeing strange issues with replication between two FreeNAS servers, both running FreeNAS-9.3-STABLE-201602031011. The debug.log on the sending server is full of messages like this, for all datasets:
nas# zcat -f debug.log.0.bz2 | less
Mar 3 00:00:01 nas autosnap.py: [tools.autosnap:71] Popen()ing: /sbin/zfs list -t snapshot -H -o name
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:233] Autosnap replication started
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:234] temp log file: /tmp/repl-43055
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:304] Checking dataset ourdata/civicrm
Mar 3 00:00:37 nas autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r "ourdata/civicrm"
Mar 3 00:00:37 nas autorepl.py: [tools.autorepl:329] ds = civicrm, remotefs = new
Mar 3 00:00:39 nas autorepl.py: [tools.autorepl:337] Unable to create remote dataset new: cannot create 'new/civicrm': dataset already exists
Mar 3 00:00:39 nas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 8346 remote.duckdns.org "zfs list -H -t snapshot -p -o name,creation -r 'new/civicrm'"
Mar 3 00:00:40 nas autorepl.py: [tools.autorepl:304] Checking dataset ourdata/dc2
Mar 3 00:00:40 nas autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r "ourdata/dc2"
Mar 3 00:00:42 nas autorepl.py: [tools.autorepl:329] ds = dc2, remotefs = new
Mar 3 00:00:43 nas autorepl.py: [tools.autorepl:337] Unable to create remote dataset new: cannot create 'new/dc2': dataset already exists
Mar 3 00:00:43 nas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 8346 remote.duckdns.org "zfs list -H -t snapshot -p -o name,creation -r 'new/dc2'"
Mar 3 00:00:45 nas autorepl.py: [tools.autorepl:304] Checking dataset ourdata/dc3
Mar 3 00:00:45 nas autorepl.py: [common.pipesubr:71] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r "ourdata/dc3"
Mar 3 00:00:47 nas autorepl.py: [tools.autorepl:329] ds = dc3, remotefs = new
Mar 3 00:00:48 nas autorepl.py: [tools.autorepl:337] Unable to create remote dataset new: cannot create 'new/dc3': dataset already exists
Mar 3 00:00:48 nas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 8346 remote.duckdns.org "zfs list -H -t snapshot -p -o name,creation -r 'new/dc3'"

This keeps happening all night long, over and over again, for all datasets.

We are also seeing very large amounts of data (400-500 GiB) being transferred between the two machines. So perhaps the data is transferred but somehow discarded, and then a new replication attempt is made.
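If it helps with diagnosis: comparing the snapshot lists on both sides by hand should show whether the two machines still share a common snapshot. Without a common snapshot an incremental send is impossible and a full send of the whole dataset is needed, which would match the 400-500 GiB transfers. Something along these lines, reusing the key, port, and dataset names from the log above (civicrm just as an example):

nas# zfs list -H -t snapshot -o name -r ourdata/civicrm
nas# ssh -i /data/ssh/replication -p 8346 remote.duckdns.org "zfs list -H -t snapshot -o name -r new/civicrm"

If the newest snapshot on the remote side no longer appears in the local list, that would explain the repeated full sends.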

About the replication config: "Recursively replicate child dataset's snapshots" and "Delete stale snapshots on remote system" are both checked.

Does anyone have an idea how to solve this? Replication between these machines has been working for more than a year; this started happening fairly recently.

On a related note: our debug.log / messages.log only go back a few days. Is there a reliable way to keep more logs? (I understand all changes should be made through the GUI, but I can't find where to adjust this.)
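As a stopgap we could probably copy the rotated logs to a dataset from a cron job before they are recycled, something along these lines (with /mnt/ourdata/logarchive just standing in for whichever dataset we would actually use):

nas# mkdir -p /mnt/ourdata/logarchive
nas# cp -p /var/log/debug.log.*.bz2 /var/log/messages.*.bz2 /mnt/ourdata/logarchive/

But a supported GUI setting would of course be preferable.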

Thanks
 

dlavigne

Guest
Replication has had a bit of a code upheaval lately. Does checking the "Delete stale snapshots on remote system" option solve the issue? If not, it is worth creating a bug report at bugs.freenas.org. If you do, post the issue number here.
 

merit
Cadet
Joined: Mar 3, 2016
Messages: 2
Hi Dru,

Thanks for the follow-up. It seems (perhaps only for now...) that a reboot of the sending server fixed things. We waited a few days to be sure, and we have had no replication failures for several days now.

Will post here again if the issue re-emerges.

Thanks!
 