Replication


jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
I am using two newly built FreeNAS servers running version FreeNAS-9.3-STABLE-201508250051, which is the latest version as of this post.

I have set up replication, but I am getting the following error in the status screen of the replication task on PUSH: Failed: zpool1/system (auto-20150827.0701-2w)

I can send the snapshot from PUSH via the command line with no issues using:
Code:
zfs send zpool1/system@auto-20150827.0701-2w | ssh -p 45 -i /data/ssh/replication 10.0.1.5 zfs receive -F zpool1/replication/system@auto-20150827.0701-2w
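Since the full send works by hand, one thing that might be worth checking next is the incremental path, something along these lines (the older snapshot name below is only a placeholder for an earlier snapshot in the same 2-weekly schedule, not one of my real snapshots):
Code:
# hedged sketch only: incremental send using the same key, port and target as
# the full send above; the "from" snapshot name is a placeholder
zfs send -i zpool1/system@auto-20150813.0701-2w zpool1/system@auto-20150827.0701-2w | \
  ssh -p 45 -i /data/ssh/replication 10.0.1.5 \
  zfs receive -F zpool1/replication/system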

/var/log/auth.log on PULL shows the following from the automated replication attempt:
Code:
Aug 27 09:10:02 backnas sshd[7617]: Accepted publickey for root from 10.0.1.3 port 19192 ssh2: RSA 0c:b8:26:20:ab:e5:60:cc:b6:42:d5:8c:8d:e6:5f:de
Aug 27 09:10:02 backnas sshd[7617]: Received disconnect from 10.0.1.3: 11: disconnected by user
Aug 27 09:10:02 backnas sshd[7622]: Accepted publickey for root from 10.0.1.3 port 13578 ssh2: RSA 0c:b8:26:20:ab:e5:60:cc:b6:42:d5:8c:8d:e6:5f:de
Aug 27 09:10:02 backnas sshd[7622]: Received disconnect from 10.0.1.3: 11: disconnected by user

/var/log/messages on PUSH reflects:
Code:
Aug 27 09:10:02 PrimeNAS autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o Connect$

Now for the strange part: I can create a dataset on PULL named "system" within the "replication" dataset and, using the Initialize checkbox on the replication task, it will succeed once. When the next snapshot is created it goes back to failing. After the initial successful snapshot transfer, I end up with the dataset "zpool1/replication/system/system" on PULL, even though the remote target is set to "zpool1/replication".
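For what it's worth, my working theory is that the extra level comes from how the receive target is built on PULL: the source dataset name seems to get appended to whatever remote path the task computes, so pre-creating "system" leaves me with "system/system". Purely for illustration, this is how the target name behaves with plain zfs receive (generic placeholder names, not my actual commands):
Code:
# exact target: the received filesystem is created at exactly this path
zfs send zpool1/system@snap | ssh PULL zfs receive -F zpool1/replication/system
# -d: everything after the source pool name is appended to the target, so
# zpool1/system@snap also ends up as zpool1/replication/system
zfs send zpool1/system@snap | ssh PULL zfs receive -F -d zpool1/replication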

Any ideas on where to go from here?
 

dlavigne

Guest
There's already this bug regarding the new replication code: https://bugs.freenas.org/issues/11130. If it does not cover your scenario, please create another report using System -> Support, as that will include your config and debug info. Note that you first need an account at bugs.freenas.org to use that tab. If you create a report, post the issue number here.
 

adrianwi

Guru
Joined
Oct 15, 2013
Messages
1,231
I'm having very similar problems, and whilst it might be related to the bug report listed, I'm not sure. I'm going to raise another one once I've pulled all the information and screenshots together.
 

gunnarsson

Cadet
Joined
Sep 22, 2014
Messages
9
If it hasn't already happened, I would say update 9.3-STABLE-201508250051 should be pulled before it causes problems for more people.

Last night my boxes replicated a month-old snapshot that already existed on both of them and was just minutes from expiring, so while it was being replicated the host was unsuccessfully trying to destroy it. I didn't want to interrupt the process in case that only made matters worse, so I left it running.

This morning once I got up, I rolled back to the previous boot environment, but it appears that the datasets that had been "successfully replicated" with 9.3-STABLE-201508250051 have gone bad and the replicator isn't able to recover on its own. It will repeat the following endlessly - take note of the particularly interesting "failed with Succeeded":
Code:
Aug 29 10:48:36 haugur1 autorepl.py: [tools.autorepl:433] Replication of vol1/data/misc@auto-20150828.1025-2m failed with Succeeded
Aug 29 10:49:07 haugur1 autorepl.py: [common.pipesubr:70] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -l root -p 59687 hurd-h2.dyndns.o$
Aug 29 10:50:38 haugur1 autorepl.py: [common.pipesubr:70] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -l root -p 59687 hurd-h2.dyndns.o$
Aug 29 10:50:44 haugur1 autorepl.py: [tools.autorepl:416] Remote and local mismatch after replication: vol1/data/misc: local=auto-20150828.1025-2m vs remote=auto-20150829.0425-2m
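In case it is useful to anyone, the snapshot state on the two sides can be compared by hand along these lines ("PULL" and the remote dataset path below are placeholders for your own backup host and replication target):
Code:
# newest snapshot of the dataset on the local (PUSH) side
zfs list -H -t snapshot -o name -d 1 -r vol1/data/misc | tail -1
# the same listing on the remote (PULL) side, much as autorepl does it
ssh -i /data/ssh/replication -p 59687 root@PULL "zfs list -H -t snapshot -o name -d 1 -r <remote-dataset>" | tail -1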
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
I have to echo this position. I ended up deleting ALL snapshots on PUSH to try to get a clean replication process running, before resigning myself to the fact that there are newly introduced bugs in the replication code...
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
The new replication code seems to be running my pre-existing replication tasks (but not updating the GUI). But if I try to start a new replication task it does not work in the current update.
 

Dave Genton

Contributor
Joined
Feb 27, 2014
Messages
133
The new replication code seems to be running my pre-existing replication tasks (but not updating the GUI). But if I try to start a new replication task it does not work in the current update.

Same for me. After updating to 9.3.1 I ended up deleting all snapshots. The same replication I have had running for over a year is no longer working, and it destroyed a lot, for what appear to me to be bugs in the code. Running the CLI commands for testing reachability, etc. per the manual completes successfully, but replication has failed since 9.3.1. I was just logging in to test further and see if I could get it working today, checked the forums first, and am seeing posts that others are having issues as well. Hope this is the last 9.3.1 issue I face after this rough week...
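By the CLI tests I mean roughly the following, run from PUSH with the replication key (the PULL address is a placeholder):
Code:
# verbose SSH test from PUSH using the replication key
ssh -vv -i /data/ssh/replication root@<PULL-IP>
# quick check that zfs answers on the far end
ssh -i /data/ssh/replication root@<PULL-IP> "zfs list -H -o name" | head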
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
Just upgraded my "sandbox" to the latest build on the stable train, and it seems to have corrected the replication errors I was experiencing. I will run it in testing for a couple of days, then apply it to production and update this forum.
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
So far so good. The production box is now replicating properly. I have noticed that the replication task on PUSH no longer shows the last snapshot sent at any point in time...
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
None of my snapshots have aged to the point of being stale yet. I had deleted all snapshots off of PUSH while trying to get replication working again. The shortest lifetime on any of my snapshots is 2 weeks, so I will know on the 18th if it is working...
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
None of my snapshots have aged to the point of being stale yet. I had deleted all snapshots off of PUSH while trying to get replication working again. The shortest lifetime on any of my snapshots is 2 weeks, so I will know on the 18th if it is working...
Thanks. I was worried whether and why my one day lifetime snapshots were being culled, so I set up a one hour lifetime task (on a very sparse dataset!) and replicated that. That shows the problem quickly, if it is present.

Edit: BTW, I don't think not displaying the last snapshot is just cosmetic; I no longer get non-fatal error messages from PULL, and at least at one point I was getting "Succeeded" in the GUI even when a snapshot had failed to replicate.
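The check itself is nothing clever: list the test dataset's snapshots with their creation times on both boxes and see whether the expired ones disappear (the dataset name below is a placeholder for the sparse test dataset):
Code:
# list snapshots of the test dataset with creation times (placeholder name)
zfs list -H -t snapshot -o name,creation -d 1 -r <pool>/<test-dataset>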
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
I may now have a bigger issue. I am seeing the replication tasks run every minute. The most frequent automated snapshot runs every 4 hours, and all of the rest are 12 or 24 hours. At first I thought it was just processing through all of the stored snapshots, but now I have the same number of snapshots on PUSH and on PULL, and PUSH is still trying to send a snapshot every minute. This is impacting server performance, as it is trying EVERY replication task (there are about 15 of them) every minute. Anybody else seeing this? I noticed it in /var/log/messages.
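For anyone wanting to look for the same thing, this is all I am doing to watch it on PUSH:
Code:
# watch the replication script activity as it happens
tail -f /var/log/messages | grep autorepl
# or get a rough count of how many autorepl lines have been logged so far
grep -c autorepl /var/log/messages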
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
Two upgrades to 9.3-STABLE since the previous posts, and I am now on build FreeNAS-9.3-STABLE-201509220011. I decided to start with a completely clean test, so I created a ZFS dataset on PUSH called sandtest. I placed two small .txt files in that dataset and created an automated snapshot task that creates a snapshot once per day with a lifetime of 2 days. After it created the snapshot, I created a replication task on PUSH to send the snapshots to PULL. The task completed successfully the first time it ran, at 4:05:08 AM. The replication task then proceeds to run again every minute from then on, even though no more snapshots are being created. See the message log below.

Code:
Sep 22 04:01:58 sandnas notifier: Starting collectd.
Sep 22 04:02:05 sandnas autosnap.py: [tools.autosnap:71] Popen()ing: /sbin/zfs snapshot "zpool1/sandtest@auto-20150922.0402-2d"
Sep 22 04:04:12 sandnas notifier: Performing sanity check on sshd configuration.
Sep 22 04:05:06 sandnas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 45 10.0.1.5 "zfs list -H -t snapshot -p -o name,creation -d 1 -r '/zpool1/sandreplication/sandtest'"
Sep 22 04:05:56 sandnas notifier: Performing sanity check on sshd configuration.
Sep 22 04:06:08 sandnas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 45 10.0.1.5 "zfs list -H -t snapshot -p -o name,creation -d 1 -r 'zpool1/sandreplication/sandtest'"
Sep 22 04:07:06 sandnas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 45 10.0.1.5 "zfs list -H -t snapshot -p -o name,creation -d 1 -r 'zpool1/sandreplication/sandtest'"
Sep 22 04:08:06 sandnas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 45 10.0.1.5 "zfs list -H -t snapshot -p -o name,creation -d 1 -r 'zpool1/sandreplication/sandtest'"
Sep 22 04:09:06 sandnas autorepl.py: [common.pipesubr:71] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 45 10.0.1.5 "zfs list -H -t snapshot -p -o name,creation -d 1 -r 'zpool1/sandreplication/sandtest'"

I have found that when I create all 15 of the replication tasks, and each is trying to run every minute, I get periodic timeouts from other remote tasks trying to access PUSH.

Could there be a configuration issue I am missing?
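In the meantime, a rough way to tell whether these per-minute runs are only the remote snapshot listing rather than actual re-sends is to count the two kinds of commands in the log (this assumes a real send would also show up as a logged Popen line):
Code:
# rough check, assuming any real send would also be logged by autorepl
grep autorepl /var/log/messages | grep -c "zfs list"
grep autorepl /var/log/messages | grep -c "zfs send"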
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
I've got nine replication tasks that do the same thing every minute. It does not seem to put any significant load on either PUSH or PULL, in terms of CPU, memory, or network. There are the usual 20% CPU spikes every minute which occur with reporting, but they might be a little bigger because of the replication tasks. And these are not very powerful computers. Could there be something else causing your timeouts? Does the setting for stream compression or SSH encryption make any difference?
 

jerryjharrison

Explorer
Joined
Jan 15, 2014
Messages
99
Great point. Neither PUSH nor PULL is hitting any longer-term problems with memory, CPU, or network. I am just seeing random failures to connect from remote machines on AFP shares, which I had never experienced before replication started trying to run 20 or 25 times a minute. I associated the two because I had not changed the configuration of the machines after the clean install.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Let us know if you find out more. It would be interesting if anyone reading this has noticed any similar problem after updating.
 