After successful replication - shutting down pull computer generates error message

Status
Not open for further replies.

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Is it a bug that an error message is generated when the pull computer is not found (shut down), even after all replication requirements have been met successfully, or is this working as designed?

Currently running: 9.3-STABLE-201509160044

My replication procedure has been weekly: my backup box turns on at a pre-determined time, the main box sends a snapshot to it, and then it shuts down. This process had been working great for the past year or so until the August update. Now, when the backup box is shut down (after a successful replication), every 60 seconds I receive the following email error report generated by autorepl.py:

Hello,

The replication failed for the local ZFS xx/xx while attempting to

send snapshot auto-xxxx to xxx.xxx.xxx.xxx


When I restart the backup box, the message stops.

I deleted all 52 snapshots on the backup box and let replication re-create them (took a while). That did not help.

Created a new boot USB (had to anyhow; was getting low on space). That did not help either.

As a workaround, I set the replication window to only 1 minute; it was previously all day. Now, on the 6 out of 7 days when the backup server is shut down, I get one failure notice a day.
 

dlavigne

Guest
Is the receiving end down when snapshots are being pushed to it? That message sounds like it is.
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
No, the receiving end is left up until the snapshots are all replicated and completed (just one snapshot is replicated each run).

As a test, I deleted all snapshots on the pull side and let them all rebuild, 52 snapshots in total. When all were complete, I checked to make sure every snapshot was there, manually shut the box down, and the error messages started.
 

dlavigne

Guest
What is the schedule for taking snapshots, and what is the schedule for replication?
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Snapshot set to -
Begin: 00:00:00
End: 00:15:00

Replication set to -
Begin: 07:00:00
End: 07:01:00
It was set to Begin: 00:00:00, End: 23:59:00; I changed it to a 1-minute window at 7 AM to stop the error emails arriving every 60 seconds.

A cron job shuts the backup box down at 18:00.
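As I understand it, replication (and thus the failure email) can only fire while the time of day falls inside the task's Begin/End window, which is why shrinking the window limits the emails. A minimal sketch of that kind of window check (the function name and the check itself are just my illustration, not actual FreeNAS code):

```python
from datetime import time

def in_window(now, begin, end):
    """Return True if time-of-day `now` falls inside the [begin, end] window."""
    return begin <= now <= end

# With my settings, attempts can only fire during the 07:00-07:01 window:
in_window(time(7, 0, 30), time(7, 0), time(7, 1))   # True
in_window(time(18, 0), time(7, 0), time(7, 1))      # False
```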
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Were you able to figure this out?
No - Sorry about the slow reply.

I have updated to the latest : FreeNAS-9.3-STABLE-201509282017

The error message email has changed to: "Replication nas/ds -> 192.168.xxx.xxx:nas failed: Failed: ssh: connect to host 192.168.xxx.xxx port 22: Operation timed out"

autorepl.py kicks off once every 60 seconds during the Begin and End period of the Replication Task settings, which is where the message is generated. I have the window set to 1 minute each day to limit the error message emails.
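What I would want is for the script to check whether the pull box is even reachable before attempting to send, and skip quietly if it is down. A sketch of that idea (purely my own guard, not something autorepl.py actually does):

```python
import socket

def ssh_port_open(host, port=22, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# If the pull box is shut down, skip the run without emailing an alert:
# if not ssh_port_open("192.168.xxx.xxx"):
#     continue
```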

Am I the only one having this issue?

Thanks for checking.
 

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
No - Sorry about the slow reply.
The error message email has changed to: "Replication nas/ds -> 192.168.xxx.xxx:nas failed: Failed: ssh: connect to host 192.168.xxx.xxx port 22: Operation timed out"

Am I the only one having this issue?

Nope, you're not. Since a few updates ago we've been receiving dozens of emails from our FreeNAS systems every day. Our replication server is online 24/7, but the WAN link is terribly flaky and there's nothing we can do about that.

Have a look at #11550 and #9315, which are related to your problem. I hope that both get fixed at some point, because at the moment we're simply ignoring the FreeNAS emails, which defeats their purpose.
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Nope, you're not. Since a few updates ago we've been receiving dozens of emails from our FreeNAS systems every day. Our replication server is online 24/7, but the WAN link is terribly flaky and there's nothing we can do about that.

Have a look at #11550 and #9315, which are related to your problem. I hope that both get fixed at some point, because at the moment we're simply ignoring the FreeNAS emails, which defeats their purpose.

mpfusion - #11550 looks like you are having the same issue when the WAN link is not available. It appears the server pops the same error email when your WAN connection flakes out, even though all of the datasets have completed replication without error. Essentially the same issue as mine. Thanks for noting it; I missed that bug report.

Bug #9315: I would still want to know if a replication has failed, but I agree it would be best if the email did not fire every 60 seconds to report the same issue.
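The behaviour I'd hope for could be as simple as remembering the last failure per task and only mailing when the message changes (again, just my sketch of the idea, not how autorepl.py works):

```python
def should_alert(last_seen, task_id, message):
    """Alert only when a task's failure message differs from the last one sent."""
    if last_seen.get(task_id) == message:
        return False  # same failure as before: stay quiet
    last_seen[task_id] = message
    return True

state = {}
should_alert(state, "nas/ds", "ssh: connect timed out")  # True, first report
should_alert(state, "nas/ds", "ssh: connect timed out")  # False, repeat
```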

Thank You
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
I am getting the same error, but on a LAN setup. Both PUSH and PULL are on the same local network, and both are up 24/7. I am even logged in remotely to PULL using ssh, and I am getting those messages while being logged in, so the machine is definitely up, and not timing out for other connections.

I am not getting those messages all the time (every minute), but pretty frequently. There are bursts, and I am usually getting between 50 and 100 alerts a day.

Is there any way I can find out which ssh command fails? Is there some place where I can add verbosity options to ssh and get more information?
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42

ssh'ing into the machine works; there is no problem there. The problem only exists (sometimes) during replication. In what file can I specify additional flags for the ssh command used just for replication? Ideally I would get the verbose output of the command in the email alert, so I could correlate it with the errors. Hopefully this will help track down the problem.

Thanks!
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
No, nobody has replied with a solution. Any suggestions are welcome!

I've set up a very specific email filter to move those emails to the trash, but that's not a real solution.
 

grubx64

Cadet
Joined
May 1, 2017
Messages
8
Is there anything new? I'm stuck on the same problem. My backup machine only runs one day a week, so I get a lot of alert emails even though there is no new snapshot to replicate. Is it possible to run replication tasks only on a certain day? That would be an idea for a workaround.

Edit:
OK, here is a quick 'n' dirty fix: add this to autorepl.py (around line 245). Not optimal, but it works for me.
Code:
for replication in replication_tasks:
    # Skip replication on every day except Friday (weekday() == 4).
    if datetime.datetime.now().weekday() != 4:
        log.debug("No replication. It is not Friday!")
        continue
    # ...
 