After successful replication - shutting down pull computer generates error message

Status
Not open for further replies.

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Is it a bug that an error message is generated when the pull computer is not found (shut down), even after all replication requirements have been met successfully, or is this working as designed?

Currently running: 9.3-STABLE-201509160044

My replication procedure has been weekly: my backup box turns on at a pre-determined time, the main box sends a snapshot to it, and then it shuts down. This process had been working great for the past year or so until the August update. Now, when the backup box is shut down (after a successful replication), every 60 seconds I receive the following email error report generated by autorepl.py:

Hello,

The replication failed for the local ZFS xx/xx while attempting to

send snapshot auto-xxxx to xxx.xxx.xxx.xxx


When I restart the backup box, the message stops.

I deleted all 52 snapshots on the backup box and let replication re-create them (took a while). That did not help.

Created a new boot USB (had to anyhow; was getting low on space). That did not help either.

As a workaround, I set the replication window to only 1 minute; it was previously all day. Now, on the 6 out of 7 days when the backup server is shut down, I get one failure notice a day.
 

dlavigne

Guest
Is the receiving end down when snapshots are being pushed to it? That message sounds like it is.
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
No, the receiving end is left up until the snapshots are all replicated and completed (just one snapshot is replicated each run).

As a test, I deleted all snapshots on the pull side and let them all rebuild, 52 snapshots in total. When all were complete, I checked to make sure every snapshot was there, manually shut the box down, and the error messages started.
 

dlavigne

Guest
What is the schedule for taking snapshots, and what is the schedule for replication?
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Snapshot set to -
Begin: 00:00:00
End: 00:15:00

Replication set to -
Begin: 07:00:00
End: 07:01:00
It was set to Begin: 00:00:00, End: 23:59:00; I changed it to a 1-minute window at 7 AM to stop the error emails arriving every 60 seconds.

A cron job shuts the backup box down at 18:00.
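As I understand it, replication (and thus the failure email) can only fire while the time of day falls inside the task's Begin/End window, which is why shrinking the window limits the emails. A minimal sketch of that kind of window check (the function name and the check itself are just my illustration, not actual FreeNAS code):

```python
from datetime import time

def in_window(now, begin, end):
    """Return True if time-of-day `now` falls inside the [begin, end] window."""
    return begin <= now <= end

# With my settings, attempts can only fire during the 07:00-07:01 window:
in_window(time(7, 0, 30), time(7, 0), time(7, 1))   # True
in_window(time(18, 0), time(7, 0), time(7, 1))      # False
```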
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Were you able to figure this out?
No - Sorry about the slow reply.

I have updated to the latest : FreeNAS-9.3-STABLE-201509282017

The error message email has changed to: "Replication nas/ds -> 192.168.xxx.xxx:nas failed: Failed: ssh: connect to host 192.168.xxx.xxx port 22: Operation timed out"

autorepl.py kicks off once every 60 seconds during the Begin and End period of the Replication Task settings, which is where the message is generated. I have the window set to 1 minute each day to limit the error message emails.
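What I would want is for the script to check whether the pull box is even reachable before attempting to send, and skip quietly if it is down. A sketch of that idea (purely my own guard, not something autorepl.py actually does):

```python
import socket

def ssh_port_open(host, port=22, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# If the pull box is shut down, skip the run without emailing an alert:
# if not ssh_port_open("192.168.xxx.xxx"):
#     continue
```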

Am I the only one having this issue?

Thanks for checking.
 

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
No - Sorry about the slow reply.
The error message email has changed to: "Replication nas/ds -> 192.168.xxx.xxx:nas failed: Failed: ssh: connect to host 192.168.xxx.xxx port 22: Operation timed out"

Am I the only one having this issue?

Nope, you're not. Since a few updates ago we've been receiving dozens of emails from our FreeNAS systems every day. Our replication server is online 24/7, but the WAN link is terribly flaky and there's nothing we can do about that.

Have a look at #11550 and #9315, which are related to your problem. I hope that both get fixed at some point, because at the moment we're simply ignoring the FreeNAS emails, which defeats their purpose.
 

Henry L

Dabbler
Joined
Nov 21, 2013
Messages
10
Nope, you're not. Since a few updates ago we've been receiving dozens of emails from our FreeNAS systems every day. Our replication server is online 24/7, but the WAN link is terribly flaky and there's nothing we can do about that.

Have a look at #11550 and #9315, which are related to your problem. I hope that both get fixed at some point, because at the moment we're simply ignoring the FreeNAS emails, which defeats their purpose.

mpfusion - #11550 looks like you are having the same issue when the WAN link is not available. It appears the server pops the same error email when your WAN connection flakes out, even though all of the datasets have completed replication without error. Essentially the same issue as mine. Thanks for noting it; I missed that bug report.

Bug #9315: I would still want to know if a replication has failed, but I agree it would be best if the email did not fire every 60 seconds to report the same issue.
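The behaviour I'd hope for could be as simple as remembering the last failure per task and only mailing when the message changes (again, just my sketch of the idea, not how autorepl.py works):

```python
def should_alert(last_seen, task_id, message):
    """Alert only when a task's failure message differs from the last one sent."""
    if last_seen.get(task_id) == message:
        return False  # same failure as before: stay quiet
    last_seen[task_id] = message
    return True

state = {}
should_alert(state, "nas/ds", "ssh: connect timed out")  # True, first report
should_alert(state, "nas/ds", "ssh: connect timed out")  # False, repeat
```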

Thank You
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
I am getting the same error, but on a LAN setup. Both PUSH and PULL are on the same local network, and both are up 24/7. I am even logged in remotely to PULL using ssh, and I am getting those messages while being logged in, so the machine is definitely up, and not timing out for other connections.

I am not getting those messages all the time (every minute), but pretty frequently. There are bursts, and I am usually getting between 50 and 100 alerts a day.

Is there any way I can find out which ssh command fails? Is there some place where I can add verbosity options to ssh and get more information?
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42

ssh'ing into the machine works; there is no problem there. The problem only exists (sometimes) during replication. In what file can I specify additional flags for the ssh command used just for replication? Ideally I would get the verbose output of the command in the email alert, so I could correlate it with the errors. Hopefully this will help track down the problem.

Thanks!
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
No, nobody has replied with a solution. Any suggestions are welcome!

I've set up a very specific email filter to move those emails to the trash, but that's not a real solution.
 

grubx64

Cadet
Joined
May 1, 2017
Messages
8
Is there anything new? I'm stuck on the same problem. My backup machine only runs one day a week, so I get a lot of alert emails even though there is no new snapshot to replicate. Is it possible to run replication tasks only on a certain day? That would be an idea for a workaround.

Edit:
OK, here is a quick 'n' dirty fix: add this to autorepl.py (around line 245). Not optimal, but it works for me.
Code:
for replication in replication_tasks:
    # Skip replication on every day except Friday (weekday() == 4).
    if datetime.datetime.now().weekday() != 4:
        log.debug("No replication. It is not Friday!")
        continue
    # ...
 