Replication for specific dataset fails out of the blue

Status
Not open for further replies.

monovitae

Explorer
Joined
Jan 7, 2015
Messages
55
Howdy all, I've got two FreeNAS boxes connected via LAN that have been replicating snapshots flawlessly for the last year or so. Now, all of a sudden, a couple of datasets are spamming my email constantly with the following failure:

The replication failed for the local ZFS tank1/deathstar/hsi/hsiweb while attempting to
apply incremental send of snapshot auto-20160525.2335-30d -> auto-20160526.2335-30d to hsifs2

There are three parent-level datasets being replicated to hsifs2, all with the same settings, and the others are fine.

Options are: Recursively replicate (checked), Delete Stale (checked), lz4, Encryption: Standard.

Other child datasets of tank1/deathstar/hsi/ also seem to replicate; it's just hsiweb that won't go. The sending machine is still taking daily snapshots, but the receiving machine is now about six days behind, with the 5/25/2016 snapshot being the one that fails.

When it started failing, I hadn't touched either FreeNAS box in months. After a couple of days of errors, I updated both boxes to the latest version, FreeNAS-9.10-STABLE-201605240427 (64fcd8e), and rebooted, but it still continues to fail.

Any suggestions on how to fix or research further?
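In case it helps, here's roughly what I was planning to check next: whether the base snapshot for the failing incremental still exists on both sides (commands are a sketch; I'm assuming the target path on hsifs2 mirrors the source under tank2).

Code:
# On the sending box (hsifs): snapshots of the failing dataset
zfs list -H -t snapshot -o name -r tank1/deathstar/hsi/hsiweb

# On the receiving box (hsifs2), via the replication key
ssh -i /data/ssh/replication hsifs2 "zfs list -H -t snapshot -o name -r tank2/deathstar/hsi/hsiweb"

If auto-20160525.2335-30d turned out to be missing on the receiving side, the incremental send from it would have nothing to apply against.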

Thanks,
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Please post the contents of /var/log/messages in code tags.
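Something like this should grab the recent portion (adjust the line count as needed):

Code:
tail -n 500 /var/log/messages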
 

monovitae

Explorer
Joined
Jan 7, 2015
Messages
55
I'm away from the office right now, but I will post as soon as I can get back, tonight or possibly tomorrow morning. Thanks for looking.
 

monovitae

Explorer
Joined
Jan 7, 2015
Messages
55
Looks like there's some replication-related stuff at the bottom of this file. I think the email timeouts are normal at this point, as my mail server thinks the FreeNAS box is spamming it, which it essentially is (I don't know why there isn't a setting for email alert intervals, but I digress).

Code:
May 29 17:53:51 hsifs mountd[2817]: umount request succeeded from 10.0.9.82 for /mnt/tank1/deathstar/hsi/softlib/isos
May 29 17:59:13 hsifs ctld[29435]: 10.0.21.22 (iqn.2015-03.com.hsimt.xen2:xen2): read: connection lost
May 29 17:59:13 hsifs ctld[2789]: child process 29435 terminated with exit status 1
May 29 17:59:17 hsifs ctld[29451]: 10.0.21.22: read: connection lost
May 29 17:59:17 hsifs ctld[2789]: child process 29451 terminated with exit status 1
May 29 17:59:18 hsifs ctld[29452]: 10.0.21.22: read: connection lost
May 29 17:59:18 hsifs ctld[2789]: child process 29452 terminated with exit status 1
May 29 17:59:18 hsifs ctld[29457]: 10.0.20.22: read: connection lost
May 29 17:59:18 hsifs ctld[2789]: child process 29457 terminated with exit status 1
May 29 17:59:19 hsifs ctld[29461]: 10.0.21.22: read: connection lost
May 29 17:59:19 hsifs ctld[2789]: child process 29461 terminated with exit status 1
May 29 17:59:19 hsifs ctld[29462]: 10.0.20.22: read: connection lost
May 29 17:59:19 hsifs ctld[2789]: child process 29462 terminated with exit status 1
May 29 17:59:25 hsifs ctld[29483]: 10.0.20.22: read: connection lost
May 29 17:59:25 hsifs ctld[2789]: child process 29483 terminated with exit status 1
May 29 17:59:25 hsifs ctld[29484]: 10.0.21.22: read: connection lost
May 29 17:59:25 hsifs ctld[2789]: child process 29484 terminated with exit status 1
May 29 17:59:25 hsifs ctld[29487]: 10.0.21.22: read: connection lost
May 29 17:59:25 hsifs ctld[2789]: child process 29487 terminated with exit status 1
May 29 17:59:26 hsifs ctld[29491]: 10.0.20.22: read: connection lost
May 29 17:59:26 hsifs ctld[2789]: child process 29491 terminated with exit status 1
May 29 17:59:26 hsifs ctld[29496]: 10.0.21.22: read: connection lost
May 29 17:59:26 hsifs ctld[2789]: child process 29496 terminated with exit status 1
May 29 17:59:26 hsifs ctld[29500]: 10.0.20.22: read: connection lost
May 29 17:59:26 hsifs ctld[2789]: child process 29500 terminated with exit status 1
May 29 17:59:32 hsifs ctld[29519]: 10.0.20.22: read: connection lost
May 29 17:59:32 hsifs ctld[2789]: child process 29519 terminated with exit status 1
May 29 17:59:46 hsifs mountd[2817]: mount request succeeded from 10.0.9.82 for /mnt/tank1/deathstar/hsi/softlib/isos
May 29 18:06:00 hsifs manage.py: [common.system:286] Failed to send email: [Errno 60] Operation timed out
Traceback (most recent call last):
  File "/usr/local/www/freenasUI/../freenasUI/common/system.py", line 268, in send_mail
  server = _get_smtp_server(timeout, local_hostname=local_hostname)
  File "/usr/local/www/freenasUI/../freenasUI/common/system.py", line 201, in _get_smtp_server
  local_hostname=local_hostname)
  File "/usr/local/lib/python2.7/smtplib.py", line 256, in __init__
  (code, msg) = self.connect(host, port)
  File "/usr/local/lib/python2.7/smtplib.py", line 316, in connect
  self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/local/lib/python2.7/smtplib.py", line 291, in _get_socket
  return socket.create_connection((host, port), timeout)
  File "/usr/local/lib/python2.7/socket.py", line 575, in create_connection
  raise err
error: [Errno 60] Operation timed out
May 29 18:08:26 hsifs alert.py: [common.system:286] Failed to send email: [Errno 60] Operation timed out
Traceback (most recent call last):
  File "/usr/local/www/freenasUI/common/system.py", line 268, in send_mail
  server = _get_smtp_server(timeout, local_hostname=local_hostname)
  File "/usr/local/www/freenasUI/common/system.py", line 201, in _get_smtp_server
  local_hostname=local_hostname)
  File "/usr/local/lib/python2.7/smtplib.py", line 256, in __init__
  (code, msg) = self.connect(host, port)
  File "/usr/local/lib/python2.7/smtplib.py", line 316, in connect
  self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/local/lib/python2.7/smtplib.py", line 291, in _get_socket
  return socket.create_connection((host, port), timeout)
  File "/usr/local/lib/python2.7/socket.py", line 575, in create_connection
  raise err
error: [Errno 60] Operation timed out
May 29 18:16:05 hsifs mountd[2817]: umount request succeeded from 10.0.9.81 for /mnt/tank1/deathstar/hsi/softlib/isos
May 29 18:20:01 hsifs find_alias_for_smtplib.py: [common.system:286] Failed to send email: [Errno 60] Operation timed out
Traceback (most recent call last):
  File "/usr/local/www/freenasUI/common/system.py", line 268, in send_mail
  server = _get_smtp_server(timeout, local_hostname=local_hostname)
  File "/usr/local/www/freenasUI/common/system.py", line 201, in _get_smtp_server
  local_hostname=local_hostname)
  File "/usr/local/lib/python2.7/smtplib.py", line 256, in __init__
  (code, msg) = self.connect(host, port)
  File "/usr/local/lib/python2.7/smtplib.py", line 316, in connect
  self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/local/lib/python2.7/smtplib.py", line 291, in _get_socket
  return socket.create_connection((host, port), timeout)
  File "/usr/local/lib/python2.7/socket.py", line 575, in create_connection
  raise err
error: [Errno 60] Operation timed out
May 29 18:22:14 hsifs ctld[36594]: 10.0.20.21 (iqn.2015-04.com.hsimt.xen1:xen1): read: connection lost
May 29 18:22:14 hsifs ctld[2789]: child process 36594 terminated with exit status 1
May 29 18:22:18 hsifs ctld[36611]: 10.0.20.21: read: connection lost
May 29 18:22:18 hsifs ctld[2789]: child process 36611 terminated with exit status 1
May 29 18:22:18 hsifs ctld[36613]: 10.0.21.21: read: connection lost
May 29 18:22:18 hsifs ctld[2789]: child process 36613 terminated with exit status 1
May 29 18:22:19 hsifs ctld[36616]: 10.0.21.21: read: connection lost
May 29 18:22:19 hsifs ctld[2789]: child process 36616 terminated with exit status 1
May 29 18:22:19 hsifs ctld[36619]: 10.0.20.21: read: connection lost
May 29 18:22:19 hsifs ctld[2789]: child process 36619 terminated with exit status 1
May 29 18:22:20 hsifs ctld[36623]: 10.0.20.21: read: connection lost
May 29 18:22:20 hsifs ctld[2789]: child process 36623 terminated with exit status 1
May 29 18:22:25 hsifs ctld[36646]: 10.0.21.21: read: connection lost
May 29 18:22:25 hsifs ctld[2789]: child process 36646 terminated with exit status 1
May 29 18:22:26 hsifs ctld[36647]: 10.0.21.21: read: connection lost
May 29 18:22:26 hsifs ctld[2789]: child process 36647 terminated with exit status 1
May 29 18:22:26 hsifs ctld[36650]: 10.0.21.21: read: connection lost
May 29 18:22:26 hsifs ctld[2789]: child process 36650 terminated with exit status 1
May 29 18:22:26 hsifs ctld[36651]: 10.0.20.21: read: connection lost
May 29 18:22:26 hsifs ctld[2789]: child process 36651 terminated with exit status 1
May 29 18:22:27 hsifs ctld[36658]: 10.0.20.21: read: connection lost
May 29 18:22:27 hsifs ctld[2789]: child process 36658 terminated with exit status 1
May 29 18:22:32 hsifs ctld[36677]: 10.0.20.21: read: connection lost
May 29 18:22:32 hsifs ctld[2789]: child process 36677 terminated with exit status 1
May 29 18:22:33 hsifs ctld[36678]: 10.0.21.21: read: connection lost
May 29 18:22:33 hsifs ctld[2789]: child process 36678 terminated with exit status 1
May 29 18:22:39 hsifs mountd[2817]: mount request succeeded from 10.0.9.81 for /mnt/tank1/deathstar/hsi/softlib/isos
May 29 18:25:38 hsifs mountd[2817]: umount request succeeded from 10.0.9.82 for /mnt/tank1/deathstar/hsi/softlib/isos
May 29 18:30:57 hsifs ctld[38815]: 10.0.21.22 (iqn.2015-03.com.hsimt.xen2:xen2): read: connection lost
May 29 18:30:57 hsifs ctld[2789]: child process 38815 terminated with exit status 1
May 29 18:31:01 hsifs ctld[38871]: 10.0.21.22: read: connection lost
May 29 18:31:01 hsifs ctld[2789]: child process 38871 terminated with exit status 1
May 29 18:31:01 hsifs ctld[38874]: 10.0.21.22: read: connection lost
May 29 18:31:01 hsifs ctld[2789]: child process 38874 terminated with exit status 1
May 29 18:31:02 hsifs ctld[38878]: 10.0.20.22: read: connection lost
May 29 18:31:02 hsifs ctld[2789]: child process 38878 terminated with exit status 1
May 29 18:31:02 hsifs ctld[38881]: 10.0.21.22: read: connection lost
May 29 18:31:02 hsifs ctld[2789]: child process 38881 terminated with exit status 1
May 29 18:31:02 hsifs ctld[38884]: 10.0.20.22: read: connection lost
May 29 18:31:02 hsifs ctld[2789]: child process 38884 terminated with exit status 1
May 29 18:31:08 hsifs ctld[38904]: 10.0.20.22: read: connection lost
May 29 18:31:08 hsifs ctld[2789]: child process 38904 terminated with exit status 1
May 29 18:31:09 hsifs ctld[38908]: 10.0.21.22: read: connection lost
May 29 18:31:09 hsifs ctld[2789]: child process 38908 terminated with exit status 1
May 29 18:31:09 hsifs ctld[38909]: 10.0.21.22: read: connection lost
May 29 18:31:09 hsifs ctld[2789]: child process 38909 terminated with exit status 1
May 29 18:31:09 hsifs ctld[38919]: 10.0.20.22: read: connection lost
May 29 18:31:09 hsifs ctld[2789]: child process 38919 terminated with exit status 1
May 29 18:31:10 hsifs ctld[38956]: 10.0.21.22: read: connection lost
May 29 18:31:10 hsifs ctld[2789]: child process 38956 terminated with exit status 1
May 29 18:31:10 hsifs ctld[38957]: 10.0.20.22: read: connection lost
May 29 18:31:10 hsifs ctld[2789]: child process 38957 terminated with exit status 1
May 29 18:31:15 hsifs ctld[38983]: 10.0.20.22: read: connection lost
May 29 18:31:15 hsifs ctld[2789]: child process 38983 terminated with exit status 1
May 29 18:31:30 hsifs mountd[2817]: mount request succeeded from 10.0.9.82 for /mnt/tank1/deathstar/hsi/softlib/isos
May 29 18:36:00 hsifs manage.py: [common.system:286] Failed to send email: [Errno 60] Operation timed out
Traceback (most recent call last):
  File "/usr/local/www/freenasUI/../freenasUI/common/system.py", line 268, in send_mail
  server = _get_smtp_server(timeout, local_hostname=local_hostname)
  File "/usr/local/www/freenasUI/../freenasUI/common/system.py", line 201, in _get_smtp_server
  local_hostname=local_hostname)
  File "/usr/local/lib/python2.7/smtplib.py", line 256, in __init__
  (code, msg) = self.connect(host, port)
  File "/usr/local/lib/python2.7/smtplib.py", line 316, in connect
  self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/local/lib/python2.7/smtplib.py", line 291, in _get_socket
  return socket.create_connection((host, port), timeout)
  File "/usr/local/lib/python2.7/socket.py", line 575, in create_connection
  raise err
error: [Errno 60] Operation timed out
May 29 20:35:38 hsifs autorepl.py: [common.system:286] Failed to send email: [Errno 60] Operation timed out
Traceback (most recent call last):
  File "/usr/local/www/freenasUI/common/system.py", line 268, in send_mail
  server = _get_smtp_server(timeout, local_hostname=local_hostname)
  File "/usr/local/www/freenasUI/common/system.py", line 201, in _get_smtp_server
  local_hostname=local_hostname)
  File "/usr/local/lib/python2.7/smtplib.py", line 256, in __init__
  (code, msg) = self.connect(host, port)
  File "/usr/local/lib/python2.7/smtplib.py", line 316, in connect
  self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/local/lib/python2.7/smtplib.py", line 291, in _get_socket
  return socket.create_connection((host, port), timeout)
  File "/usr/local/lib/python2.7/socket.py", line 575, in create_connection
  raise err
error: [Errno 60] Operation timed out
May 29 20:36:48 hsifs alert.py: [common.system:286] Failed to send email: [Errno 61] Connection refused
Traceback (most recent call last):
  File "/usr/local/www/freenasUI/common/system.py", line 268, in send_mail
  server = _get_smtp_server(timeout, local_hostname=local_hostname)
  File "/usr/local/www/freenasUI/common/system.py", line 201, in _get_smtp_server
  local_hostname=local_hostname)
  File "/usr/local/lib/python2.7/smtplib.py", line 256, in __init__
  (code, msg) = self.connect(host, port)
  File "/usr/local/lib/python2.7/smtplib.py", line 316, in connect
  self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/local/lib/python2.7/smtplib.py", line 291, in _get_socket
  return socket.create_connection((host, port), timeout)
  File "/usr/local/lib/python2.7/socket.py", line 575, in create_connection
  raise err
error: [Errno 61] Connection refused
May 29 23:35:01 hsifs autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs snapshot -r "tank1/deathstar/ahg@auto-20160529.2335-30d"
May 29 23:35:01 hsifs autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs snapshot -r "tank1/deathstar/hsi@auto-20160529.2335-30d"
May 29 23:35:01 hsifs autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs snapshot -r "tank1/deathstar/xen@auto-20160529.2335-30d"
May 29 23:35:15 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/xen/xenslow@auto-20160428.2335-30d'"
May 29 23:35:16 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/xen@auto-20160428.2335-30d'"
May 29 23:36:00 hsifs autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs destroy -r -d "tank1/deathstar/ahg@auto-20160429.2335-30d"
May 29 23:36:01 hsifs autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs destroy -r -d "tank1/deathstar/xen@auto-20160429.2335-30d"
May 29 23:36:01 hsifs autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs destroy -r -d "tank1/deathstar/hsi@auto-20160429.2335-30d"
May 29 23:37:03 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/ahg/media@auto-20160429.2335-30d'"
May 29 23:37:04 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/ahg/ragnarostemp@auto-20160429.2335-30d'"
May 29 23:37:05 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/ahg/backups@auto-20160429.2335-30d'"
May 29 23:37:06 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/ahg@auto-20160429.2335-30d'"
May 29 23:37:08 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/hsi/backups@auto-20160429.2335-30d'"
May 29 23:37:09 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/hsi/softlib@auto-20160429.2335-30d'"
May 29 23:37:10 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/hsi/data@auto-20160429.2335-30d'"
May 29 23:38:39 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs destroy -d 'tank2/deathstar/hsi@auto-20160429.2335-30d'"
May 30 00:00:00 hsifs newsyslog[49184]: logfile turned over due to size>100K

 

monovitae

Explorer
Joined
Jan 7, 2015
Messages
55
If you need further logs, I can upload them to my Dropbox, or I can grep out specific information and post it here.
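For example, something along these lines should pull just the replication-related entries (a sketch; the pattern may need adjusting):

Code:
grep -i autorepl /var/log/messages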
 

monovitae

Explorer
Joined
Jan 7, 2015
Messages
55
I may have discovered the source of my problem, in /var/log/debug.log:
Code:
[root@hsifs] /var/log# tail -f debug.log
Jun  1 19:19:01 hsifs autorepl.py: [tools.autorepl:338] Unable to create remote dataset tank2: cannot create 'tank2/deathstar/ahg': dataset already exists
Jun  1 19:19:01 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs list -H -t snapshot -p -o name,creation -r 'tank2/deathstar/ahg'"
Jun  1 19:19:01 hsifs autorepl.py: [tools.autorepl:305] Checking dataset tank1/deathstar/hsi
Jun  1 19:19:01 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /sbin/zfs list -H -t snapshot -p -o name,creation -r "tank1/deathstar/hsi"
Jun  1 19:19:01 hsifs autorepl.py: [tools.autorepl:330] ds = deathstar, remotefs = tank2
Jun  1 19:19:01 hsifs autorepl.py: [tools.autorepl:338] Unable to create remote dataset tank2: cannot create 'tank2/deathstar': dataset already exists
Jun  1 19:19:01 hsifs autorepl.py: [tools.autorepl:330] ds = deathstar/hsi, remotefs = tank2
Jun  1 19:19:02 hsifs autorepl.py: [tools.autorepl:338] Unable to create remote dataset tank2: cannot create 'tank2/deathstar/hsi': dataset already exists
Jun  1 19:19:02 hsifs autorepl.py: [common.pipesubr:61] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "zfs list -H -t snapshot -p -o name,creation -r 'tank2/deathstar/hsi'"
Jun  1 19:19:02 hsifs autorepl.py: [tools.autorepl:138] Sending zfs snapshot: /sbin/zfs send -V -p -i tank1/deathstar/hsi/data@auto-20160529.2335-30d tank1/deathstar/hsi/data@auto-20160530.2335-30d | /usr/local/bin/lz4c | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 hsifs2 "/usr/local/bin/lz4c -d | /sbin/zfs receive -F -d 'tank2' && echo Succeeded"
Jun  1 19:20:00 hsifs autosnap.py: [tools.autosnap:61] Popen()ing: /sbin/zfs list -t snapshot -H -o name
Jun  1 19:20:00 hsifs autosnap.py: [tools.autosnap:449] Autorepl running, skip destroying snapshots
Jun  1 19:20:01 hsifs autorepl.py: [tools.autorepl:184] Checking if process 3972 is still alive
Jun  1 19:20:01 hsifs autorepl.py: [tools.autorepl:188] Process 3972 still working, quitting
Jun  1 19:20:33 hsifs autorepl.py: [tools.autorepl:157] Replication result: cannot receive incremental stream: out of space
Error 35 : Write error : cannot write compressed block


Which is kind of surprising, since other snapshots have been going over just fine. Oh well, I've got new drives on the way to expand and test. It would also be nice if the GUI or the emails had told me this; I could have solved it sooner. Or if the alert window on the receiving side gave a nastier warning: I am getting low on space percentage-wise, 15% free, but that's still 2.7 TB available.
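For anyone who hits the same error, checking space and quotas on the receiving pool first would have saved me the digging (run on the destination box; tank2 is my pool name):

Code:
zpool list tank2
zfs list -o space -r tank2
zfs get -r quota,refquota,refreservation tank2/deathstar

A receive can still fail with "out of space" if a quota or reservation on the destination dataset is hit, even when the pool as a whole shows plenty of space free.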
 