Temporary replication failure

Joined
Jan 3, 2019
Messages
8
Hello,

We have 7 FreeNAS servers that are fully replicated onto one giant FreeNAS backup server.
Since a few days, we have issue with some replications. They are working great but suddenly the stop working with a nearly empty notification.
Later, it will retry and work again without any user intervention. All servers are online all the time.

We've also tried to set it up with a single snapshot dataset to be sure the snapshot list is identical on both ends.

Any clue?

Here is the report we receive by mail:
Code:
FreeNAS @ remote_server.fqdn

New alerts:
* Replication "dataset_name" failed: None
Command failed with code 1..



Here are the full logs for the replication failure:
Code:
[2020/06/04 15:00:00] DEBUG    [replication_task__task_7] [zettarepl.transport.local.shell.1.async_exec.14578] Running ['zfs', 'get', '-H', '-p', 'type', 'raid/project_files/<dataset_name>']
[2020/06/04 15:00:00] DEBUG    [replication_task__task_7] [zettarepl.transport.local.shell.1.async_exec.14578] Success: 'raid/project_files/<dataset_name>\ttype\tfilesystem\t-\n'
[2020/06/04 15:00:00] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095] Connecting...
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] starting thread (client mode): 0x14a6de90
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Local version/idstring: SSH-2.0-paramiko_2.7.1
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Remote version/idstring: SSH-2.0-OpenSSH_8.0-hpn14v15
[2020/06/04 15:00:00] INFO     [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Connected (version 2.0, client OpenSSH_8.0-hpn14v15)
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] kex algos:['curve25519-sha256', 'curve25519-sha256@libssh.org', 'ecdh-sha2-nistp256', 'ecdh-sha2-nistp384', 'ecdh-sha2-nistp521', 'diffie-hellman-group-exchange-sha256', 'diffie-hellman-group16-sha512', 'diffie-hellman-group18-sha512', 'diffie-hellman-group14-sha256', 'diffie-hellman-group14-sha1'] server key:['rsa-sha2-512', 'rsa-sha2-256', 'ssh-rsa', 'ecdsa-sha2-nistp256', 'ssh-ed25519'] client encrypt:['chacha20-poly1305@openssh.com', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'aes128-gcm@openssh.com', 'aes256-gcm@openssh.com', 'aes128-cbc', 'none'] server encrypt:['chacha20-poly1305@openssh.com', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'aes128-gcm@openssh.com', 'aes256-gcm@openssh.com', 'aes128-cbc', 'none'] client mac:['umac-64-etm@openssh.com', 'umac-128-etm@openssh.com', 'hmac-sha2-256-etm@openssh.com', 'hmac-sha2-512-etm@openssh.com', 'hmac-sha1-etm@openssh.com', 'umac-64@openssh.com', 'umac-128@openssh.com', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] server mac:['umac-64-etm@openssh.com', 'umac-128-etm@openssh.com', 'hmac-sha2-256-etm@openssh.com', 'hmac-sha2-512-etm@openssh.com', 'hmac-sha1-etm@openssh.com', 'umac-64@openssh.com', 'umac-128@openssh.com', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Kex agreed: ecdh-sha2-nistp256
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] HostKey agreed: ssh-ed25519
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Cipher agreed: aes128-ctr
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] MAC agreed: hmac-sha2-256
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Compression agreed: none
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] kex engine KexNistp256 specified hash_algo <built-in function openssl_sha256>
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Switch to new keys ...
[2020/06/04 15:00:00] DEBUG    [replication_task__task_7] [zettarepl.paramiko.replication_task__task_7] Trying SSH key b'*******************************************'
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] userauth is OK
[2020/06/04 15:00:00] INFO     [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Authentication (publickey) successful!
[2020/06/04 15:00:00] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14580] Running ['zfs', 'get', '-H', '-p', 'type', 'raid/project_files/<dataset_name>']
[2020/06/04 15:00:00] DEBUG    [replication_task__task_7] [zettarepl.paramiko.replication_task__task_7] [chan 0] Max packet in: 32768 bytes
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Received global request "hostkeys-00@openssh.com"
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Rejecting "hostkeys-00@openssh.com" global request from server.
[2020/06/04 15:00:00] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Debug msg: b'/root/.ssh/authorized_keys:18: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding'
[2020/06/04 15:00:01] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 0] Max packet out: 32768 bytes
[2020/06/04 15:00:01] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Secsh channel 0 opened.
[2020/06/04 15:00:01] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 0] Sesch channel 0 request ok
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14580] Waiting for exit status
[2020/06/04 15:00:01] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 0] EOF received (0)
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14580] Success: 'raid/project_files/<dataset_name>\ttype\tfilesystem\t-\n'
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.paramiko.replication_task__task_7] [chan 0] EOF sent (0)
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.transport.local.shell.1.async_exec.14582] Running ['zfs', 'list', '-t', 'filesystem,volume', '-H', '-o', 'name', '-s', 'name', '-d', '1', 'raid/project_files/<dataset_name>']
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.transport.local.shell.1.async_exec.14582] Success: 'raid/project_files/<dataset_name>\n'
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.transport.local.shell.1.async_exec.14584] Running ['zfs', 'list', '-t', 'snapshot', '-H', '-o', 'name', '-s', 'name', '-d', '1', 'raid/project_files/<dataset_name>']
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.transport.local.shell.1.async_exec.14584] Success: 'raid/project_files/<dataset_name>@auto-20200604.1500-1w\n'
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14586] Running ['zfs', 'list', '-t', 'filesystem,volume', '-H', '-o', 'name', '-s', 'name', '-d', '1', 'raid/project_files/<dataset_name>']
[2020/06/04 15:00:01] DEBUG    [replication_task__task_7] [zettarepl.paramiko.replication_task__task_7] [chan 1] Max packet in: 32768 bytes
....
OTHER CHAN INFO
....
[2020/06/04 15:13:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14648] Success: 'raid/project_files/<dataset_name>....16a71717e36031c0000aa2536e7\t-\n'
[2020/06/04 15:13:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.paramiko.replication_task__task_7] [chan 18] EOF sent (18)
[2020/06/04 15:13:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.replication.monitor] receive_resume_tokens: count=13, unique=13
[2020/06/04 15:13:18] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 4] Unhandled channel request "keepalive@openssh.com"
[2020/06/04 15:13:34] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 4] Unhandled channel request "keepalive@openssh.com"
[2020/06/04 15:13:49] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 4] Unhandled channel request "keepalive@openssh.com"
[2020/06/04 15:14:01] DEBUG    [replication_task__task_7.process] [zettarepl.transport.local.shell.1.async_exec.14592] Success: ''
[2020/06/04 15:14:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14649] Running ['zfs', 'get', '-H', '-p', 'receive_resume_token', 'raid/project_files/<dataset_name>']
[2020/06/04 15:14:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.paramiko.replication_task__task_7] [chan 19] Max packet in: 32768 bytes
[2020/06/04 15:14:03] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 4] Unhandled channel request "keepalive@openssh.com"
[2020/06/04 15:14:03] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 19] Max packet out: 32768 bytes
[2020/06/04 15:14:03] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] Secsh channel 19 opened.
[2020/06/04 15:14:03] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 19] Sesch channel 19 request ok
[2020/06/04 15:14:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14649] Waiting for exit status
[2020/06/04 15:14:03] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 19] EOF received (19)
[2020/06/04 15:14:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14649] Success: 'raid/project_files/<dataset_name>....a528b8bf3b3116e07002c4f34c4\t-\n'
[2020/06/04 15:14:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.paramiko.replication_task__task_7] [chan 19] EOF sent (19)
[2020/06/04 15:14:03] DEBUG    [replication_task__task_7.monitor] [zettarepl.replication.monitor] receive_resume_tokens: count=14, unique=14
[2020/06/04 15:14:07] DEBUG    [Thread-2627] [zettarepl.paramiko.replication_task__task_7] [chan 5] EOF received (5)
[2020/06/04 15:14:07] DEBUG    [replication_task__task_7.async_exec_tee.wait] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14591] Error 1: None
[2020/06/04 15:14:07] DEBUG    [replication_task__task_7.listen_exec.wait] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14590] Error 1: ''
[2020/06/04 15:14:07] DEBUG    [replication_task__task_7.process] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14591] Stopping
[2020/06/04 15:14:07] DEBUG    [replication_task__task_7.process] [zettarepl.transport.local.shell.1.async_exec.14592] Stopping
[2020/06/04 15:14:07] DEBUG    [replication_task__task_7.monitor] [zettarepl.transport.base_ssh.root@remote_serveur.fqdn.shell.1095.async_exec.14591] Stopping
[2020/06/04 15:14:07] ERROR    [replication_task__task_7] [zettarepl.replication.run] For task 'task_7' unhandled replication error SshNetcatExecException(None, ExecException(1, ''))
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 139, in run_replication_tasks
    run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 200, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 372, in run_replication_steps
    replicate_snapshots(step_template, incremental_base, snapshots, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 427, in replicate_snapshots
    run_replication_step(step_template.instantiate(incremental_base=incremental_base, snapshot=snapshot), observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 481, in run_replication_step
    ReplicationProcessRunner(process, monitor).run()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 33, in run
    raise self.process_exception
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
    self.replication_process.wait()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh_netcat.py", line 189, in wait
    raise SshNetcatExecException(None, self.listen_exec_error)
zettarepl.transport.ssh_netcat.SshNetcatExecException: None
Command failed with code 1
 
Joined
Jan 3, 2019
Messages
8
EDIT
Later, it will retry and work again without any user intervention. Explanations:

The thing is that the snapshots are copied but the task on source server won't tell the correct last copied snapshot.
 
Top