I'm having a problem where ZFS replication from one TrueNAS box to another reports 'ERROR' even though the replication itself seems to succeed. I'm trying to monitor it, and this thing keeps crying wolf even though nothing is wrong. I fear that we won't notice once the replication task actually does fail, and with how flaky/unreliable ZFS replication has been for me, I don't want to take that risk. I'm replicating a dataset "tank/shares" recursively into "tank/acme/nas01_backup/". It contains some common 'shared folders' like 'data', 'common', and 'setup', plus some profile datasets under 'profiles' ('profiles/user1', etc.). Including everything, there are 9 datasets in total.
The error is just that the "command returns with a non-zero exit value", apparently a value of '1'. When I try to build a 'manual' version of the command and check "$?", I get either '1' or '127', but exactly how to format the bash to capture the right exit code is something I don't know (just read the stuff below and try to disentangle that mess of quotes and ampersands).
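The part I couldn't get right by hand turns out to be what bash's PIPESTATUS array is for: $? only reflects the last stage of a pipeline, while PIPESTATUS records every stage. A toy sketch with dummy commands, nothing zfs-specific:

```shell
# $? only sees the last command of a pipeline, so a failing first stage
# is invisible to it; PIPESTATUS records every stage's exit code.
false | true
echo "stage 0 exited ${PIPESTATUS[0]}, stage 1 exited ${PIPESTATUS[1]}"
# $? alone would have reported 0 here (the exit code of `true`),
# hiding the failure of `false`.
```

Note that PIPESTATUS is reset by the very next command, so it has to be read (or copied into another variable) immediately after the pipeline.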
What I did was modify "/usr/lib/python3/dist-packages/zettarepl/transport/async_exec_tee.py" to print the contents of its 'args' variable, set the replication interval down to 1 hour, and restart middlewared. That got it outputting this (IP addresses redacted):
Code:
Command: Pipe((Pipe((Pipe((['sh', '-c', '(zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'], ['lz4c'])), ['throttle', '-B', '5242880'])), ['ssh', '-i', '/tmp/tmp3f_yu9eu', '-o', 'UserKnownHostsFile=/tmp/tmpury0d3xx', '-o', 'StrictHostKeyChecking=yes', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=10', '-p9122', 'sync@1.2.3.4', 'sh -c \'PATH=$PATH:/usr/local/sbin:/usr/sbin:/sbin sh -c \'"\'"\'exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { lz4c -d 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne 0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0\'"\'"\'\''])) (Pipe((Pipe((['sh', '-c', '(zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00 & PID=$!; echo "zettarepl: zfs send PID is $PID" 1>&2; wait $PID)'], ['lz4c'])), ['throttle', '-B', '5242880'])), ['ssh', '-i', '/tmp/tmp3f_yu9eu', '-o', 'UserKnownHostsFile=/tmp/tmpury0d3xx', '-o', 'StrictHostKeyChecking=yes', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=10', '-p9122', 'sync@1.2.3.4', 'sh -c \'PATH=$PATH:/usr/local/sbin:/usr/sbin:/sbin sh -c \'"\'"\'exec 3>&1; eval $(exec 4>&1 >&3 3>&-; { lz4c -d 4>&-; echo "pipestatus0=$?;" >&4; } | { zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup 4>&-; echo "pipestatus1=$?;" >&4; }); [ $pipestatus0 -ne 0 ] && exit $pipestatus0; [ $pipestatus1 -ne 0 ] && exit $pipestatus1; exit 0\'"\'"\'\''
The error itself doesn't really tell me anything. Without the extra print, all I get as the exception message is:
Code:
send from @auto-2024-01-23_10-00 to tank/shares/profiles@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/syncadm@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/administratie@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/user1@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/profiles/user2@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/common@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/setup@auto-2024-01-23_11-00 estimated size is 624B
send from @auto-2024-01-23_10-00 to tank/shares/data@auto-2024-01-23_11-00 estimated size is 624B
total estimated size is 5.48K
This is the stack trace (note: the line number in async_exec_tee will differ slightly because of the debug statements I added; I believe the original would report line 104 instead of 109).
Code:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 181, in run_replication_tasks
    retry_contains_partially_complete_state(
  File "/usr/lib/python3/dist-packages/zettarepl/replication/partially_complete_state.py", line 16, in retry_contains_partially_complete_state
    return func()
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 182, in <lambda>
    lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 278, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 666, in run_replication_steps
    replicate_snapshots(step_template, incremental_base, snapshots, include_intermediate, encryption, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 707, in replicate_snapshots
    run_replication_step(step, observer)
  File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 787, in run_replication_step
    ReplicationProcessRunner(process, monitor).run()
  File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 33, in run
    raise self.process_exception
  File "/usr/lib/python3/dist-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
    self.replication_process.wait()
  File "/usr/lib/python3/dist-packages/zettarepl/transport/ssh.py", line 158, in wait
    stdout = self.async_exec.wait()
  File "/usr/lib/python3/dist-packages/zettarepl/transport/async_exec_tee.py", line 109, in wait
    raise ExecException(exit_event.returncode, "Command: %s %s ... Output: %s " % (self.args_bak, self.args_bak.cmds, self.output))
I tried to build an equivalent bash command and ran it on the command line with a verbose flag added.
Code:
zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c tank/shares@auto-2024-01-23_11-00 | ssh -i ~/.ssh/stor02.key -p 9122 sync@1.2.3.4 /sbin/zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup
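To pin down which side of the pipe returns the 1, the same manual command can be wrapped with bash's pipefail plus PIPESTATUS. This is a sketch of a debugging harness, not what zettarepl itself runs:

```shell
#!/bin/bash
# Same pipeline as above, but report each stage's exit code afterwards.
set -o pipefail
zfs send -V -R -w -i tank/shares@auto-2024-01-23_10-00 -L -c \
    tank/shares@auto-2024-01-23_11-00 \
  | ssh -i ~/.ssh/stor02.key -p 9122 sync@1.2.3.4 \
      /sbin/zfs recv -s -F -x sharesmb -x sharenfs -x mountpoint tank/acme/nas01_backup
status=("${PIPESTATUS[@]}")   # copy immediately, PIPESTATUS is volatile
echo "zfs send exited ${status[0]}, ssh/zfs recv exited ${status[1]}" >&2
# With pipefail, $? is non-zero if EITHER side failed. An exit code of 127
# typically means "command not found" on one end (e.g. a wrong recv path).
```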
This outputs the same 'send' lines, but also some more info from the receive side.
Code:
receiving incremental stream of tank/shares@auto-2024-01-23_12-00 into tank/acme/nas01_backup@auto-2024-01-23_12-00
snap tank/acme/nas01_backup@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/profiles/syncadm@auto-2024-01-23_12-00 into tank/acme/nas01_backup/profiles/syncadm@auto-2024-01-23_12-00
snap tank/acme/nas01_backup/profiles/syncadm@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/profiles/user1@auto-2024-01-23_12-00 into tank/acme/nas01_backup/profiles/user1@auto-2024-01-23_12-00
snap tank/acme/nas01_backup/profiles/user1@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/common@auto-2024-01-23_12-00 into tank/acme/nas01_backup/common@auto-2024-01-23_12-00
snap tank/acme/nas01_backup/common@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
receiving incremental stream of tank/shares/data@auto-2024-01-23_12-00 into tank/acme/nas01_backup/data@auto-2024-01-23_12-00
cannot receive: failed to read from stream
snap tank/acme/nas01_backup/data@auto-2024-01-23_12-00 already exists; ignoring
received 0B stream in 1 seconds (0B/sec)
A few things here stand out:
- A "cannot receive: failed to read from stream" error for one of the datasets, 'data'.
- Some datasets seem to be skipped in this output: I didn't observe a line for the 'profiles/user2', 'setup', or 'profiles' datasets. 9 datasets are sent, but only 6 report being received.
- And yet, when I check md5 checksums of all the files after a sync, everything appears to be present on the receiving end, as far as I can tell:
Code:
# On the first NAS:
cd /mnt/tank/shares/profiles/user2
find . -type f -print0 | sort -z | xargs -0 md5sum | tee ~/data1.log

# On the second NAS:
cd /mnt/tank/acme/nas01_backup/profiles/user2
find . -type f -print0 | sort -z | xargs -0 md5sum | tee ~/data2.log

# Then, after transferring the file:
diff data1.log data2.log
## Gives no output.
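As a sanity check that's cheaper than hashing every file, the snapshot GUIDs on both sides can also be compared. ZFS assigns each snapshot a guid property that survives send/receive, so matching GUIDs mean matching snapshots. The file names below are just mine:

```shell
# On the first NAS: dump the guid of every snapshot under the source.
zfs get -H -o name,value guid -r -t snapshot tank/shares \
  | awk '{print $2}' | sort > ~/guids_src.txt

# On the second NAS: same for the backup target.
zfs get -H -o name,value guid -r -t snapshot tank/acme/nas01_backup \
  | awk '{print $2}' | sort > ~/guids_dst.txt

# After copying one file over: no diff output means every snapshot made it.
# (Differing snapshot retention on the two sides would show up as extra lines.)
diff guids_src.txt guids_dst.txt
```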
I'm clueless as to what's going on exactly at this point.
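For the monitoring side of things, one option I'm considering is polling the middleware directly instead of scraping alerts. `midclt call replication.query` returns the task list as JSON on TrueNAS; the exact field layout used below is from my system and may differ between releases:

```shell
# Print id, name and last state for each replication task; a real failure
# shows up as "ERROR" here, same as in the UI.
midclt call replication.query | python3 -c '
import json, sys
for task in json.load(sys.stdin):
    print(task["id"], task["name"], task["state"]["state"])
'
```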