Replication (sometimes) fails with "ssh_dispatch_run_fatal"

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
I have two TrueNAS servers in different locations, nasA and nasB. nasB pulls snapshots from nasA every day, using a replication task.

This replication worked well for a long time. I recently had to reinstall the operating system due to a failure of the boot drive, and the problems started after that. Both systems are on TrueNAS-13.0-U3.1.

The replication is set to run in the evening. Often, but not every time, I get an email after about half an hour with the following alert:
Code:
* Replication "NASA volume1" failed: ssh_dispatch_run_fatal: Connection to 10.xx.x.Y port 22: message authentication code incorrect
  (stdin): Unexpected EOF in worker 3
cannot receive incremental stream: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
    zfs send -t 1-14026c5196-118-789c636064000310a50...


The last line continues with a long string of alphanumeric characters, truncated here with '..'. The console log shows nothing around the time the mail is sent. Settings for the replication task are in the screenshot below.

If anybody has any ideas about this, I would be glad to hear them. Where can I find more information? Is there another log?
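
(For reference, since the partially received snapshot is saved, the transfer can in principle be resumed rather than restarted from scratch: the token is also exposed as the receive_resume_token property on the target dataset. A resume would look roughly like this, with placeholder dataset names and the token elided:)
Code:
# on the pulling system (nasB): read the saved token, then resume the pull over SSH
zfs get -H -o value receive_resume_token volume1/some-dataset
ssh nasa.example.org zfs send -t <token> | zfs receive -s volume1/some-dataset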

Screenshot 2022-12-20 at 10-33-37 TrueNAS - nasb.stans.iggland.li.png
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
So I found the right log to look at (and some more time..). I found the following in /var/log/zettarepl.log:

The pulling system asks for progress reports during the replication and gets them many times (roughly every 2%, from 3% to 84%). Then, on a failure, this happens:
Code:
zfs send with PID 69643 is gone
ssh_dispatch_run_fatal: Connection to 10.xx.x.Y port 22: message authentication code incorrect


What follows is in the attached error log.

If I manually replicate a single snapshot, I can get a similar message:
Code:
# ssh nasa.example.org zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | zfs receive volume1/some-dataset/test
full send of volume1/some-dataset@daily-2023-01-02_21-00 estimated size is 48.7G
total estimated size is 48.7G
TIME        SENT   SNAPSHOT volume1/some-dataset@daily-2023-01-02_21-00
10:48:20   2.86M   volume1/some-dataset@daily-2023-01-02_21-00
10:48:21   7.50M   volume1/some-dataset@daily-2023-01-02_21-00
...
    deleted similar lines
...
11:19:22   24.5G   volume1/some-dataset@daily-2023-01-02_21-00
11:19:23   24.5G   volume1/some-dataset@daily-2023-01-02_21-00
Fssh_ssh_dispatch_run_fatal: Connection to 10.xx.x.Y port 22: message authentication code incorrect
cannot receive new filesystem stream: incomplete stream


On the sending system, what I can see around the same time in zettarepl.log is the creation and destruction of other snapshots on other datasets; nothing directly to do with the snapshot being transferred in this example.

Any ideas? These two examples involve different datasets; the problem doesn't stick to any one dataset, and it does not fail every time either.
 

Attachments

  • zettarepl_extract.txt (7.6 KB)

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
Going the other way, i.e. doing a manual push (along the lines of the sketch below), the replication also quits with an error:

Code:
client_loop: send disconnect: Broken pipe
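
(The exact push command isn't quoted above; think of it as the mirror of the manual pull from earlier, roughly along these lines with placeholder host and dataset names:)
Code:
zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | ssh nasb.example.org zfs receive volume1/some-dataset/test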
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
I submitted a bug report; the suggested solution there seems to be to make sure the replication is restarted after it fails.

I never figured this out. For the time being, I manually ran the replication as many times as required each day to make it reach the end (each time it is restarted, it makes some progress).
For a few days now, I have not had any more failures. I have no idea why, or whether it's just luck.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
For both the replication sender and the target, enable SSH keep-alives by setting this Auxiliary parameter in the SSH service:

ClientAliveInterval 60
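
(For context: TrueNAS appends the SSH service's Auxiliary parameters to sshd_config, so the effective configuration on each box should end up with something like the lines below. ClientAliveCountMax is an optional extra, not part of the setting above:)
Code:
# send a keep-alive probe every 60 seconds; drop the session after 3 unanswered probes
ClientAliveInterval 60
ClientAliveCountMax 3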
 
datafiregroup

Joined
Oct 29, 2018
Messages
2
@Samuel Tai - Thanks for the input. I do have that set on both NAS machines (13.0-U4). Unfortunately, I still have the same issues, sometimes after 5 minutes, sometimes after 15. I am trying to replicate a 14 TB dataset, so @AbsolutIggy's suggestion of manually restarting until it completes won't work very well for me, though it did work to replicate a 250 GB dataset.

I'm just not sure where to turn at this point. The bug ticket that was opened didn't seem to suggest a root cause or a workaround.
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
@Samuel Tai Thanks, I did not have that set, so I've added it now. However, since this problem hasn't appeared in nearly two weeks, I don't know whether I'll be able to tell if it makes a difference.

@datafiregroup I agree my solution isn't really a solution, but I didn't have an alternative, as I need the replication. It's a bit of a PITA, though, running it manually every day. I considered making a script that restarts the replication through the API, but never got around to it. The total dataset is around 7 TB, by the way, not just 250 GB. It wasn't always the same dataset causing trouble; it seemed to vary. Failure times varied, as did the amount of transferred data.

Have you tried replicating your snapshots locally, without going over SSH, to see if you have the same issue?
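
(A local send/receive on the sending box, roughly like this with placeholder names, would take SSH out of the picture entirely; the target dataset must not already exist:)
Code:
zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | zfs receive volume1/replication-test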
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I think what's going on is SSH mis-interpreting a byte in the replication stream as an SSH ~ command, which causes it to close the connection, and bork the replication. Try switching to SSH + netcat.
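
(One way to rule that out with the manual pull from earlier is to disable escape-character processing explicitly; note that ssh normally only interprets the escape character on sessions with a pty, so this is just a test, not a confirmed fix:)
Code:
ssh -e none nasa.example.org zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | zfs receive volume1/some-dataset/test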
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
OK, so my error has started appearing again. I've now switched to SSH+NETCAT; let's see if that makes the problem go away permanently.
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
After continuing the replication with SSH+NETCAT, it didn't take long for another error to appear; it's slightly different from before:

Code:
Replication "NAS volume1" failed: Passive side: I/O error
Active side: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
    zfs send -t 1-16581aee9a-128-789c636064000310a501c49c503..


(With more alphanumeric characters following..)
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
Yeah, SSH+NETCAT is definitely not any better. The failures keep occurring.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
You've got some sort of problem with the replication source. How often do you scrub your pool?
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
Hmm, interesting idea, but scrubs are scheduled every month.
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
@datafiregroup If you're still having this problem, maybe the following will help:

I've put this command in my crontab. It uses the API to check the status of the replication; if it's in an error state, the replication is restarted. You'll need to look up the ID of your replication and substitute it in both URLs (and use your own API key).
Code:
if [ "$(curl -s -X GET -H "Authorization: Bearer 1-H..." "http://localhost/api/v2.0/replication/id/1" | jq '.state.state')" = '"ERROR"' ]; then curl -s -X POST -H "Authorization: Bearer 1-H..." "http://localhost/api/v2.0/replication/id/1/run"; fi


It now runs every 10 minutes on my NAS. I still get error messages, but at least I no longer have to restart it manually every day.
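
(For reference, the schedule looks something like this if the one-liner is wrapped in a small script; the script path is a placeholder:)
Code:
*/10 * * * * /root/check_replication.sh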
 