Replication (sometimes) fails with "ssh_dispatch_run_fatal"

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
I have two TrueNAS servers in different locations, nasA and nasB. nasB pulls snapshots from nasA every day, using a replication task.

This replication worked well for a long time. I recently had to reinstall the operating system due to a failure of the boot drive, and the problems started after that. Both systems are on TrueNAS-13.0-U3.1.

The replication is set to run in the evening. Often, but not every time, I get an email after about half an hour with the following alert:
Code:
* Replication "NASA volume1" failed: ssh_dispatch_run_fatal: Connection to 10.xx.x.Y port 22: message authentication code incorrect
  (stdin): Unexpected EOF in worker 3
cannot receive incremental stream: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
    zfs send -t 1-14026c5196-118-789c636064000310a50...


The last line continues with a long string of alphanumeric characters, truncated here with '..'. The console log shows nothing around the time the mail is sent. Settings for the replication task are in the screenshot below.

If anybody has any ideas about this, I would be glad to hear them. Where can I find more information? Is there another log?
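
(For reference, since the partially received snapshot is saved, the transfer can in principle be resumed rather than restarted from scratch: the token is also exposed as the receive_resume_token property on the target dataset. A resume would look roughly like this, with placeholder dataset names and the token elided:)
Code:
# on the pulling system (nasB): read the saved token, then resume the pull over SSH
zfs get -H -o value receive_resume_token volume1/some-dataset
ssh nasa.example.org zfs send -t <token> | zfs receive -s volume1/some-dataset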

Screenshot 2022-12-20 at 10-33-37 TrueNAS - nasb.stans.iggland.li.png
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
So I found the right log to look at (and some more time..). I found the following in /var/log/zettarepl.log:

The pulling system asks for progress reports during the replication and gets them many times (roughly every 2%, from 3% to 84%). Then, on a failure, this happens:
Code:
zfs send with PID 69643 is gone
ssh_dispatch_run_fatal: Connection to 10.xx.x.Y port 22: message authentication code incorrect


What follows is in the attached error log.

If I manually replicate a single snapshot, I can get a similar message:
Code:
# ssh nasa.example.org zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | zfs receive volume1/some-dataset/test
full send of volume1/some-dataset@daily-2023-01-02_21-00 estimated size is 48.7G
total estimated size is 48.7G
TIME        SENT   SNAPSHOT volume1/some-dataset@daily-2023-01-02_21-00
10:48:20   2.86M   volume1/some-dataset@daily-2023-01-02_21-00
10:48:21   7.50M   volume1/some-dataset@daily-2023-01-02_21-00
...
    deleted similar lines
...
11:19:22   24.5G   volume1/some-dataset@daily-2023-01-02_21-00
11:19:23   24.5G   volume1/some-dataset@daily-2023-01-02_21-00
Fssh_ssh_dispatch_run_fatal: Connection to 10.xx.x.Y port 22: message authentication code incorrect
cannot receive new filesystem stream: incomplete stream


On the sending system, what I can see around the same time in zettarepl.log is the creation and destruction of other snapshots on other datasets; nothing directly to do with the snapshot being transferred in this example.

Any ideas? These two examples involve different datasets; the problem doesn't stick to any one dataset, and it does not fail every time either.
 

Attachments

  • zettarepl_extract.txt (7.6 KB)

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
Going the other way, i.e. doing a manual push (along the lines of the sketch below), the replication also quits with an error:

Code:
client_loop: send disconnect: Broken pipe
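
(The exact push command isn't quoted above; think of it as the mirror of the manual pull from earlier, roughly along these lines with placeholder host and dataset names:)
Code:
zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | ssh nasb.example.org zfs receive volume1/some-dataset/test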
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
I submitted a bug report; the suggested solution there seems to be to make sure the replication is restarted after it fails.

I never figured this out. For the time being, I manually ran the replication as many times as required each day to make it reach the end (each time it is restarted, it makes some progress).
For a few days now, I have not had any more failures. I have no idea why, or whether it's just luck.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
For both the replication sender and the target, enable SSH keep-alives by setting this Auxiliary parameter in the SSH service:

ClientAliveInterval 60
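
(For context: TrueNAS appends the SSH service's Auxiliary parameters to sshd_config, so the effective configuration on each box should end up with something like the lines below. ClientAliveCountMax is an optional extra, not part of the setting above:)
Code:
# send a keep-alive probe every 60 seconds; drop the session after 3 unanswered probes
ClientAliveInterval 60
ClientAliveCountMax 3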
 
datafiregroup

Joined
Oct 29, 2018
Messages
2
@Samuel Tai - Thanks for the input. I do have that set on both NAS machines (13.0-U4). Unfortunately, I still have the same issues, sometimes after 5 minutes, sometimes after 15. I am trying to replicate a 14 TB dataset, so @AbsolutIggy's suggestion of manually restarting until it completes won't work very well for me, though it did work to replicate a 250 GB dataset.

I'm just not sure where to turn at this point. The bug ticket that was opened didn't seem to suggest a root cause or a workaround.
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
@Samuel Tai Thanks, I did not have that set, so I've added it now. However, since this problem hasn't appeared in nearly two weeks, I don't know whether I'll be able to tell if it makes a difference.

@datafiregroup I agree my solution isn't really a solution, but I didn't have an alternative, as I need the replication. It's a bit of a PITA, though, running it manually every day. I considered making a script that restarts the replication through the API, but never got around to it. The total dataset is around 7 TB, by the way, not just 250 GB. It wasn't always the same dataset causing trouble; it seemed to vary. Failure times varied, as did the amount of transferred data.

Have you tried replicating your snapshots locally, without going over SSH, to see if you have the same issue?
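
(A local send/receive on the sending box, roughly like this with placeholder names, would take SSH out of the picture entirely; the target dataset must not already exist:)
Code:
zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | zfs receive volume1/replication-test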
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I think what's going on is SSH mis-interpreting a byte in the replication stream as an SSH ~ command, which causes it to close the connection, and bork the replication. Try switching to SSH + netcat.
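
(One way to rule that out with the manual pull from earlier is to disable escape-character processing explicitly; note that ssh normally only interprets the escape character on sessions with a pty, so this is just a test, not a confirmed fix:)
Code:
ssh -e none nasa.example.org zfs send -v volume1/some-dataset@daily-2023-01-02_21-00 | zfs receive volume1/some-dataset/test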
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
OK, so my error has started appearing again. I've now switched to SSH+NETCAT; let's see if that makes the problem go away permanently.
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
After continuing the replication with SSH+NETCAT, it didn't take long for another error to appear; it's slightly different from before:

Code:
Replication "NAS volume1" failed: Passive side: I/O error
Active side: checksum mismatch or incomplete stream.
Partially received snapshot is saved.
A resuming stream can be generated on the sending system by running:
    zfs send -t 1-16581aee9a-128-789c636064000310a501c49c503..


(With more alphanumeric characters following..)
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
Yeah, SSH+NETCAT is definitely not any better. The failures keep occurring.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
You've got some sort of problem with the replication source. How often do you scrub your pool?
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
Hmm, interesting idea, but scrubs are scheduled every month.
 

AbsolutIggy

Dabbler
Joined
Feb 29, 2020
Messages
31
@datafiregroup If you're still having this problem, maybe the following will help:

I've put this command in my crontab. It uses the API to check the status of the replication; if it's in an error state, the replication is restarted. You'll need to look up the ID of your replication and substitute it in both URLs (and use your own API key).
Code:
if [ "$(curl -s -X GET -H "Authorization: Bearer 1-H..." "http://localhost/api/v2.0/replication/id/1" | jq '.state.state')" = '"ERROR"' ]; then curl -s -X POST -H "Authorization: Bearer 1-H..." "http://localhost/api/v2.0/replication/id/1/run"; fi


It now runs every 10 minutes on my NAS. I still get error messages, but at least I no longer have to restart it manually every day.
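
(For reference, the schedule looks something like this if the one-liner is wrapped in a small script; the script path is a placeholder:)
Code:
*/10 * * * * /root/check_replication.sh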
 