Replication of large snapshots over an unstable link, ssh disconnecting causing restart of transfer

logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
I am trying to achieve snapshot replication over a high-latency VPN between Europe and Asia, but the connection keeps dropping before the first snapshot is successfully transferred, and each retry restarts the transfer from the beginning.

Sending server:
FreeNAS 9.10.2-U6 (virtualized, VMware ESXi 6.7)
12 GB RAM, 2 vCPU (Xeon E3-1226 v3)
Internet connectivity is symmetrical 250/250 Mbps

Receiving server:
TrueNAS 12.0-U8 (virtualized, VMware ESXi 7.0)
16 GB RAM, 2 vCPU (Pentium Gold 5405U)
Internet connectivity is symmetrical 1000/1000 Mbps (shared in a large building; speeds typically reach 500+ Mbps)

A copy of pfSense is running on each local virtualization host, and the two are linked using a WireGuard VPN. RTT latency is around 265 ms.

Transfer speed from the sending to the receiving server fluctuates quite a bit, between 15 and 60 Mbps. After setting the sending side's TCP congestion control algorithm to htcp (net.inet.tcp.cc.algorithm=htcp), the transfer speed seems to average around 25 Mbps.
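
In case it helps anyone, this is roughly how I switched the congestion control algorithm on the sending FreeNAS box. These are standard FreeBSD sysctls; on FreeNAS the persistent settings would normally be added under System -> Tunables rather than by editing files directly:

# Load the H-TCP congestion control module
# (persist with cc_htcp_load="YES" in /boot/loader.conf)
kldload cc_htcp

# List the algorithms now available, then switch the active one for new connections
sysctl net.inet.tcp.cc.available
sysctl net.inet.tcp.cc.algorithm=htcp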

The first snapshot the replication function attempts to synchronize is 58 GB (sent without compression; I had to disable transfer compression due to the version difference between the sender and receiver systems). Continuously checking "zfs list" on the receiving server, I see the initial snapshot replication reach roughly 1-2 GB before the connection is reset and the transfer restarts from the beginning.
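
For reference, this is how I keep an eye on the incoming data on the receiving server; the dataset name below is just a placeholder for my real path:

# Poll the target dataset every 30 seconds and watch USED grow
while true; do
    date
    zfs list -o name,used,refer mypool/mydataset/backup
    sleep 30
done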

/var/log/debug.log on the sending server shows only that the connection was closed by the remote host (I assume some hiccup due to the poor link and high latency), and approximately 20 seconds prior to this the throughput as seen from pfSense had dropped to zero.

Feb 24 04:38:01 freenas autorepl.py: [tools.autorepl:157] Replication result: Connection to truenas.localdomain closed by remote host. Failed to write to stdout: Broken pipe

"zfs recv" has an option "-s" that creates a token for use with "zfs send" to resume from a partially received state, but I do not see that this is used by the replication function available in the GUI on either system.

As a trial I am manually sending this initial snapshot using the resume functionality. It requires fetching the latest token after each partial failure using a "zfs get" command such as "/sbin/zfs get -H -o value receive_resume_token <target_dataset>", then passing that token to "zfs send -t <token>", roughly as shown below.
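
In other words, after each drop I run something like this from the sending server (same placeholder names as above):

# Fetch the current resume token from the receiving side...
TOKEN=$(ssh root@truenas.localdomain \
    "zfs get -H -o value receive_resume_token mypool/mydataset/backup")

# ...and restart the stream from where it left off
zfs send -v -t "$TOKEN" | \
    ssh root@truenas.localdomain "zfs recv -s mypool/mydataset/backup"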

For those curious, sending again without a token to a destination that already holds a "zfs recv -s" partial state gives this error:

cannot receive new filesystem stream: destination mypool/mydataset/backup contains partially-complete state from "zfs receive -s".

If I try to resume with an incorrect (old) token for "zfs send -t", the error is not very helpful:

cannot receive resume stream: kernel modules must be upgraded to receive this stream.

Finally, my question:

Is there any way to configure (via GUI or otherwise) the automatic ZFS replication to allow for resume of snapshot transfers, or is there some other way to mitigate this situation via built-in functionality?


awasb

Patron
Joined
Jan 11, 2021
Messages
415

logan893

Dabbler
Joined
Dec 31, 2015
Messages
44
Thanks @awasb for the suggestions; it's great to hear about alternatives even if they are not native. With the source server on such an old version of FreeNAS, however, I am unable to easily install either of the two there. Perhaps I could get one of them working on the target server.

Your links indirectly led me to the new TrueNAS replication function, zettarepl. It seems to support both resume and PULL replication. It appears to only work when connecting as the root user over ssh, but I'm giving it a try.

So far, this PULL replication to TrueNAS seems much more stable than the PUSH replication from FreeNAS (more than 7 GB transferred without a hiccup so far). It could just be that today is a good day for the packets traversing the interwebs, but the PUSH replication attempted only minutes earlier still didn't behave well.

And for future reference: to remove the partially received state, I used "zfs recv -A <filesystem/volume>".
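
That is, on the receiving server, roughly the following (placeholder dataset name again):

# Abort and discard the partially received state, then confirm the token is gone
# (the property should show "-" afterwards)
zfs recv -A mypool/mydataset/backup
zfs get receive_resume_token mypool/mydataset/backup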
 