SOLVED Recursive zfs send | receive through ssh stalled

Thibaut · Jul 18, 2019

Hello,

We are using FreeNAS (11.2-U4.1) as our main company storage system which has been active for 5 years (starting with FreeNAS 9.3) without any major flaws.

The main storage pool contains all our "active jobs" data, which are all based on the same dataset structure. This structure is quite simple, consisting of one main "parent" dataset containing three "child" datasets, one of which is a zvol, as follows:

work-pool/job-dataset
work-pool/job-dataset/dataset-1
work-pool/job-dataset/dataset-2
work-pool/job-dataset/zvol

Once a job has been completed, we archive the whole job to a remote system, currently running Debian 9 with the OpenZFS filesystem kernel modules, using the zfs send | zfs receive commands over ssh, as follows:

Code:

# zfs set readonly=on work-pool/job-dataset
# zfs snapshot -r work-pool/job-dataset@DATE
# zfs send -R work-pool/job-dataset@DATE | ssh root@arc.hiv.serv.ip "zfs receive -F archive-pool/job-dataset"

We haven't encountered any problem using this technique until recently.
For an unidentified reason, some datasets are not getting transferred correctly anymore, while others behave just as expected and get transferred without a problem.

What can be observed with the problematic datasets is as follows:

the recursive snapshots are created as expected
the zfs send -R ... | ssh root@...ip... "zfs receive -F ..." command gets executed
two tasks are appearing on the sending (FreeNAS) system:
- zfs send -R ...
- ssh root@...ip... zfs receive -F ...
one task is appearing on the receiving (Debian 9) system
- zfs receive -F ...
transient network activity is observed and the "parent" dataset is created on the receiving system, but then the transfer hangs and the child datasets never get transferred
network activity stops but the zfs tasks are still active, although the CPU usage of the respective tasks is inconsistent: 0% on the sending (FreeNAS) system versus 100% on the receiving (Debian 9) system:
- SENDING SYSTEM:
  cpu 0% zfs send -R work-pool/job-dataset@DATE
  cpu 0% ssh root@...ip... zfs receive -F archive-pool/job-dataset
- RECEIVING SYSTEM:
  cpu 100% zfs receive -F archive-pool/job-dataset
killing the process using its process ID on the sending system is possible:
- kill -9 ...id...
  warning: cannot send 'work-pool/job-dataset@DATE': signal received
  Killed
killing the process on the receiving system is impossible (whatever kill signal is sent); the ssh process keeps running, using 100% of one processor thread, until the system is rebooted
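
When a process ignores every signal like this, its scheduler state can help tell a task that is busy or blocked inside the kernel from an ordinary runaway userland loop. A minimal sketch for the Linux receiver, assuming /proc is mounted; the helper name proc_state is made up for illustration and is not part of any ZFS tooling:

```shell
#!/bin/sh
# Hypothetical helper: print the scheduler state of one process.
# It reads the "State:" field of /proc/<pid>/status (Linux only),
# which looks like "R (running)", "S (sleeping)" or "D (disk sleep)".
proc_state() {
    sed -n 's/^State:[[:space:]]*//p' "/proc/$1/status"
}
```

For example, proc_state 1234 (with 1234 standing in for the PID of the stuck zfs receive) prints a single line. As root, /proc/1234/stack may additionally show where in the kernel the task currently is, if the kernel exposes that file.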

Trying to send each "child" dataset on its own gives a slightly different result: network activity lasts as long as the dataset is transferring but, once all data has been transmitted, the processes hang in the same manner as described above. This seems to indicate an identical problem: since the "parent" dataset contains very little data, it looks as if the transfer operation never gets notified that all data has been transmitted, and thus hangs forever instead of finishing.

Sending / receiving the problematic datasets on the same pool (duplicating) works as expected.

Both pools are healthy and have been scrubbed without error.

We were unable to establish any difference between datasets that transfer without problem and those that stall the transfer. They are all created by duplicating a "master template" dataset using:

Code:

# zfs send -R work-pool/parent/dataset-1@source | zfs receive -F work-pool/job-dataset/dataset-1
# zfs send -R work-pool/parent/dataset-2@source | zfs receive -F work-pool/job-dataset/dataset-2
# zfs send -R work-pool/parent/zvol@source | zfs receive -F work-pool/job-dataset/zvol

Any suggestion, idea or further direction to investigate this problem would be more than welcome!

Thank you.

dlavigne · Jul 24, 2019

Were you able to track down the cause of this? If not, is it still reproducible on 11.2-U5?

Thibaut · Jul 26, 2019

dlavigne said:
Were you able to track down the cause of this?

Unfortunately not :-(
We kept trying to identify anything that could indicate what causes this behavior, but to no avail so far.

dlavigne said:
If not, is it still reproducible on 11.2-U5?

As this is a production system, we haven't yet found a time slot that would allow us to restart it. The upgrade to 11.2-U5 is planned for the coming weekend though.
I'll obviously report here whether this upgrade changes anything regarding the reported situation.

Thank you for paying attention to our problem.

Thibaut · Sep 1, 2019

The 11.2-U5 upgrade didn't solve the problem, but I might finally be on the track of the problem's origin!

While digging to identify the possible causes of this broken behavior, I stumbled upon a similar report in the Ubuntu bug tracker (#1733230), which led me to a pull request conversation on the zfsonlinux / zfs GitHub repository (#6616) that does identify the problem and explains in great detail what is going on under the hood. Feel free to read it in case you're interested in getting all the gory details of what got broken...

In a nutshell, the 0.6.5.x version of zfsonlinux has problems receiving a zfs stream from a more recent version of ZFS. It is stated that this has been fixed in release 0.7.x.
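
Given that, a pre-flight check on the receiver's version before piping a stream to it could avoid an unkillable receive. The following is only a sketch: version_ge and check_receiver are hypothetical helper names, the 0.7.0 threshold is taken from the "fixed in 0.7.x" statement above, and a sort(1) with version sort (-V) is assumed.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeeds when version $1 >= version $2.
# Assumes a sort(1) that supports -V (version sort), as in GNU coreutils.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical guard: refuse to start a transfer towards a pre-0.7.0 receiver.
# $1 is the receiver's ZFS version string, however it was obtained.
check_receiver() {
    if version_ge "$1" "0.7.0"; then
        echo "receiver ZFS $1: ok"
    else
        echo "receiver ZFS $1: older than 0.7.0, transfer may stall" >&2
        return 1
    fi
}
```

Called as check_receiver 0.6.5.9-5, this returns a non-zero status, so a script could bail out before running zfs send.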

Unfortunately, as stated in the OP, the receiving system in our configuration is currently running Debian 9, which makes use of version 0.6.5.9-5 of zfsonlinux:

Code:

$ lsb_release -rd
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9

$ sudo apt-cache policy zfsutils-linux
zfsutils-linux:                                                      
  Installed: 0.6.5.9-5                                                
  Candidate: 0.6.5.9-5                                                
  Version table:                                                      
*** 0.6.5.9-5 500                                                    
        500 http://deb.debian.org/debian stretch/contrib amd64 Packages
        100 /var/lib/dpkg/status

The latest Debian (10 / Buster) release is currently using zfsonlinux 0.7.12-2+deb10u1, which presumably should fix the problem. So, the next step will be for us to upgrade the receiving system to Debian 10.

I'll post back here once this is done and we have tested the currently failing transfer of a ZFS dataset... Fingers crossed!

Thibaut · Sep 4, 2019

I'm happy to confirm that upgrading to zfsonlinux 0.7.12 solved the problem!

After upgrading the receiving system to Debian 10 and installing the zfs-dkms and zfsutils-linux packages from Debian's contrib repository, as described on this zfsonlinux page, the zfs send | zfs receive stream transfer over ssh no longer showed any problem.
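
For anyone upgrading the packages in place rather than reinstalling, it may be worth confirming that the kernel module actually loaded is the new one, since the userland tools and the loaded module can differ until the module is reloaded or the system rebooted. A small sketch, assuming ZFS on Linux exposes the loaded module's version in /sys/module/zfs/version; zfs_module_version is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the version of the currently loaded zfs kernel module.
# Assumption: ZFS on Linux exposes it in /sys/module/zfs/version.
# The path can be overridden as $1, which is mainly useful for testing.
zfs_module_version() {
    version_file=${1:-/sys/module/zfs/version}
    if [ -r "$version_file" ]; then
        cat "$version_file"
    else
        echo "zfs module not loaded (no $version_file)" >&2
        return 1
    fi
}
```

Run without arguments on the receiver, it prints the loaded module's version string, or fails if the module is not loaded.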

On this same page, it is stated that Debian's backports repository "often provides newer releases of ZFS". So it is possible that using the zfs packages from backports on Debian 9 would have allowed us to upgrade to a 0.7.x version, but we didn't want to deal with the complications involved in managing regular packages when the backports are activated, so we didn't bother and went straight to a new Debian 10 installation.

Hopefully the reported experience might help someone else facing the same behavior...

Best regards.
