Hung Replication

Hi folks,

Source Machine
Backplane: SAS846EL2
FreeNAS Release: FreeNAS-11.2-RELEASE-U3
Board: SuperMicro X11SSM-F
Processor: Intel Core i3-7100 @ 3.90GHz, 2 Cores, 4 Threads
Memory: 2x16GB Crucial CT16G4WFD8266 DDR4-2666 ECC UDIMM
NIC: Chelsio T520-CR
HBA: LSI/Broadcom SAS9207-8i, 6Gbps SAS PCI-E 3.0 HBA, Firmware 20.00.07.00
tank: 1x RAIDZ2, 6x 3TB vdev
vault (encrypted): 1x RAIDZ2, 6x 2TB vdev
Boot pool: mirrored 2 x 64GB SuperMicro SATA DOM SSD-DM064-SMCMVN1

Target Machine
FreeNAS Release: FreeNAS-11.2-RELEASE-U6
Board: SuperMicro X9SCM-F
Processor: Intel Xeon E3-1220 @ 3.10 GHz, 4 Cores, 4 Threads
Memory: 1x8GB Crucial CT102472BD160B DDR3L-1600 ECC UDIMM
HBA: LSI/Broadcom SAS9210-8i, 6Gbps SAS PCI-E 2.0 HBA, Firmware 20.00.07.00
Expander: IBM 46M0997 16-Port SAS2, Firmware 634A
Backplane: Supermicro BPN-SAS-836TQ
NIC: Chelsio T520-CR
tank1: 1x mirrored, 2x 8TB vdev
tank2: 1x mirrored, 2x 8TB vdev
vault1: 1x mirrored, 2x 8TB vdev
vault2: 1x mirrored, 2x 8TB vdev
Boot pool: 1 mirror vdev, 2 x 64GB TOSHIBA-TR200 SSD

Network:
The two machines are connected via fibre directly between the two 10G NICs.

For about a year now my backup strategy has been as follows. Every month or so I swap tank1 and vault1 out of my backup server and swap in tank2 and vault2. At the time of the swap I also disable the replication tasks on the source machine for the two pools I am swapping out and enable the two new ones. There are four replication tasks on the source machine, one for each target pool on the target machine.

Recently I had to rebuild some of my target replication pools; I rebuilt tank1 and vault1 completely. Replication to vault1 was normal and took only a few hours to complete from an empty pool.

Replication to tank1, however, was impossibly slow. Whereas vault1 reported a percent complete when I clicked the replication icon in the top right, tank1 only shows "sending"; the same is true in the replication task menu. After several hours, only a few hundred megabytes had been transferred.
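
(The "few hundred megabytes" figure is a rough estimate from the used space reported on the target; a sketch of that kind of check, run on the target machine, is below.)

Code:
# show how much data has actually landed in the target pool and its first-level datasets
zfs list -d 1 -o name,used,refer tank1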

I have verified that I am able to SSH between machines using the key in /data/ssh/replication.
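
For reference, that check was along these lines (a sketch, not the exact invocation; the port and address are the ones that appear in the replication logs below):

Code:
# from the source, using the replication key non-interactively
ssh -i /data/ssh/replication -o BatchMode=yes -p 23 192.168.13.149 "zfs list -d 0"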

The replication task is configured to "Recursively Replicate Child Dataset Snapshots" and "Delete Stale Snapshots on Remote System". The begin time is 00:00:00 and the end time is 23:59:00. I am not using a dedicated user.

Snapshots are configured as recursive snapshots also from 00:00:00 to 23:59:00 taken every 15 minutes keeping snapshots up to 3 days. Only a single snapshot task is configured for this pool and it is configured for the top-most dataset.
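
(For reference, confirming that those periodic snapshots actually exist on the source looks something like the sketch below.)

Code:
# newest five snapshots anywhere under the pool, sorted by creation time
zfs list -H -t snapshot -o name,creation -s creation -r tank | tail -n 5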

I've tried debugging this myself and come up empty-handed. On the sending machine I moved the system dataset off of vault and onto the boot pool; that had no effect.

I have also tried deleting and recreating the zfs send configuration with no luck.

I don't see anything obviously relevant in /var/log/messages except the following, which is likely related to moving the system dataset.
Code:
Oct 15 00:00:00 NAS newsyslog[74345]: logfile turned over due to size>200K
Oct 15 00:00:00 NAS syslog-ng[19121]: Configuration reload request received, reloading configuration;
Oct 15 00:00:00 NAS syslog-ng[19121]: Configuration reload finished;
Oct 15 03:40:55 NAS syslog-ng[19121]: syslog-ng shutting down; version='3.14.1'
Oct 15 03:46:34 NAS syslogd: kernel boot file is /boot/kernel/kernel
Oct 15 03:47:23 NAS /autosnap.py: [tools.autosnap:535] Failed to create snapshot 'vault@auto-20191015.0347-1w': cannot create snapshot 'vault/.system/cores': dataset does not exist cannot create snapshot 'vault/.system/cores@auto-20191015.0347-1w': dataset does not exist no snapshots were created


In /var/log/debug.log I see the following:
Code:
Oct 15 03:47:21 NAS /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i vault/iocage/jails/phabricator/root@auto-20191014.1325-1w vault/iocage/jails/phabricator/root@auto-20191014.1330-1w | /usr/local/bin/lz4c | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 23 192.168.13.149 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'vault2' && echo Succeeded"
Oct 15 03:47:23 NAS /autosnap.py: [tools.autosnap:614] Autorepl running, skip destroying snapshots
Oct 15 03:47:23 NAS /autorepl.py: [tools.autorepl:150] Replication result: Succeeded
Oct 15 03:47:23 NAS /autorepl.py: [tools.autorepl:131] Sending zfs snapshot: /sbin/zfs send -V -p -i vault/iocage/jails/phabricator/root@auto-20191014.1330-1w vault/iocage/jails/phabricator/root@auto-20191014.1335-1w | /usr/local/bin/lz4c | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 23 192.168.13.149 "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'vault2' && echo Succeeded"
Oct 15 03:47:24 NAS /autorepl.py: [tools.autorepl:172] Checking if process 20038 is still alive
Oct 15 03:47:24 NAS /autorepl.py: [tools.autorepl:176] Process 20038 still working, quitting


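(The entries above are all for vault; pulling the corresponding lines for tank out of the same log is just a grep, e.g. the sketch below.)

Code:
# last few zfs send commands that autorepl issued for the tank pool
grep "Sending zfs snapshot" /var/log/debug.log | grep -F " tank/" | tail -n 5
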
I'm sure I've got something misconfigured, but I'm having a hard time identifying exactly what. I've seen posts where folks had similar issues, but either their fix didn't work for me or their situation was different (such as bugs in earlier versions of FreeNAS).

Let me know if there is other information I can provide to help troubleshoot; otherwise I'm hoping someone can point me in the right direction.
 
Update on this. I took a look at the processes and found that the replication task was restarting over and over. I killed the parent replication process to stop it (a rough sketch of that check is below, after the command), then checked the logs and found the following command being issued over and over.

Code:
NAS# /sbin/zfs send -V -p -i tank/iocage/jails/phabricator/root@auto-20191014.1330-1w tank/iocage/jails/phabricator/root@auto-20191014.1335-1w | /usr/local/bin/lz4c | /usr/local/bin/pipewatcher $$ | /usr/local/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p {port} {target-ip} "/usr/bin/env lz4c -d | /sbin/zfs receive -F -d 'tank1' && echo Succeeded"


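For anyone following along, the process check and kill were roughly the following (a sketch from memory; the PID placeholder is whatever the parent replication process shows in ps):

Code:
# find the replication worker and the zfs send it spawned
ps auxww | grep -E "autorepl|zfs send" | grep -v grep
# stop the looping parent, substituting the PID from the ps output
kill <parent-pid>
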
I ran the zfs send command above manually and got the following error.

Code:
cannot receive incremental stream: most recent snapshot of tank1/iocage/jails/phabricator/root does not
match incremental source


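As I understand it, that error means the newest snapshot on the target dataset is no longer the incremental source the stream expects. A sketch of the comparison that exposes the mismatch (newest snapshots on each side):

Code:
# on the source
zfs list -H -t snapshot -o name -s creation -d 1 tank/iocage/jails/phabricator/root | tail -n 3
# on the target
zfs list -H -t snapshot -o name -s creation -d 1 tank1/iocage/jails/phabricator/root | tail -n 3
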
I went to the target machine, deleted all snapshots with "phabricator" in their names, and tried to restart replication again, but this still did not solve the problem.
 
As an update, I'm destroying all snapshots on the target pool (this is EXTREMELY slow) and hoping that solves the issue. I also ran a quick iperf3 test and am seeing ~9Gb/s, so I don't think the issue is in the network hardware.
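
For reference, the bulk destroy is just a loop over the snapshot list on the target (destructive, so double-check the pool name; I'm assuming tank1 here), and the iperf3 test was the usual server/client pair. A rough sketch of both:

Code:
# on the target: destroy every snapshot on the pool, one at a time (very slow with many snapshots)
zfs list -H -t snapshot -o name -r tank1 | xargs -n 1 zfs destroy
# throughput check: run "iperf3 -s" on the target, then from the source:
iperf3 -c 192.168.13.149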
 