Replication results in deadlock (rrl->rr_) on zfs reads

Status
Not open for further replies.

parpar

Dabbler
Joined
Feb 10, 2013
Messages
15
I need some help troubleshooting a deadlock on my backup machine. During its incremental replication, all processes accessing the dataset end up blocked on the rrl->rr_ lock, as reported in the MWCHAN field of ps -axHl -O lwp.
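For anyone seeing the same symptom: the kernel stacks of the blocked threads can be inspected with procstat, which shows where they sleep. The PIDs below are the ones from my ps output further down; substitute your own.
Code:
  # Print kernel stack traces of the blocked threads
  # (213 = middlewared, 251 = the hung ls in the output below)
  procstat -kk 213 251

  # List only the threads stuck on the rrl->rr_ wait channel
  ps -axHl -O lwp | grep 'rrl->rr_'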

Setup:
The master takes periodic snapshots (multiple per day, kept from several weeks up to a year). All of its replication tasks point to the slave but are disabled, as the slave is normally offline.
Once a week, the master sends a wake-on-LAN packet to the slave, waits for it to boot, and then starts a slightly modified autorepl.py (modified to ignore the 'enabled' flag on the replication tasks, so it replicates regardless of that flag). A rough sketch of this weekly job is below.
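To make the flow concrete, this is approximately what the weekly job on the master looks like. The interface name, MAC address, host name, and autorepl.py path are placeholders; the real script has more error handling.
Code:
  #!/bin/sh
  # Weekly wake-and-replicate job on the master (simplified sketch;
  # SLAVE_MAC, SLAVE_HOST and the script path are placeholders).
  SLAVE_MAC="00:11:22:33:44:55"
  SLAVE_HOST="slave.local"

  # Wake the slave via wake(8) and wait until it answers pings
  wake em0 "$SLAVE_MAC"
  while ! ping -c 1 -t 2 "$SLAVE_HOST" > /dev/null 2>&1; do
      sleep 10
  done

  # Run the modified autorepl.py that ignores the 'enabled' flag
  # (interpreter and path may differ per FreeNAS version)
  python /usr/local/www/freenasUI/tools/autorepl.py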

Result:
On one specific dataset, the replication task stays in status "Sending" on the master. Meanwhile on the slave the corresponding /sbin/zfs receive -F -d appears stalled, and processes accessing the dataset end up in the locked state mentioned above (python3.6/middlewared, collectd, or even a simple ls on the dataset), for example:
Code:
  UID  PID  PPID CPU PRI NI    VSZ    RSS MWCHAN   STAT TT     TIME COMMAND            PID    LWP TT STAT    TIME COMMAND
    0  213   211   0  20  0 265512 235112 rrl->rr_ D     -  0:53.81 python3.6: middl   213 101149  - D    0:53.81 python3.6: middlewared (python3.6)
    0  251   233   0  20  0   6952   2680 rrl->rr_ D+    2  0:00.00 ls -la /mnt/zdat   251 101915  2 D+   0:00.00 ls -la /mnt/zdata


Noteworthy:
Destroying the entire dataset and pool on the slave results, under otherwise identical conditions, in a successful full replication with no hangs (sketched below). The hang is reproducible on incremental replications of the same dataset, even though the dataset itself appears healthy. The dataset did not change over the weekly replication interval, so the incremental snapshots are empty.
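For reference, the workaround boils down to removing the dataset on the slave and re-seeding it with a full replication stream; the pool, dataset, and snapshot names below are illustrative, not my actual ones.
Code:
  # On the slave: destroy the affected dataset including all snapshots
  zfs destroy -r zdata/dataset

  # On the master: re-seed with a full, recursive replication stream
  zfs send -R zdata/dataset@auto-weekly | ssh slave zfs receive -F -d zdata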

Hardware:
Master:
Build FreeNAS-11.1-U4
Platform Intel(R) Atom(TM) CPU C2750 @ 2.41GHz
Memory 32703MB
Disks: 4x3TB striped mirror (3TB WD RED)

Slave:
Build FreeNAS-11.1-U4
Platform AMD Turion(tm) II Neo N40L Dual-Core Processor
Memory 8121MB
Disks: 4x2TB Raid-Z1 (3x Seagate ST32000542AS + 1x TOSHIBA DT01ACA200)

Any help to further troubleshoot is greatly appreciated!
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
Did you ever run a memtest for several days on the slave?
 

parpar

Dabbler
Joined
Feb 10, 2013
Messages
15
It is running, MrToddsFriends: 3 passes, 0 errors so far (I know it is a slooow machine)....
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
It is running, MrToddsFriends: 3 passes, 0 errors so far (I know it is a slooow machine)....

I don't expect further memtest findings after three passes with no errors.
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
Any other suggestions?

What happens if you keep the slave powered on for testing purposes, use the unmodified autorepl.py, and enable all the replication tasks you initially disabled?
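If you try that, a loop like the one below (run on the slave) would show whether the receive lands in the same D state on the rrl->rr_ channel; the pgrep pattern is just an example.
Code:
  # Periodically report state and wait channel of any running zfs receive
  while true; do
      for pid in $(pgrep -f 'zfs receive'); do
          ps -o pid,state,mwchan,command -p "$pid"
      done
      sleep 30
  done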
 