Replication results in deadlock (rrl->rr_) on zfs reads

Status
Not open for further replies.

parpar

Dabbler
Joined
Feb 10, 2013
Messages
15
I need some help troubleshooting a deadlock on my backup machine. During its incremental replication, all processes accessing the dataset end up blocked on the rrl->rr_ lock, as reported in the MWCHAN field of ps -axHl -O lwp.
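For anyone seeing the same symptom: the kernel stacks of the blocked threads can be inspected with procstat, which shows where they sleep. The PIDs below are the ones from my ps output further down; substitute your own.
Code:
  # Print kernel stack traces of the blocked threads
  # (213 = middlewared, 251 = the hung ls in the output below)
  procstat -kk 213 251

  # List only the threads stuck on the rrl->rr_ wait channel
  ps -axHl -O lwp | grep 'rrl->rr_'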

Setup:
The master takes periodic snapshots (multiple per day, kept from several weeks up to a year). All of its replication tasks point to the slave but are disabled, as the slave is normally offline.
Once a week, the master sends a wake-on-LAN packet to the slave, waits for it to boot, and then starts a slightly modified autorepl.py (modified to ignore the 'enabled' flag on the replication tasks, so it replicates regardless of that flag). A rough sketch of this weekly job is below.
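To make the flow concrete, this is approximately what the weekly job on the master looks like. The interface name, MAC address, host name, and autorepl.py path are placeholders; the real script has more error handling.
Code:
  #!/bin/sh
  # Weekly wake-and-replicate job on the master (simplified sketch;
  # SLAVE_MAC, SLAVE_HOST and the script path are placeholders).
  SLAVE_MAC="00:11:22:33:44:55"
  SLAVE_HOST="slave.local"

  # Wake the slave via wake(8) and wait until it answers pings
  wake em0 "$SLAVE_MAC"
  while ! ping -c 1 -t 2 "$SLAVE_HOST" > /dev/null 2>&1; do
      sleep 10
  done

  # Run the modified autorepl.py that ignores the 'enabled' flag
  # (interpreter and path may differ per FreeNAS version)
  python /usr/local/www/freenasUI/tools/autorepl.py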

Result:
On one specific dataset, the replication task stays in status "Sending" on the master. Meanwhile on the slave the corresponding /sbin/zfs receive -F -d appears stalled, and processes accessing the dataset end up in the locked state mentioned above (python3.6/middlewared, collectd, or even a simple ls on the dataset), for example:
Code:
  UID  PID  PPID CPU PRI NI    VSZ    RSS MWCHAN   STAT TT     TIME COMMAND            PID    LWP TT STAT    TIME COMMAND
    0  213   211   0  20  0 265512 235112 rrl->rr_ D     -  0:53.81 python3.6: middl   213 101149  - D    0:53.81 python3.6: middlewared (python3.6)
    0  251   233   0  20  0   6952   2680 rrl->rr_ D+    2  0:00.00 ls -la /mnt/zdat   251 101915  2 D+   0:00.00 ls -la /mnt/zdata


Noteworthy:
Destroying the entire dataset and pool on the slave results, under otherwise identical conditions, in a successful full replication with no hangs (sketched below). The hang is reproducible on incremental replications of the same dataset, even though the dataset itself appears healthy. The dataset did not change over the weekly replication interval, so the incremental snapshots are empty.
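For reference, the workaround boils down to removing the dataset on the slave and re-seeding it with a full replication stream; the pool, dataset, and snapshot names below are illustrative, not my actual ones.
Code:
  # On the slave: destroy the affected dataset including all snapshots
  zfs destroy -r zdata/dataset

  # On the master: re-seed with a full, recursive replication stream
  zfs send -R zdata/dataset@auto-weekly | ssh slave zfs receive -F -d zdata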

Hardware:
Master:
Build FreeNAS-11.1-U4
Platform Intel(R) Atom(TM) CPU C2750 @ 2.41GHz
Memory 32703MB
Disks: 4x3TB striped mirror (3TB WD RED)

Slave:
Build FreeNAS-11.1-U4
Platform AMD Turion(tm) II Neo N40L Dual-Core Processor
Memory 8121MB
Disks: 4x2TB Raid-Z1 (3x Seagate ST32000542AS + 1x TOSHIBA DT01ACA200)

Any help to further troubleshoot is greatly appreciated!
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
Did you ever run a memtest for several days on the slave?
 

parpar

Dabbler
Joined
Feb 10, 2013
Messages
15
It is running, MrToddsFriends: 3 passes, 0 errors so far (I know it is a slooow machine)....
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
It is running, MrToddsFriends: 3 passes, 0 errors so far (I know it is a slooow machine)....

I don't expect further memtest findings after three passes with no errors.
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
Any other suggestions?

What happens if you keep the slave powered on for testing purposes, use the unmodified autorepl.py, and enable all the replication tasks you initially disabled?
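If you try that, a loop like the one below (run on the slave) would show whether the receive lands in the same D state on the rrl->rr_ channel; the pgrep pattern is just an example.
Code:
  # Periodically report state and wait channel of any running zfs receive
  while true; do
      for pid in $(pgrep -f 'zfs receive'); do
          ps -o pid,state,mwchan,command -p "$pid"
      done
      sleep 30
  done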
 