I need some help troubleshooting a deadlock on my backup machine. During its incremental replication, all processes accessing the data set end up in a lock on rrl->rr_, as reported by a
ps -axHl -O lwp
in the MWCHAN field.

Set up:
The master takes periodic snapshots (multiple per day, keeping snapshots for several weeks up to a year). All its replication tasks point to the slave but are disabled (as the slave is normally offline).
Once a week, the master sends a wake-on-lan packet to the slave, waits for the slave to boot, and starts a slightly modified autorepl.py (modified to ignore the 'enabled' flag on the replication tasks, so it replicates forcefully, regardless of the flag).
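For reference, the wake-on-lan step can be sketched in Python. This is a minimal illustration only, not the actual script: the MAC and broadcast addresses are placeholders, and in the real setup this step is followed by running the modified autorepl.py.

```python
import socket

def wol_packet(mac: str) -> bytes:
    """Build a wake-on-lan magic packet: 6 x 0xFF followed by the
    target MAC address repeated 16 times (102 bytes total)."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (discard port 9)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(wol_packet(mac), (broadcast, port))

# wake("aa:bb:cc:dd:ee:ff")  # hypothetical MAC of the slave
```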
Result:
On one specific dataset, the replication task stays in status "Sending" on the master. On the slave, the task
/sbin/zfs receive -F -d
seems stalled, and processes accessing the dataset get into the locked state mentioned above (Python 3.6, collectd, or a simple ls on the dataset). For example:

Code:
UID PID PPID CPU PRI NI    VSZ    RSS MWCHAN   STAT TT    TIME COMMAND          PID    LWP TT STAT    TIME COMMAND
  0 213  211   0  20  0 265512 235112 rrl->rr_ D    -  0:53.81 python3.6: middl 213 101149  - D    0:53.81 python3.6: middlewared (python3.6)
  0 251  233   0  20  0   6952   2680 rrl->rr_ D+   2  0:00.00 ls -la /mnt/zdat 251 101915  2 D+   0:00.00 ls -la /mnt/zdata
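To spot which processes are wedged on that wait channel, the listing can be filtered programmatically. A minimal Python sketch, assuming the column layout of `ps -axHl` as shown above (the function name is mine, not part of any tool):

```python
def stuck_processes(ps_output: str, wchan: str = "rrl->rr_"):
    """Return (pid, command) pairs for ps lines whose MWCHAN column
    matches the given wait channel. Assumed column layout of `ps -axHl`:
    UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND ..."""
    result = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) > 12 and fields[8] == wchan:
            result.append((fields[1], " ".join(fields[12:])))
    return result
```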
Noteworthy:
Destroying the entire dataset and pool on the slave results, under the same conditions, in a successful replication (no hangs). On incremental replications the hang is reproducible on the same dataset, although that dataset itself seems to have no problems. The dataset did not change over the weekly replication interval, so the snapshots are empty.
Hardware:
Master:
Build FreeNAS-11.1-U4
Platform Intel(R) Atom(TM) CPU C2750 @ 2.41GHz
Memory 32703MB
Disks: 4x3TB striped mirror (3TB WD RED)
Slave:
Build FreeNAS-11.1-U4
Platform AMD Turion(tm) II Neo N40L Dual-Core Processor
Memory 8121MB
Disks: 4x2TB Raid-Z1 (3x Seagate ST32000542AS + 1x TOSHIBA DT01ACA200)
Any help to further troubleshoot is greatly appreciated!