Replication Frequently Getting Stuck


David E

Contributor
Joined
Nov 1, 2013
Messages
119
Hi All-
I have one volume that contains a single child dataset. Recursive snapshot tasks are configured on the volume at multiple intervals. I have a second FreeNAS machine that is the target of a replication task set to replicate recursively all day. So far the snapshots of the volume itself have consistently been kept up to date, but the snapshots of the child dataset frequently get stuck and stop replicating, and they contain the bulk of the data (what little there currently is) that is changing. I can manually 'catch them up' by issuing:

zfs send -R -I parent/child@auto-20140123.0943-2d parent/child@auto-20140123.1143-2d | ssh -p 50501 -i /data/ssh/replication backup1 zfs receive -F -d parent

With the appropriate stuck and current snapshot names... but I'm not sure why I should need to be doing this. Looking at /var/log/messages (which shows no errors, by the way), it looks like this stems from the replication task sending only incremental snapshots of the volume. In theory those should contain the child dataset snapshots, but somehow they occasionally go missing or fall behind, and then the dataset is permanently stuck.
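One way to check that theory is a dry-run send (a sketch; the parent-level snapshot names here are assumed to follow the same @auto naming as above): -n builds the stream without sending anything, and -v lists each snapshot that would be included, so you can see whether the child dataset snapshots are actually part of the recursive incremental.

zfs send -n -v -R -I parent@auto-20140123.0943-2d parent@auto-20140123.1143-2d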

Any thoughts or ideas? I can write my own cron job to resolve this manually, but I'd prefer to fix the root of the problem if possible.
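In case it helps anyone, here is a minimal sketch of that cron job (assuming the same SSH port/key as the command above; the dataset names are just my setup): it finds the newest snapshot the target already has and sends everything newer.

#!/bin/sh
# Catch up one dataset on the backup host; names and SSH details are examples.
DS="parent/child"
REMOTE="ssh -p 50501 -i /data/ssh/replication backup1"

# Newest snapshot of $DS already present on the target.
LAST=$($REMOTE "zfs list -H -t snapshot -o name -s creation -d 1 $DS" | tail -1 | cut -d@ -f2)
# Newest snapshot of $DS on this machine.
NEWEST=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DS" | tail -1 | cut -d@ -f2)

# Only send the missing range if the target is actually behind.
if [ -n "$LAST" ] && [ "$LAST" != "$NEWEST" ]; then
    zfs send -R -I "$DS@$LAST" "$DS@$NEWEST" | $REMOTE zfs receive -F -d parent
fi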
 
D

dlavigne

Guest
If I'm understanding your setup correctly, I think you are hitting a known limitation: you cannot have multiple periodic snapshot tasks for the same dataset when using ZFS replication. This is a current limitation.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
Thanks for responding. Let me try to clarify, to make sure you understand my setup and I understand the limitation I'm running into.

Volumes:
-1 volume 'parent' which is a stripe over two raidz arrays with 3 drives each
-4 child datasets (now). Originally it was just 'child'; I added CrashPlan today, so there is now a 'jails' dataset at the same level as 'child', with jails/crashplan_1 and jails/.warden-template-9.1-RELEASE-amd64-pluginjail below it.

Periodic Snapshot Tasks:
-'parent' every hour, keep up to 2 days, recursive
-'parent' every day, keep up to a week, recursive
-'parent' every week, keep up to 6 months, recursive

Replication Tasks:
-'parent', recursively replicate

Currently, on the master server, the snapshots are all created and deleted on schedule as expected. The slave is also consistently replicating all snapshots of 'parent' itself, but on at least two occasions I have observed one of the child datasets get stuck on a particular snapshot and stop receiving updated snapshots.
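For reference, this is roughly how I spot a stuck dataset (reusing the SSH details from my first post; adjust for your own hosts): list the most recent snapshots on each side, sorted by creation time, and check whether a dataset's latest snapshot on the slave lags behind the master.

# Most recent snapshots on the master...
zfs list -H -t snapshot -o name -s creation -r parent | tail -10
# ...and on the slave; a stuck dataset's snapshots stop showing up near the end.
ssh -p 50501 -i /data/ssh/replication backup1 "zfs list -H -t snapshot -o name -s creation -r parent" | tail -10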

If I understand you correctly, you are saying that having multiple snapshot tasks for the same volume is the likely cause of this problem? If so, can you elaborate on why?

Thanks!
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I'm not sure multiple periodic snapshot tasks on a dataset are supported by FreeNAS. They weren't a year ago: https://bugs.freenas.org/issues/1646

What I do is use a script by fracai, who maintains it on GitHub here: https://github.com/fracai/zfs-rollup/blob/master/rollup.py

The way to use it is:
(1) Set up a single periodic snapshot task for 'parent': snapshot hourly and keep for 6 months.
(2) Set up a cron job to run rollup.py, and have it trim your snapshots with "-hourly:48 -daily:7 -weekly:24".

That should give you 2 days' worth of hourly snapshots, a week's worth of dailies, and roughly 6 months' worth of weeklies, just like you wanted.

On my system I do the pruning in the 23:00-23:59 window each day, then replicate to a backup server in the 0:00-4:00 window. Works flawlessly.
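As a concrete sketch (the interpreter and script paths are examples, and the options are the ones above; check rollup.py's usage output for the exact flag syntax), the crontab entry for the pruning step would look something like:

# Prune snapshots nightly at 23:30; paths and flags are examples, verify against rollup.py's usage.
30 23 * * * /usr/local/bin/python /root/zfs-rollup/rollup.py -hourly:48 -daily:7 -weekly:24 parent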
 