Replication Frequently Getting Stuck


David E

Contributor
Joined
Nov 1, 2013
Messages
119
Hi All-
I have one volume that contains a single child dataset. Recursive snapshot tasks are configured on the volume at multiple intervals. I have a second FreeNAS machine that is the target of a replication task set to replicate recursively all day. So far the snapshots of the volume itself have consistently been kept up to date, but the snapshots of the child dataset frequently get stuck and stop replicating, and they contain the bulk of the data (what little there currently is) that is changing. I can manually 'catch them up' by issuing:

zfs send -R -I parent/child@auto-20140123.0943-2d parent/child@auto-20140123.1143-2d | ssh -p 50501 -i /data/ssh/replication backup1 zfs receive -F -d parent

With the appropriate stuck and current snapshot names... but I'm not sure why I should need to be doing this. Looking at /var/log/messages (which shows no errors, by the way), it looks like this stems from the replication task sending only incremental snapshots of the volume. In theory those should contain the child dataset snapshots, but somehow they occasionally go missing or fall behind, and then the dataset is permanently stuck.
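One way to check that theory is a dry-run send (a sketch; the parent-level snapshot names here are assumed to follow the same @auto naming as above): -n builds the stream without sending anything, and -v lists each snapshot that would be included, so you can see whether the child dataset snapshots are actually part of the recursive incremental.

zfs send -n -v -R -I parent@auto-20140123.0943-2d parent@auto-20140123.1143-2d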

Any thoughts or ideas? I can write my own cron job to resolve this manually, but I'd prefer to fix the root of the problem if possible.
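In case it helps anyone, here is a minimal sketch of that cron job (assuming the same SSH port/key as the command above; the dataset names are just my setup): it finds the newest snapshot the target already has and sends everything newer.

#!/bin/sh
# Catch up one dataset on the backup host; names and SSH details are examples.
DS="parent/child"
REMOTE="ssh -p 50501 -i /data/ssh/replication backup1"

# Newest snapshot of $DS already present on the target.
LAST=$($REMOTE "zfs list -H -t snapshot -o name -s creation -d 1 $DS" | tail -1 | cut -d@ -f2)
# Newest snapshot of $DS on this machine.
NEWEST=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DS" | tail -1 | cut -d@ -f2)

# Only send the missing range if the target is actually behind.
if [ -n "$LAST" ] && [ "$LAST" != "$NEWEST" ]; then
    zfs send -R -I "$DS@$LAST" "$DS@$NEWEST" | $REMOTE zfs receive -F -d parent
fi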
 
D

dlavigne

Guest
If I'm understanding your setup correctly, I think you are hitting a known limitation: you cannot have multiple periodic snapshot tasks for the same dataset when using ZFS replication. This is a current limitation.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
Thanks for responding. Let me try to clarify, to make sure you understand my setup and I understand the limitation I'm running into.

Volumes:
-1 volume 'parent' which is a stripe over two raidz arrays with 3 drives each
-4 child datasets (now). Originally it was just 'child'; I added CrashPlan today, so there is now a 'jails' dataset at the same level as 'child', with jails/crashplan_1 and jails/.warden-template-9.1-RELEASE-amd64-pluginjail below it.

Periodic Snapshot Tasks:
-'parent' every hour, keep up to 2 days, recursive
-'parent' every day, keep up to a week, recursive
-'parent' every week, keep up to 6 months, recursive

Replication Tasks:
-'parent', recursively replicate

Currently, on the master server, the snapshots are all created and deleted on schedule as expected. The slave is also consistently replicating all snapshots of 'parent' itself, but on at least two occasions I have observed one of the child datasets get stuck on a particular snapshot and stop receiving updated snapshots.
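For reference, this is roughly how I spot a stuck dataset (reusing the SSH details from my first post; adjust for your own hosts): list the most recent snapshots on each side, sorted by creation time, and check whether a dataset's latest snapshot on the slave lags behind the master.

# Most recent snapshots on the master...
zfs list -H -t snapshot -o name -s creation -r parent | tail -10
# ...and on the slave; a stuck dataset's snapshots stop showing up near the end.
ssh -p 50501 -i /data/ssh/replication backup1 "zfs list -H -t snapshot -o name -s creation -r parent" | tail -10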

If I understand you correctly, you are saying that having multiple snapshot tasks for the same volume is the likely cause of this problem? If so, can you elaborate on why?

Thanks!
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I'm not sure multiple periodic snapshot tasks on a dataset are supported by FreeNAS. They weren't a year ago: https://bugs.freenas.org/issues/1646

What I do is use a script by fracai, who maintains it on GitHub here: https://github.com/fracai/zfs-rollup/blob/master/rollup.py

The way to use it is:
(1) Set up a single periodic snapshot task for 'parent': snapshot hourly and keep for 6 months.
(2) Set up a cron job to run rollup.py, and have it trim your snapshots with "-hourly:48 -daily:7 -weekly:24".

That should give you 2 days' worth of hourly snapshots, a week's worth of dailies, and roughly 6 months' worth of weeklies, just like you wanted.

On my system I do the pruning in the 23:00-23:59 window each day, then replicate to a backup server in the 0:00-4:00 window. Works flawlessly.
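As a concrete sketch (the interpreter and script paths are examples, and the options are the ones above; check rollup.py's usage output for the exact flag syntax), the crontab entry for the pruning step would look something like:

# Prune snapshots nightly at 23:30; paths and flags are examples, verify against rollup.py's usage.
30 23 * * * /usr/local/bin/python /root/zfs-rollup/rollup.py -hourly:48 -daily:7 -weekly:24 parent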
 