Snapshots are not replicated


kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
I have set up two FreeNAS systems. One is the system users interact with; it has one volume and several datasets. Every volume/dataset has its own (non-recursive) periodic snapshot task. I have one recursive replication task defined for the volume, which replicates it to the second (backup) system.
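
To sketch the setup (the pool and dataset names here are just placeholders, not my real ones):

Code:
tank              <- the volume: one recursive replication task to the backup system
tank/projects     <- dataset: its own non-recursive periodic snapshot task
tank/home         <- dataset: its own non-recursive periodic snapshot task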

I recently added a new dataset to the first system. Unfortunately, this dataset was not replicated to the second system. Only after I manually created some files on the volume did the dataset appear on the second system. However, the dataset on the second system is empty.

Snapshots of the new dataset are created automatically on the first system, and autorepl.py seems to initiate replication jobs for the volume. But the snapshots of the dataset do not seem to propagate to the second system, and I cannot find any problems in the log file (/var/log/messages).
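
For what it's worth, this is roughly how I have been checking (standard zfs commands; "tank/newdataset" and "backuppool" are placeholders):

Code:
# on the main system: the automatic snapshots of the new dataset are there
zfs list -t snapshot -o name,creation -r tank/newdataset
# on the backup system: the corresponding dataset shows no snapshots at all
zfs list -t snapshot -o name,creation -r backuppool/newdataset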

Does anyone have any idea what happened, why it happened, and how I can fix it?
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
Correction: the dataset did not show up on the second system at all, not even "after some manual creation of files on the volume". Instead, an (empty) directory showed up, but the dataset that should be mounted there did not. So I manually added the new dataset to the backup system. However, that dataset was removed from the backup system as soon as the next replication took place.
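
One guess about the removal: if the replication receives an incremental recursive stream with zfs receive -F, then according to the zfs(8) man page, snapshots and file systems on the receiving side that do not exist on the sending side are destroyed. So anything I create by hand on the backup pool would be wiped on the next run. A sketch of what I suspect happens (not necessarily the exact command FreeNAS runs; names are placeholders):

Code:
# an incremental recursive stream received with -F destroys datasets on the
# target that do not exist on the source -- including my hand-made dataset
zfs send -R -i tank@previoussnap tank@newsnap | ssh backup zfs receive -F -d backuppool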

I guess I did not follow the instructions correctly somewhere (which makes sense, because I don't have any). I'd appreciate any suggestions!
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
New update: it turns out that the other datasets are not being replicated either; they are stuck at snapshots from a few months ago. Note that I don't get any error messages: I have received no email (except for the nightly status emails), and I can't find anything in the logs either.

Of course, I can just reinitialize the replication. But without knowing what happened, I am worried that I will run into the same problem in the future, and that I'll only realize I'm months behind in replicating snapshots at the moment I really need the backup...
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
I keep talking to myself: I enabled debugging in the autorepl script (now called autorepl2.py on my system), enabled the debug.log file in syslogd, and watched it for a while. I then saw:

Code:
Mar 20 16:10:01 freenas alert.py: [middleware.notifier:167] Popen()ing: zpool list -H -o health Internal
Mar 20 16:10:01 freenas autosnap.py: [tools.autosnap:42] Popen()ing: /sbin/zfs list -t snapshot -H
Mar 20 16:10:01 freenas alert.py: [middleware.notifier:3869] sysctlbyname: kern.geom.confxml
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:111] Autosnap replication started
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:112] temp log file: /tmp/repl-12782
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:159] Checking dataset Internal
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:185] Snapshot: Internal@auto-20130320.1610-1y State: NEW
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:192] Snapshot Internal@auto-20130320.1610-1y added to wanted list
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:185] Snapshot: Internal@auto-20130320.1534-1y State: LATEST
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:189] Snapshot Internal@auto-20130320.1534-1y is the recorded latest snapshot
Mar 20 16:10:03 freenas autorepl2.py: [tools.autorepl:221] Found matching latest snapshot Internal@auto-20130320.1534-1y remotely
Mar 20 16:10:07 freenas autorepl2.py: [tools.autorepl:285] Replication result:
    WARNING: could not send Internal/Clinical@auto-20130320.1610-1y: does not exist
    WARNING: could not send Internal/Clinical/2008-02_Patch@auto-20130320.1610-1y: does not exist
    WARNING: could not send Internal/Clinical/2008-02_AMD@auto-20130320.1610-1y: does not exist
    WARNING: could not send Internal/HWexp@auto-20130320.1610-1y: does not exist
    WARNING: could not send Internal/HWexp/Doppler@auto-20130320.1610-1y: does not exist
    WARNING: could not send Internal/HWexp/TSLO@auto-20130320.1610-1y: does not exist
    Succeeded.

I can confirm that the snapshots mentioned in those warnings do not exist. As I said, my snapshot tasks are set up for every dataset individually. My guess is now that recursive replication assumes the snapshots are also taken recursively. Is this correct? Did I miss this in the documentation?
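
To illustrate my guess (a sketch; the snapshot name is taken from the log above, the receive side is a placeholder):

Code:
# the replication does a recursive send of the volume's snapshot:
zfs send -R Internal@auto-20130320.1610-1y | ssh backup zfs receive -F -d backuppool
# -R expects a snapshot with the SAME name on every child dataset; my
# per-dataset snapshot tasks fire independently, so for example
# Internal/Clinical@auto-20130320.1610-1y was never created, hence the
# "could not send Internal/Clinical@auto-20130320.1610-1y: does not exist" warning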
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm posting just so you know people are reading this. Unfortunately, my only advice is to read and follow the manual's section on snapshots and replication. Setting up replication and snapshots is not trivial, and there are plenty of places where a misconfigured setting can cause the same error, so trying to provide help via the forum is quite difficult.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
Thanks for letting me know I'm not all alone here! There are at least two problems, I guess: 1. Replication doesn't seem to work. 2. I wasn't aware of that until recently, because FreeNAS didn't report it.

Anyway, I checked the manual again, but I couldn't find any part that addresses this. The manual just goes over the options, gives an example of how to set things up for a single dataset, and that's it. However, it seems that the replication process, when the recursive option is selected, assumes that all snapshots were created at the same time. That is obviously true for recursive snapshots, but not for normal ones. I'm not sure whether this is a FreeNAS or a ZFS limitation (or, for that matter, whether my analysis is actually correct).
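
To make the difference concrete (sketch; the snapshot names are made up):

Code:
# a recursive snapshot is atomic and uses the same name on every child:
zfs snapshot -r Internal@auto-20130325.1200-1y
# -> creates Internal@auto-20130325.1200-1y,
#    Internal/Clinical@auto-20130325.1200-1y,
#    Internal/HWexp@auto-20130325.1200-1y, etc., all at once

# my per-dataset tasks run independently, so the names can differ:
zfs snapshot Internal@auto-20130325.1200-1y
zfs snapshot Internal/Clinical@auto-20130325.1202-1y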

I'm doing some more debugging from within the autorepl script, to find the actual zfs send command that produces this error.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
It seems that it is indeed the zfs send command that simply looks for snapshots with the same name (timestamp) on all child datasets. In my case, those don't exist, because the snapshots are not taken recursively, so their timestamps (and therefore snapshot names) can differ.

I considered two options: creating recursive snapshots with recursive replication, or creating separate snapshots with separate replication tasks. I opted for the second one; it gives me more flexibility, although I have to specify a snapshot task and a replication task for every dataset.
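
As far as I understand it, each separate task then boils down to a plain non-recursive incremental send per dataset, something like this (sketch; the host, target pool, and snapshot names are placeholders):

Code:
# one non-recursive, incremental send per dataset
zfs send -i Internal/Clinical@auto-20130319.1610-1y \
    Internal/Clinical@auto-20130320.1610-1y \
    | ssh backuphost zfs receive -F backuppool/Clinical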

Problem solved? Not really: this doesn't seem to be documented very well, and I'm still not sure whether I'm taking the right approach now. But if I don't encounter new problems over the next few days (it's currently replicating a few hundred snapshots), I'll feel comfortable with it.
 

kavermeer

Explorer
Joined
Oct 10, 2012
Messages
59
So far so good: setting up separate replication tasks for all datasets seems to have fixed my problems. All snapshots are sent to the backup server as expected.

One disadvantage of this is that snapshots deleted from the main system will still be present on the backup system. I realize that in some cases that may be a benefit, but not for me: I'd rather keep a subset of the snapshots on the backup system (e.g., hourly snapshots on the main system, but only daily snapshots on the backup system).
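
If I ever want to thin out the backup side myself, I suppose I could destroy individual snapshots there by hand (untested; names are placeholders):

Code:
# on the backup system: drop an hourly snapshot the main system no longer has
zfs destroy backuppool/Clinical@auto-20130320.1000-1y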
 