Replicated data missing after power failure

Status
Not open for further replies.

elangley

Contributor
Joined
Jun 4, 2012
Messages
109
Hello All,

In a testing environment, remote (over WAN) replication is set up between several FreeNAS "PUSH" servers and one "PULL" server.

The PULL server experienced a power failure and restarted. After the restart, the replicated data is damaged to varying degrees:

1) One replicated volume is missing all of its data folders and snapshots.
2) Other replicated volumes are missing pieces of data, but the snapshots are there, meaning that I can at least clone one of them and get to valid data.

Needless to say, this is disconcerting when thinking about using replication in production.

I'd like to find out why this happened.

Your thoughts?

~eric
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm not sure what "various levels of replicated data issues" you are referring to, but:

1. You should have a UPS on your server.
2. If a replication task is interrupted, it should simply cause that snapshot not to replicate to the PULL system, and the next replication cycle will have more than one snapshot to send. Unfortunately, ZFS does not currently support resuming a broken task. So a failed replication task should result in the incomplete snapshot never being committed to the pool, and your PULL system should behave as if the task never happened.

Is #2 not what you experienced? If not, can you elaborate?
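
A rough sketch of the catch-up behavior described in #2, in case it helps frame the question. This is not the FreeNAS replication script; the function name `plan_incremental_send` and the dataset and snapshot names are made up for illustration. It assumes snapshot names are listed oldest to newest and only builds the `zfs send` command line; it does not run anything.

```python
from typing import List, Optional


def plan_incremental_send(dataset: str,
                          push_snaps: List[str],
                          pull_snaps: List[str]) -> Optional[List[str]]:
    """Build a `zfs send` command that catches PULL up with PUSH.

    push_snaps and pull_snaps are snapshot names (the part after '@'),
    ordered oldest to newest. Returns None when there is nothing to send.
    """
    if not push_snaps:
        return None
    latest = push_snaps[-1]
    on_pull = set(pull_snaps)
    # The newest snapshot present on both sides is the incremental base.
    base = next((s for s in reversed(push_snaps) if s in on_pull), None)
    if base == latest:
        return None  # PULL already has the newest snapshot
    if base is None:
        # No common snapshot: fall back to a full send of the newest one.
        return ["zfs", "send", f"{dataset}@{latest}"]
    # -I includes every intermediate snapshot between base and latest,
    # so a snapshot that failed to arrive last cycle is sent this cycle.
    return ["zfs", "send", "-I", f"{dataset}@{base}", f"{dataset}@{latest}"]


# Hypothetical case: the transfer of auto-2 was interrupted, so PULL only has auto-1.
cmd = plan_incremental_send("tank/data", ["auto-1", "auto-2", "auto-3"], ["auto-1"])
print(" ".join(cmd))  # zfs send -I tank/data@auto-1 tank/data@auto-3
```

The point is that PULL still holding auto-1 is what makes the next cycle work. If the previously replicated snapshots are gone as well, something beyond an interrupted transfer happened.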
 

elangley

Contributor
Joined
Jun 4, 2012
Messages
109
This happened over a month ago, but here are my replies:

dlavigne,

Resumable zfs send/receive is still a WIP: https://www.illumos.org/issues/2605. Until that gets implemented, any interrupted replications need to be restarted from the beginning.

I note: I would expect an interrupted replication to need restarting from the beginning of the snapshot, and I would be fine with that. In this case, ALL of the previously replicated data was lost.

cyberjock,

1. You should have a UPS on your server.
2. If a replication task is interrupted, it should simply cause that snapshot not to replicate to the PULL system, and the next replication cycle will have more than one snapshot to send. Unfortunately, ZFS does not currently support resuming a broken task. So a failed replication task should result in the incomplete snapshot never being committed to the pool, and your PULL system should behave as if the task never happened.

Is #2 not what you experienced? If not, can you elaborate?

I note: I do have a UPS; the power failure exceeded its capacity to run the server.

#2 is not what I experienced. As previously noted, ALL of the data was lost, even fully replicated snapshots.

Since then, that test FreeNAS server has been torn down, so there are no logs to analyze. It was what it was: a total loss of replicated data on the PULL server.

~eric
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
OK, so obviously the power failure didn't lead to a graceful shutdown of the PULL server. That's the whole reason the NUT service is included with FreeNAS. When the UPS is on battery, it should shut down the FreeNAS box based on your settings. If your UPS isn't compatible with FreeBSD's NUT, then you need to get a UPS that is. The whole reason for getting a UPS is to allow a graceful shutdown of your FreeNAS box on a loss of power.
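
To make the NUT point concrete, here is a minimal sketch of the shutdown decision. NUT's own monitor does this for you once the service is configured, so this is only an illustration. It assumes the space-separated flags NUT reports in `ups.status` (for example via `upsc <upsname>@localhost ups.status`): OL for on line power, OB for on battery, LB for low battery. The function name `should_shut_down` and its parameter are hypothetical.

```python
def should_shut_down(ups_status: str, shutdown_on_battery: bool = False) -> bool:
    """Decide whether to begin a graceful shutdown from a NUT ups.status string.

    ups_status is the space-separated flag string, e.g. "OL", "OB" or "OB LB".
    shutdown_on_battery=True shuts down as soon as the UPS is on battery;
    otherwise wait until the UPS also reports low battery.
    """
    flags = set(ups_status.split())
    if "OB" not in flags:
        return False  # still on line power
    return shutdown_on_battery or "LB" in flags


print(should_shut_down("OL"))        # False
print(should_shut_down("OB"))        # False
print(should_shut_down("OB LB"))     # True
print(should_shut_down("OB", True))  # True
```

Either policy gets the box down cleanly before the battery is exhausted, which is what didn't happen here.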

So, since you're saying that your pool was basically gobbledygook, can you post the hardware in the PULL server? I can tell you from personal experience that what I described in #2 is exactly how it should work, and it's exactly how it does work. So now the question is what is wrong with your setup that would allow the pool to get so messed up. A loss of power shouldn't cause what you are seeing, so I'm guessing that something was missed. I'll wait for the hardware specs to see what is going on.
 