Recovering after failed replication

Integer
I recently built a new TrueNAS machine with the intent to migrate from an existing TrueNAS. In order to copy the data I created a snapshot manually and recursively on the root dataset of the old machine. Then I created a replication task on the old machine. My intent was to copy all of the datasets recursively and keep the same dataset names, structure, and configuration on the new machine. For this task I used the root dataset as the source, `manual-%Y-%m-%d_%H-%M` as the snapshot naming schema, and checked both "Synchronize Destination Snapshots With Source" and "(Almost) Full Filesystem Replication". Everything seemed to be going alright until, after several days, the new machine suffered a power outage and the replication task running from the old machine failed.
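(For reference, the CLI equivalent of that manual recursive snapshot would be roughly something like the following, using the pool and snapshot name that show up in the error later on; yours will obviously differ.)

```sh
# Recursive snapshot of the root dataset and all children on the old machine
zfs snapshot -r storage@manual-2023-12-30_22-45
```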

One odd thing I noticed: I had been monitoring the used space on the new machine's pool as a rough progress indicator, and after booting it back up I saw it had made it about 60% of the way through the replication. But then, refreshing the pools page repeatedly, I kept seeing the used space drop, eventually settling at about 50% of the total size of the source. I don't understand what was happening there.

Anyway, what I'd really like to do now is resume the replication, have it pick up where it left off, and hopefully finish successfully. I was really afraid of the "Synchronize Destination Snapshots With Source" option deleting the progress I had made, so I unchecked it and reran the task. This failed with the error "Last full ZFS replication failed to transfer snapshot". It says "Please run `zfs destroy storage@manual-2023-12-30_22-45` on the target system and run replication again.", which I'm also afraid to do because I don't fully understand the implications in this case and don't want to lose all the data on the new pool or inadvertently re-copy the same bytes.
 

sretalla (Moderator)
> But then, refreshing the pools page repeatedly, I kept seeing the used space drop, eventually settling at about 50% of the total size of the source. I don't understand what was happening there.
Check progress by free space, not used, since snapshots also contain deletions (and the things that were deleted).
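For example, on the destination (pool name assumed):

```sh
# Pool-level allocated vs. free space
zpool list storage

# Per-dataset breakdown, including space held by snapshots
zfs list -o space -r storage
```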

> What I'd really like to do now is resume the replication, have it pick up where it left off, and hopefully finish successfully. I was really afraid of the "Synchronize Destination Snapshots With Source" option deleting the progress I had made, so I unchecked it and reran the task. This failed with the error "Last full ZFS replication failed to transfer snapshot". It says "Please run `zfs destroy storage@manual-2023-12-30_22-45` on the target system and run replication again.", which I'm also afraid to do because I don't fully understand the implications in this case and don't want to lose all the data on the new pool or inadvertently re-copy the same bytes.
Replication works in units of snapshots, nothing below that, so a "partially transferred snapshot" like the one you currently have is considered incomplete and not useful; you have to eliminate it on the target side for it to be copied again.

You mentioned multiple child datasets, so perhaps some of them have completed and you only need to destroy the snapshot on the incomplete one...

You can start with `zfs list -t snap` on both sides and see what's there or not.
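Something along these lines (the child dataset name is just an example):

```sh
# On each side, list what actually landed
zfs list -t snap -r storage

# If a single child dataset turns out to be the incomplete one, clearing just
# that snapshot on the TARGET avoids re-sending the children that finished
zfs destroy storage/some-child@manual-2023-12-30_22-45
```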
 

Integer

Dabbler
Joined
Mar 19, 2018
Messages
11
I deleted the snapshot listed in the error, which was the snapshot on the root dataset, then reran the job. It did exactly what I was worried about: all the datasets on the new machine except the root are now gone (used space has dropped to zero and free space is effectively the whole pool's size), and it appears to be replicating the whole thing from the beginning again. This isn't data loss, since it's all still on the old machine, but it's several days of lost time. Is there any way to recover now?

And if there isn't a way to recover, is there a better way to copy all my data than this? I'd hate to have it completely fail due to an interruption midway through again.
 

Integer
Just want to say I found a way to recover a bit. Since the dataset I was replicating had child datasets, I was able to mount the ones that had finished replicating. Then, instead of resuming the original replication, I created new replication jobs for the datasets that hadn't started or had been interrupted. I had to create the destination datasets manually so there would be a correct destination available. Maybe there was a cleaner way, but I didn't find it. I also found that ssh+netcat is way faster than plain ssh, due to a mix of the protocol and not compressing (you can disable compression in ssh mode as well, but I was using pigz initially).
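Roughly, the per-dataset prep and the transport difference look like this (dataset names, host, and port here are placeholders for illustration, not my exact setup):

```sh
# Create an empty destination dataset so a new per-dataset replication job
# has somewhere to land
zfs create storage/photos

# Hand-rolled picture of why ssh+netcat is faster: with plain ssh the whole
# stream is encrypted (and, in my case, compressed with pigz) on the way over
zfs send -R storage/photos@manual-2023-12-30_22-45 | pigz | ssh new-machine 'pigz -d | zfs recv -F storage/photos'

# With the netcat-style transport the bulk data travels over a plain TCP
# connection instead, avoiding that per-byte overhead:
#   on the new machine:  nc -l 8023 | zfs recv -F storage/photos
#   on the old machine:  zfs send -R storage/photos@manual-2023-12-30_22-45 | nc new-machine 8023
```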
 