replication issues with first dataset after adding second dataset

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Hi all,
I've got a pair of identical FreeNAS servers running 11.1-U2, with one acting as the primary and the other as a replication target for snapshots. For the past few months, replication has been failing for the original dataset (which started with a 40TB quota that has since been increased to 57TB). Around the time of the first quota increase on the first dataset, a second dataset was created (which started with a 10TB quota, since increased to 20TB). I suspect one or both of these actions (the quota increases and/or the creation of the second dataset) is to blame for the original dataset failing to replicate successfully. During replication attempts, I see alerts for the storage on the target volume rising steadily until the replication attempt finally fails.

I'm hoping someone can help validate this theory, or point me toward the true culprit and a way to overcome the issue. Attached are reporting graphs from both servers. I'm not sure if it's relevant, but the target server shows a separate graph for the second dataset and not the original one. There could be expected reasons for this that make it a red herring (e.g. the original dataset has failed to replicate for long enough that none of its snapshots are present on the target volume), but the dataset itself does show up in the storage view, so I figured I'd mention it in case it's helpful. Thanks in advance.
 

Attachments

  • 01.primary-RAIDZ3-01.png (14.9 KB)
  • 02.primary-RAIDZ3-01-original-dataset.png (23.2 KB)
  • 03.primary-RAIDZ3-01-second-dataset.png (20.9 KB)
  • 04.replication-target-RAIDZ3-01.png (15.1 KB)
  • 05.repllication-target-RAIDZ3-01-replication-target.png (16.3 KB)
  • 06.replication-target-RAIDZ3-01-replication-dataset2.png (24.3 KB)
Joined
Jan 18, 2017
Messages
525
To me it looks like you've run out of space. Can you post the output of the following in code tags?
Code:
zfs list -o name,quota,refquota,reservation,refreservation
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
I was thinking this as well based on the graphs; I just haven't been able to wrap my head around why it would be running out of space, since storage on both hosts is supposed to be identical. The output for both hosts is below.

Primary:
Code:
NAME QUOTA REFQUOTA RESERV REFRESERV
RAIDZ3-01 none none none none
RAIDZ3-01/.system none none none none
RAIDZ3-01/.system/configs-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/cores none none none none
RAIDZ3-01/.system/rrd-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/samba4 none none none none
RAIDZ3-01/.system/syslog-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/NewDataset none 20T none none
RAIDZ3-01/OrigDataset none 57T none none
RAIDZ3-01/jails none none none none
freenas-boot none none none none
freenas-boot/ROOT none none none none
freenas-boot/ROOT/Initial-Install none none none none
freenas-boot/ROOT/default none none none none
freenas-boot/ROOT/default-20180227-022915 none none none none
freenas-boot/grub none none none none

Replication target:
Code:
NAME QUOTA REFQUOTA RESERV REFRESERV
RAIDZ3-01 none none none none
RAIDZ3-01/.system none none none none
RAIDZ3-01/.system/configs-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/cores none none none none
RAIDZ3-01/.system/rrd-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/samba4 none none none none
RAIDZ3-01/.system/syslog-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/Replication_Target01 none none none none
RAIDZ3-01/Replication_Target01/NewDataset none 20T none none
RAIDZ3-01/Replication_Target01/OrigDataset none 55T none none
RAIDZ3-01/jails none none none none
RAIDZ3-01/jails/.warden-template-pluginjail-10.3-x64 none none none none
freenas-boot none none none none
freenas-boot/ROOT none none none none
freenas-boot/ROOT/Initial-Install none none none none
freenas-boot/ROOT/default none none none none
freenas-boot/ROOT/default-20180226-020909 none none none none
freenas-boot/grub none none none none
 
Joined
Jan 18, 2017
Messages
525
I have had a replication do this when an offsite/offline copy got too far out of sync and I had to recopy everything. I compared the snapshot dates and there were zero matches, so I cleared the destination pool and started over. As for what caused this, I'm not experienced enough to comment; hopefully someone better versed chimes in.
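A quick way to do that comparison is to dump the snapshot names on each side and intersect them. This is only a sketch, assuming the dataset paths posted earlier in this thread (adjust to your layout):

```shell
# On the primary: list snapshot names (the part after '@'), sorted.
zfs list -H -t snapshot -o name -r RAIDZ3-01/OrigDataset \
    | sed 's/.*@//' | sort > /tmp/src_snaps.txt

# On the replication target (note the different dataset path):
zfs list -H -t snapshot -o name -r RAIDZ3-01/Replication_Target01/OrigDataset \
    | sed 's/.*@//' | sort > /tmp/dst_snaps.txt

# Snapshots present on BOTH sides -- candidates for an incremental base.
# No output at all means the two sides have fully diverged.
comm -12 /tmp/src_snaps.txt /tmp/dst_snaps.txt
```

Since the two lists live on different hosts, you'd copy one of the files across (e.g. with scp) before running the final `comm`.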
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Can you elaborate on how you cleared the destination pool to start over? I have previously tried something similar (or perhaps even identical) by wiping out all the snapshots for the original dataset on the replication target server, in the hope that would free up enough resources to allow it to complete a fresh backup, but that doesn't seem to have worked. I've also considered wiping out the Replication_Target01 dataset and re-creating it, but I've been holding on to that as a last resort, since it would also have to re-replicate all of the data in NewDataset, which is currently succeeding and up to date.
 
Joined
Jan 18, 2017
Messages
525
I deleted the datasets, and when I enabled the replication again it recreated them and transferred all the data again. AFAIK, once it gets out of sync to that extent it has to transfer the new snapshots and all the data again, unlike rsync, which will compare the data in the datasets; I could be mistaken, however. To make sure I don't have to do that again, I simply increased the length of time it keeps the snapshots.
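For reference, the destructive step described above boils down to a recursive destroy on the target side. A minimal sketch, assuming the target-side dataset name from this thread; `zfs destroy` supports a dry run (`-n` together with `-v`) so you can review what would go before committing:

```shell
# Hypothetical target-side dataset name from this thread; adjust as needed.
TARGET_DS="RAIDZ3-01/Replication_Target01/OrigDataset"

# Dry run: -n does nothing, -v lists what would be destroyed,
# -r recurses into the dataset's snapshots and children.
zfs destroy -rnv "$TARGET_DS"

# Once the dry-run output looks right, destroy for real (DESTRUCTIVE):
zfs destroy -r "$TARGET_DS"
```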
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Maybe I'll try that after the current replication attempt finishes failing, which should be pretty soon, as I can see the replication target is almost out of space. Now that I've caught it near the end of the process, I can definitely see that running out of space is the issue. I still don't get why the OriginalDataset ends up being so much larger on the target than it is on the source. Any ideas what would cause that?

Primary:
Code:
NAME USED AVAIL REFER MOUNTPOINT
RAIDZ3-01/OriginalDataSet 55.2T 1.87T 55.1T /mnt/RAIDZ3-01/OriginalDataSet

Replication Target:
Code:
NAME USED AVAIL REFER MOUNTPOINT
RAIDZ3-01/Replication_Target01/OriginalDataSet 72.2T 553G 38.4T /mnt/RAIDZ3-01/Replication_Target01/OriginalDataSet
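One way to see where the extra space on the target is going is ZFS's space-accounting properties: USED covers the live data (roughly REFER) plus space held only by snapshots and children. In the output above, 72.2T used minus 38.4T referenced leaves ~33.8T pinned by something other than the live data. A sketch for digging in, assuming the dataset name shown above:

```shell
# Break USED down into live data vs. snapshot-only vs. child space:
zfs get usedbydataset,usedbysnapshots,usedbychildren \
    RAIDZ3-01/Replication_Target01/OriginalDataSet

# Per-snapshot consumption, sorted so the biggest space holders are last:
zfs list -t snapshot -o name,used,refer -s used \
    -r RAIDZ3-01/Replication_Target01/OriginalDataSet
```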
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Still waiting for it to finish failing ... down to ~30GB free on the replication target. If anyone has ideas why it would appear to require only ~55TB to replicate based on the source dataset, but instead seems to require upwards of 73TB (the primary host claims it's currently at 70% complete), and how to address that, please let me know.
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Now that it's failed (and I had disabled replication for that dataset so it wouldn't retry), I'm noticing something else that seems weird and potentially relevant. The error message displayed on the primary is:

Failed: RAIDZ3-01/OriginalDataset (auto-20190406.0000-5w)

However, that snapshot no longer exists on the primary or the replication target. After I destroyed the OriginalDataset on the replication target system and restarted the replication, it is now trying to replicate auto-20190427.0000-5w, which seems much more sane, as that is the oldest snapshot that exists on the primary system. Fingers crossed that this does the trick.
 
Joined
Jan 18, 2017
Messages
525
If this replication fails, we will REALLY need someone more knowledgeable to look at this for you. After you destroyed the dataset, did it begin freeing space on the pool?
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
It did start freeing up space in the pool. It's up to 50TB and still increasing. I suspect this one will succeed, though it would still be nice to know how it got into this state in case it's a bug that needs to be addressed. However, I think your point about needing someone more knowledgeable applies in that case as well.
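For what it's worth, ZFS reclaims the space from a destroyed dataset asynchronously (the `async_destroy` pool feature), so free space creeping up over time after the destroy is expected. If you want to watch the reclaim finish, the pool's `freeing` property shows how much is still pending; a sketch, assuming the pool name from this thread:

```shell
# Space still waiting to be reclaimed from destroyed datasets/snapshots;
# this drops to 0 once the background destroy has finished.
zpool get freeing RAIDZ3-01

# Overall pool usage while it catches up:
zpool list -o name,size,alloc,free RAIDZ3-01
```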
 