replication issues with first dataset after adding second dataset

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Hi all,
I've got a pair of identical FreeNAS servers running 11.1-U2, with one acting as the primary and the other as a replication target for snapshots. For the past few months, replication has been failing for the original dataset (which started with a 40TB quota that has since been increased to 57TB). Around the time of the first quota increase on the first dataset, a second dataset was created (which started with a 10TB quota, since increased to 20TB). I suspect one or both of these actions (the quota increases and/or the creation of the second dataset) is to blame for the original dataset failing to replicate successfully. During replication attempts, I see alerts for the storage on the target volume rising steadily until the replication attempt finally fails.

I'm hoping someone can help validate this theory, or point me toward the true culprit and a way to overcome the issue. Attached are reporting graphs from both servers. I'm not sure if it's relevant, but the target server shows a separate graph for the second dataset and not the original one. There could be expected reasons for this that make it a red herring (e.g. the original dataset has failed to replicate for long enough that none of its snapshots are present on the target volume), but the dataset itself does show up in the storage view, so I figured I'd mention it in case it's helpful. Thanks in advance.
 

Attachments

  • 01.primary-RAIDZ3-01.png (14.9 KB)
  • 02.primary-RAIDZ3-01-original-dataset.png (23.2 KB)
  • 03.primary-RAIDZ3-01-second-dataset.png (20.9 KB)
  • 04.replication-target-RAIDZ3-01.png (15.1 KB)
  • 05.repllication-target-RAIDZ3-01-replication-target.png (16.3 KB)
  • 06.replication-target-RAIDZ3-01-replication-dataset2.png (24.3 KB)
Joined
Jan 18, 2017
Messages
525
To me it looks like you've run out of space. Can you post the output of the following in code tags?
Code:
zfs list -o name,quota,refquota,reservation,refreservation
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
I was thinking this as well based on the graphs; I just haven't been able to wrap my head around why it would be running out of space, since storage on both hosts is supposed to be identical. The output for both hosts is below.

Primary:
Code:
NAME QUOTA REFQUOTA RESERV REFRESERV
RAIDZ3-01 none none none none
RAIDZ3-01/.system none none none none
RAIDZ3-01/.system/configs-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/cores none none none none
RAIDZ3-01/.system/rrd-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/samba4 none none none none
RAIDZ3-01/.system/syslog-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/NewDataset none 20T none none
RAIDZ3-01/OrigDataset none 57T none none
RAIDZ3-01/jails none none none none
freenas-boot none none none none
freenas-boot/ROOT none none none none
freenas-boot/ROOT/Initial-Install none none none none
freenas-boot/ROOT/default none none none none
freenas-boot/ROOT/default-20180227-022915 none none none none
freenas-boot/grub none none none none

Replication target:
Code:
NAME QUOTA REFQUOTA RESERV REFRESERV
RAIDZ3-01 none none none none
RAIDZ3-01/.system none none none none
RAIDZ3-01/.system/configs-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/cores none none none none
RAIDZ3-01/.system/rrd-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/.system/samba4 none none none none
RAIDZ3-01/.system/syslog-7c35bc62b22f460fb3766e1c156d5c44 none none none none
RAIDZ3-01/Replication_Target01 none none none none
RAIDZ3-01/Replication_Target01/NewDataset none 20T none none
RAIDZ3-01/Replication_Target01/OrigDataset none 55T none none
RAIDZ3-01/jails none none none none
RAIDZ3-01/jails/.warden-template-pluginjail-10.3-x64 none none none none
freenas-boot none none none none
freenas-boot/ROOT none none none none
freenas-boot/ROOT/Initial-Install none none none none
freenas-boot/ROOT/default none none none none
freenas-boot/ROOT/default-20180226-020909 none none none none
freenas-boot/grub none none none none
 
Joined
Jan 18, 2017
Messages
525
I have had a replication do this when an offsite/offline copy got too far out of sync and I had to recopy everything. I compared the snapshot dates and there were zero matches, so I cleared the destination pool and started over. As for what caused this, I'm not experienced enough to comment; hopefully someone better versed chimes in.
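A quick way to do that comparison is to dump the snapshot names on each side and intersect them. This is only a sketch, assuming the dataset paths posted earlier in this thread (adjust to your layout):

```shell
# On the primary: list snapshot names (the part after '@'), sorted.
zfs list -H -t snapshot -o name -r RAIDZ3-01/OrigDataset \
    | sed 's/.*@//' | sort > /tmp/src_snaps.txt

# On the replication target (note the different dataset path):
zfs list -H -t snapshot -o name -r RAIDZ3-01/Replication_Target01/OrigDataset \
    | sed 's/.*@//' | sort > /tmp/dst_snaps.txt

# Snapshots present on BOTH sides -- candidates for an incremental base.
# No output at all means the two sides have fully diverged.
comm -12 /tmp/src_snaps.txt /tmp/dst_snaps.txt
```

Since the two lists live on different hosts, you'd copy one of the files across (e.g. with scp) before running the final `comm`.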
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Can you elaborate on how you cleared the destination pool to start over? I have previously tried something similar (or perhaps even identical) by wiping out all the snapshots for the original dataset on the replication target server, in the hope that would free up enough resources to allow it to complete a fresh backup, but that doesn't seem to have worked. I've also considered wiping out the Replication_Target01 dataset and re-creating it, but I've been holding on to that as a last resort, since it would also have to re-replicate all of the data in NewDataset, which is currently succeeding and up to date.
 
Joined
Jan 18, 2017
Messages
525
I deleted the datasets, and when I enabled the replication again it recreated them and transferred all the data again. AFAIK, once it gets out of sync to that extent it has to transfer the new snapshots and all the data again, unlike rsync, which will compare the data in the datasets; I could be mistaken, however. To make sure I don't have to do that again, I simply increased the length of time it keeps the snapshots.
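For reference, the destructive step described above boils down to a recursive destroy on the target side. A minimal sketch, assuming the target-side dataset name from this thread; `zfs destroy` supports a dry run (`-n` together with `-v`) so you can review what would go before committing:

```shell
# Hypothetical target-side dataset name from this thread; adjust as needed.
TARGET_DS="RAIDZ3-01/Replication_Target01/OrigDataset"

# Dry run: -n does nothing, -v lists what would be destroyed,
# -r recurses into the dataset's snapshots and children.
zfs destroy -rnv "$TARGET_DS"

# Once the dry-run output looks right, destroy for real (DESTRUCTIVE):
zfs destroy -r "$TARGET_DS"
```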
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Maybe I'll try that after the current replication attempt finishes failing, which should be pretty soon, as I can see the replication target is almost out of space. Now that I've caught it near the end of the process, I can definitely see that running out of space is the issue. I still don't get why the OriginalDataset ends up being so much larger on the target than it is on the source. Any ideas what would cause that?

Primary:
Code:
NAME USED AVAIL REFER MOUNTPOINT
RAIDZ3-01/OriginalDataSet 55.2T 1.87T 55.1T /mnt/RAIDZ3-01/OriginalDataSet

Replication Target:
Code:
NAME USED AVAIL REFER MOUNTPOINT
RAIDZ3-01/Replication_Target01/OriginalDataSet 72.2T 553G 38.4T /mnt/RAIDZ3-01/Replication_Target01/OriginalDataSet
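One way to see where the extra space on the target is going is ZFS's space-accounting properties: USED covers the live data (roughly REFER) plus space held only by snapshots and children. In the output above, 72.2T used minus 38.4T referenced leaves ~33.8T pinned by something other than the live data. A sketch for digging in, assuming the dataset name shown above:

```shell
# Break USED down into live data vs. snapshot-only vs. child space:
zfs get usedbydataset,usedbysnapshots,usedbychildren \
    RAIDZ3-01/Replication_Target01/OriginalDataSet

# Per-snapshot consumption, sorted so the biggest space holders are last:
zfs list -t snapshot -o name,used,refer -s used \
    -r RAIDZ3-01/Replication_Target01/OriginalDataSet
```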
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Still waiting for it to finish failing ... down to ~30GB free on the replication target. If anyone has ideas why it would appear to require only ~55TB to replicate based on the source dataset, but instead seems to require upwards of 73TB (the primary host claims it's currently at 70% complete), and how to address that, please let me know.
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
Now that it's failed (and I had disabled replication for that dataset so it wouldn't retry), I'm noticing something else that seems weird and potentially relevant. The error message displayed on the primary is:

Failed: RAIDZ3-01/OriginalDataset (auto-20190406.0000-5w)

However, that snapshot no longer exists on the primary or the replication target. After I destroyed the OriginalDataset on the replication target system and restarted the replication, it is now trying to replicate auto-20190427.0000-5w, which seems much more sane, as that is the oldest snapshot that exists on the primary system. Fingers crossed that this does the trick.
 
Joined
Jan 18, 2017
Messages
525
If this replication fails, we will REALLY need someone more knowledgeable to look at this for you. After you destroyed the dataset, did it begin freeing space on the pool?
 

FFF

Dabbler
Joined
Mar 10, 2017
Messages
28
It did start freeing up space in the pool. It's up to 50TB and still increasing. I suspect this one will succeed, though it would still be nice to know how it got into this state in case it's a bug that needs to be addressed. However, I think your point about needing someone more knowledgeable applies in that case as well.
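For what it's worth, ZFS reclaims the space from a destroyed dataset asynchronously (the `async_destroy` pool feature), so free space creeping up over time after the destroy is expected. If you want to watch the reclaim finish, the pool's `freeing` property shows how much is still pending; a sketch, assuming the pool name from this thread:

```shell
# Space still waiting to be reclaimed from destroyed datasets/snapshots;
# this drops to 0 once the background destroy has finished.
zpool get freeing RAIDZ3-01

# Overall pool usage while it catches up:
zpool list -o name,size,alloc,free RAIDZ3-01
```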
 