Replication Jobs Suddenly Started Failing After Working Flawlessly for Months

swarren

Cadet
Joined
Jun 19, 2019
Messages
6
Hello everyone,

The company I work for recently took on a new client that uses FreeNAS for their shared company data. I've never used FreeNAS, so I'm a complete noob and don't really know where to start troubleshooting the replication failures they've suddenly started having. I can't locate any related logs that might point to a specific cause, and I searched the forums but didn't find anything I could verify applies to this situation. I'd appreciate any information that might point me in the right direction.

Here is the information I have:

Server 1:
Build FreeNAS-11.0-U4 (54848d13b)
Platform Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
Memory 8149MB
System Time Wed Jun 19 09:07:11 CDT 2019
Uptime 9:07AM up 12:39, 0 users
Load Average 1.23, 1.17, 1.10

Server 2:
Build FreeNAS-11.1-U2
Platform Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
Memory 8149MB
System Time Wed, 19 Jun 2019 09:07:04 -0500
Uptime 9:07AM up 1 day, 19:21, 0 users
Load Average 0.51, 0.31, 0.26

Here is the alert email I receive:
"Hello,
The replication failed for the local ZFS Files/Development while attempting to
apply incremental send of snapshot auto-20190614.1400-2w -> auto-20190614.1500-2w to 192.168.254.47"

"Hello,
The replication failed for the local ZFS Files/Public while attempting to
apply incremental send of snapshot auto-20190614.1400-2w -> auto-20190614.1430-2w to 192.168.254.47"

Screenshots: ALERT.PNG · Storage1.PNG · Storage2.PNG · Storage3.PNG


Server 1:
filesystem.PNG


Server 2:
filesystem2.PNG
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
You should look at debug.log under /var/log/ on the FreeNAS system running the replication.
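Something like this will pull out the relevant lines. (The sample log lines below are invented just to show the filter; on your box, run the grep against /var/log/debug.log itself, and expect a different message format.)

```shell
# Filter replication-related lines out of the log. Simulated input here;
# in practice run:
#   grep -i 'replication' /var/log/debug.log | tail -n 50
printf '%s\n' \
  'Jun 19 09:00:01 freenas autorepl.py: Replication of Files/Public failed' \
  'Jun 19 09:00:02 freenas ntpd: time reset +0.1s' \
  | grep -i 'replication'
```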
 
Joined
Jul 3, 2015
Messages
926
'Not ran since boot' implies the primary system has been rebooted recently, does that sound right?

As the amount of data in Files/Development and Files/Public is so small I'd be tempted to remove all related snapshots to these datasets on the secondary system and then just let replication start afresh and sort itself out (which it's very good at doing).
 
Joined
Jul 3, 2015
Messages
926
Uptime 9:07AM up 1 day, 19:21, 0 users
The secondary also hasn't been up that long. Perhaps you were rebooting the servers to troubleshoot?
 
Joined
Jul 3, 2015
Messages
926
Ok, I just wanted to rule out any random reboots.

Hard to tell what the exact cause is, but as I said above, with it being such a small amount of data I'd just manually purge the snaps for those two datasets on the secondary system, and then your primary will send them all over again at the next snap interval.
 

swarren

Cadet
Joined
Jun 19, 2019
Messages
6
'Not ran since boot' implies the primary system has been rebooted recently, does that sound right?

As the amount of data in Files/Development and Files/Public is so small I'd be tempted to remove all related snapshots to these datasets on the secondary system and then just let replication start afresh and sort itself out (which it's very good at doing).

Thanks Johnny Fartpants. What is the best way to go about this? Destroy the entire Files/Public and Files/Development datasets? Or destroy only the snapshots?
 

swarren

Cadet
Joined
Jun 19, 2019
Messages
6
You should look at debug.log under /var/log/ on the FreeNAS system running the replication.

Looks like "destination already exists" is the error. What could be the cause of that? Any way to avoid it happening again in the future?
 

Attachments

  • debug.zip
    38.3 KB
Joined
Jul 3, 2015
Messages
926
Thanks Johnny Fartpants. What is the best way to go about this? Destroy the entire Files/Public and Files/Development datasets? Or destroy only the snapshots?
Jump onto the secondary box and under the snapshots tab you can filter based on name. Delete all the snapshots relating to Development and Public. You shouldn't have to delete the datasets.
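If you'd rather do it from the shell, here's a sketch of the pattern. The pool name "tank" and the snapshot names are assumptions based on your alert emails; on the real box the list comes from `zfs list`, and the simulated input below just shows how to build the destroy commands for review before running them.

```shell
# On the secondary: list snapshots for the two datasets and turn each
# name into a destroy command. Simulated input here; in practice:
#   zfs list -H -t snapshot -o name -r tank/Files/Development tank/Files/Public
printf '%s\n' \
  'tank/Files/Development@auto-20190614.1400-2w' \
  'tank/Files/Public@auto-20190614.1430-2w' \
  | sed 's/^/zfs destroy /'
```

Once the printed commands look right, run them by hand or pipe them to `sh`. Note that `zfs destroy` on a `pool/dataset@snapshot` name removes only the snapshot, not the dataset itself.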
 
Joined
Jul 3, 2015
Messages
926
Looks like "destination already exists" is the error. What could be the cause of that? Any way to avoid it happening again in the future?
It'd be worth checking whether the secondary already has a snapshot with that name/date/time, as the error suggests it's trying to send a snapshot that already exists.
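A quick way to check from the shell, sketched here against a hypothetical listing (pool name "tank" assumed; the real list comes from `zfs list -H -t snapshot -o name -r tank/Files/Development` on the secondary):

```shell
# If this grep prints a line, the snapshot already exists on the secondary,
# and the incremental receive will refuse to overwrite it.
printf '%s\n' \
  'tank/Files/Development@auto-20190614.1400-2w' \
  'tank/Files/Development@auto-20190614.1500-2w' \
  | grep '@auto-20190614.1500-2w'
```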
 

swarren

Cadet
Joined
Jun 19, 2019
Messages
6
It'd be worth checking whether the secondary already has a snapshot with that name/date/time, as the error suggests it's trying to send a snapshot that already exists.

The snapshot does exist on the secondary server. Could it be because there are also snapshots being taken on the secondary server?
 
Joined
Jul 3, 2015
Messages
926
Shouldn't be, but you can check that in the scheduled snapshot section.

Assuming not, it could have just got its knickers in a twist for some reason.
 

swarren

Cadet
Joined
Jun 19, 2019
Messages
6
Shouldn't be, but you can check that in the scheduled snapshot section.

Assuming not, it could have just got its knickers in a twist for some reason.

There were "Periodic Snapshot Tasks" configured on the secondary server. I destroyed the snapshots for Public and Development, and disabled the snapshot tasks on the secondary server. I will report back how it goes over the next few days. Thanks for your guidance!
 
Joined
Jul 3, 2015
Messages
926
There were "Periodic Snapshot Tasks" configured on the secondary server. I destroyed the snapshots for Public and Development, and disabled the snapshot tasks on the secondary server. I will report back how it goes over the next few days. Thanks for your guidance!
Ah, that could well be your issue then. Best of luck.
 