Help in understanding snapshot replication

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
I'm looking at a DR strategy where I have two TrueNAS SCALE devices at two locations. The primary location serves a dataset as an NFS share, and I am snapshotting that dataset every 5 minutes. I have a Docker Swarm that uses the NFS share for persistent storage, and all of that works great today.

The DR strategy I am looking at is to replicate those snapshots from the primary location to the TrueNAS SCALE device at the backup location, which I am already doing, and I also have nodes in the Docker Swarm at that site in a DRAIN state, so they don't take any container load. Both sites have manager nodes in the Swarm, so no worries there. What I want to do is this: when a failure of the primary site is detected, a script sets the nodes at the backup site back to ACTIVE availability, meaning they will start any services/containers that aren't running, according to the rules of each service, and connect to the NFS share at the backup location, since that's where they are.
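Roughly, the failover step in my script would look something like this (Python; the node names are placeholders, and it assumes it runs on a Swarm manager with the Docker CLI available):

```python
import subprocess

# Hypothetical names of the Swarm nodes at the backup site.
BACKUP_SITE_NODES = ["dr-worker-1", "dr-worker-2", "dr-worker-3"]

def activate_backup_site():
    """Flip the backup-site nodes from drain to active so the Swarm
    can schedule services onto them."""
    for node in BACKUP_SITE_NODES:
        # 'docker node update' has to be run against a manager node.
        subprocess.run(
            ["docker", "node", "update", "--availability", "active", node],
            check=True,
        )

if __name__ == "__main__":
    activate_backup_site()
```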

All of this is easy, but what I am noticing is that at the backup site, if I look at the mount for the replicated snapshot, the data is only from the oldest snapshot, not the latest, which is what I was hoping to see. What I want is that when the new snapshot comes in every 5 minutes, the mount reflects the new data. I know that when the primary site goes down I will have to disable the replication job, manually snapshot the dataset in the backup location, and manually replicate back; that's fine. I was just hoping that the "pointer" on the dataset at the backup site would automatically move forward as each new snapshot came in.

If that's not the behavior, fine; I just need to add extra logic to the script to first "restore" to the latest snapshot. I'm mainly wanting to know whether what I am observing is normal behavior, or whether what I want is the normal behavior and what I'm observing is due to some setting I flipped somewhere without realizing it.
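If I do end up needing that extra step, I'm picturing something like the following (Python again; the dataset name is a placeholder, and zfs rollback -r throws away anything written after the chosen snapshot, which is fine on a DR copy):

```python
import subprocess

DATASET = "backup-pool/nfs-share"  # placeholder name for the replicated dataset

def rollback_to_latest(dataset: str) -> None:
    """Roll the replicated dataset back to its newest snapshot."""
    # List this dataset's snapshots sorted by creation time, newest last.
    snaps = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-s", "creation", dataset],
        check=True, capture_output=True, text=True,
    ).stdout.splitlines()
    if not snaps:
        raise RuntimeError(f"no snapshots found for {dataset}")
    # -r destroys any snapshots newer than the target; acceptable on the DR copy.
    subprocess.run(["zfs", "rollback", "-r", snaps[-1]], check=True)

if __name__ == "__main__":
    rollback_to_latest(DATASET)
```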

My second question is, if I did create a file in the backup location for any reason, and then "restored" the replicated snapshot from the primary location, is that file gone because it didn't exist in the snapshot, or is that file still there, but the data in the snapshot is merged with the dataset?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
if I did create a file in the backup location for any reason, and then "restored" the replicated snapshot from the primary location, is that file gone because it didn't exist in the snapshot, or is that file still there, but the data in the snapshot is merged with the dataset?
The file is gone. (although it's probably still on the disk and may be able to be found by a recovery tool if you stop all further writes to the disk)

Rolling back a snapshot isn't the preferred way to get data back, particularly if you only want some files to go back... use the .zfs directory and the browsable snapshots under it to find the appropriate file(s) and copy them back to the desired location.
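For example (pool, dataset, snapshot and file names made up), pulling a single file back out is just a copy from the hidden snapshot directory:

```python
import shutil

# Hypothetical paths: snapshots are browsable read-only under <dataset>/.zfs/snapshot/
snap_file = "/mnt/tank/share/.zfs/snapshot/auto-2023-06-01_12-00/docs/report.odt"
live_file = "/mnt/tank/share/docs/report.odt"

shutil.copy2(snap_file, live_file)  # copy the old version back into the live dataset
```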

if I look at the mount for the replicated snapshot, the data is only from the oldest snapshot, not the latest
That's possibly a reflection of the way you're mounting it... since snapshots are being replicated in, under the hood, maybe the platform you're mounting it on isn't getting the message that something needs to be re-checked.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
since snapshots are being replicated in, under the hood,
@sretalla, for my education, can you please explain the meaning/background to the above cut from your response?
TIA.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
please explain the meaning/background to the above cut from your response?
Dataset (or pool and a bunch of child datasets) replicated.

Snapshots replicate incrementally to the target one by one as they are created on the source. (which means the target copy has the original and all subsequent snapshots).

Depending on how the mount has been done, it's possible that nothing tells the target system to "update" the mount to take into account the snapshots as they arrive. (which is the point about "under the hood"... the process working on the mount is just driving the car... no idea we added extra cylinders to the engine... maybe a bad analogy)
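If it helps, under the hood each replication run is roughly an incremental send from the last snapshot both sides have to the newest one on the source, something like this (dataset, snapshot and host names are made up, it's wrapped in Python purely for illustration, and the TrueNAS replication engine does all of this for you):

```python
import subprocess

# Placeholders: the last snapshot both sides already have, the new one,
# and the target host/dataset.
prev_snap = "tank/share@auto-0900"
new_snap = "tank/share@auto-0905"
target_host = "truenas-backup"
target_dataset = "backup/share"

# 'zfs send -i' streams only the blocks that changed between the two snapshots;
# after 'zfs recv' the target holds both snapshots, but nothing tells whatever
# is consuming the target's mount that anything changed.
send = subprocess.Popen(["zfs", "send", "-i", prev_snap, new_snap],
                        stdout=subprocess.PIPE)
subprocess.run(["ssh", target_host, "zfs", "recv", target_dataset],
               stdin=send.stdout, check=True)
send.stdout.close()
send.wait()
```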
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Thank you - it was indeed the analogy that had me wondering if I didn't understand the point you were making.
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
The file is gone. (although it's probably still on the disk and may be able to be found by a recovery tool if you stop all further writes to the disk)

Rolling back a snapshot isn't the preferred way to get data back, particularly if you only want some files to go back... use the .zfs directory and the browsable snapshots under it to find the appropriate file(s) and copy them back to the desired location.


That's possibly a reflection of the way you're mounting it... since snapshots are being replicated in, under the hood, maybe the platform you're mounting it on isn't getting the message that something needs to be re-checked.
I was actually looking on the TrueNAS system itself, to eliminate any oddities with NFS. So I ended up doing a little test: once I had snapshot replication all nice and happy, I deliberately goofed it up. With replication occurring, I created a file on the new filer, and after the next replication the new file was gone. That pretty much answered two questions I had at once: replicating a snapshot will "move the pointer" on the destination to the latest snapshot, and restoring a snapshot will not merge data, but rather make the dataset/zvol identical to the snapshot, which is what I was expecting.
For restores, yes, I typically would not restore the entire snapshot, but rather either clone it to a new dataset/zvol and copy only the files I need, or navigate into the snapshot directory and copy the files that way.
I was more curious about the DR situation I described, and I believe I am on track with my expectations, with the replication interval as the RPO. So if my snapshot and replication intervals are both 10 minutes, I could have either 10 or 20 minutes of missing data, depending on whether the latest snapshot happened prior to the latest replication. Since a snapshot covering 15 minutes is a few MB at best, maybe 100 MB, I may do 10-minute snapshots and 5-minute replications. Either way, I do know that if I had to do the DR, part of my script will need to stop the replication from the primary filer to the secondary filer so that the new data at the secondary site doesn't get trashed by the next replication. That's going to be some API calls, and I believe TrueNAS has a public API, so that should be fun.
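Something along these lines, I'm guessing, for the API part; the host, API key and task id are placeholders, and I still need to verify the exact v2.0 endpoint and field names against the API docs on the box itself:

```python
import requests

# Placeholders: the TrueNAS host that owns the replication task, an API key,
# and the numeric id of the replication task to disable.
TRUENAS_HOST = "https://truenas-primary.example.com"
API_KEY = "1-xxxxxxxxxxxxxxxx"
REPLICATION_TASK_ID = 1

# Disable the replication task so a later run can't overwrite what the
# secondary site writes after failover (assumes the v2.0 REST API layout).
resp = requests.put(
    f"{TRUENAS_HOST}/api/v2.0/replication/id/{REPLICATION_TASK_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"enabled": False},
    timeout=30,
)
resp.raise_for_status()
```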
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
Dataset (or pool and a bunch of child datasets) replicated.

Snapshots replicate incrementally to the target one by one as they are created on the source. (which means the target copy has the original and all subsequent snapshots).

Depending on how the mount has been done, it's possible that nothing tells the target system to "update" the mount to take into account the snapshots as they arrive. (which is the point about "under the hood"... the process working on the mount is just driving the car... no idea we added extra cylinders to the engine... maybe a bad analogy)
Very good analogy, actually. The way I'm going to deal with this is to use AutoFS, so that the mount doesn't go active until an I/O request is sent to it. So when the secondary site is in its standby state, the NFS share wouldn't actually be mounted, but when it gets spun up, the I/O will bring the mount online with the (hopefully) right data from the latest snapshot.
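For reference, the kind of autofs map I have in mind (mount root, map file name, server and export path are all placeholders):

```
# /etc/auto.master: mount anything under /mnt/dr on demand, using the map below
/mnt/dr  /etc/auto.dr  --timeout=300

# /etc/auto.dr: map entry for the replicated NFS export at the backup site
swarm-data  -fstype=nfs4,rw  truenas-backup:/mnt/backup-pool/nfs-share
```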
 