Unveiling ZFS Replication Quirk: Your Destination Snapshots at Risk?

Joined
Jul 3, 2015
Messages
926
It’s a nice idea. I’d like to try it in practice and think about how to automate it better.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Let me spend some time clarifying my position here in general before I answer the direct question. We are all speculating and making assumptions about the OP's situation. We don't know if it's some rando server in a basement for Plex or if it's a production environment filled with critical records. Since these variables are unknown, I am being conservative in my answers here and speaking with my sysadmin hat on, not my homelab hat.
Ok so this is a manual ad-hoc suggestion?
This is a manual ad-hoc suggestion, yes. Create the job. Run it once. Hold the first snapshots on both sides. Done. If the datasets are highly dynamic this is not an easy thing to manage. You would have to manually remove holds and create new ones periodically. You could easily script this with cron and a bash script if you wanted (sketch below), but for backups I'd actually prefer the manual approach.

If the datasets don't change so much, you can probably use this strategy for years without ever looking back...
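
If you did want to script it, a minimal sketch of a cron-driven rotation might look like the following. The dataset name, hold tag, and schedule are all made up, and you would run an equivalent script on the destination side as well:

#!/bin/sh
# Rotate a "keep" hold onto the newest snapshot of a dataset.
# Dataset name, tag, and schedule are hypothetical; adjust to taste,
# and run a matching script on the destination box.
DATASET="tank/data"
TAG="keep"

# newest snapshot of this dataset only (-d 1 keeps child datasets out of the list)
NEWEST=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DATASET" | tail -n 1)

# pin the newest snapshot first, so there is never a moment with no hold at all
zfs hold "$TAG" "$NEWEST" 2>/dev/null

# then drop the tag from every older snapshot that still carries it
for SNAP in $(zfs list -H -t snapshot -o name -d 1 "$DATASET"); do
    [ "$SNAP" = "$NEWEST" ] && continue
    zfs release "$TAG" "$SNAP" 2>/dev/null
done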

I guess this won't help the situation the OP has raised around being hijacked and someone letting rip and rolling back all the datasets?
Let's assume the source side is compromised and they took that action: the data is fucked on that system. That's why we have a backup system, which should have a different root password and/or 2FA.

Again - backup and disaster recovery are an art form. KISS is a great idea in principle and is fine for a lot of situations. However, once you start playing devil's advocate and try to account for variables and multiple different attack vectors... it gets complicated. There's a reason there are degree and certification programs for cyber security. In these situations it's up to you, the sysadmin, to engineer your solutions, understand how they work, and TO TEST THEM. ZFS snapshots and replication are only one tool in the toolbox.

PS: read the resource if you have not yet :) I kick it up to 11.
 
Last edited:

tnuser9999

Dabbler
Joined
Jun 29, 2023
Messages
40
I really appreciate the discussion around this. I do agree that we all should be using the 3-2-1 backup strategy. With the current state of ZFS replication on TrueNAS, you should definitely not be relying on the data being there if you get owned by a bad actor, at least not until there is an option where source changes, such as a rollback, do not affect the destination snapshots.

You can make sure a system is as secure as you can, but that is no guarantee that the system cannot be exploited. Relying on the assurance that a system is impenetrable is dangerous. FYI, I do work in cybersecurity!
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
At least not until there is an option where source changes, such as a rollback, do not affect the destination snapshots.
That is fundamentally incompatible with the very concept of a snapshot. The entire purpose of ZFS send is to ensure snapshots exactly replicate data between two pools or systems.

Can you elaborate on the failure mode a bit? I want to make sure I am understanding the ask here.
 
Joined
Oct 22, 2019
Messages
3,641
Ok so this is a manual ad-hoc suggestion? I guess this won't help the situation the OP has raised around being hijacked and someone letting rip and rolling back all the datasets?
Unless they zfs hold one snap on every dataset on both sides?

To be honest, I don't think any of this matters. If someone gets root access to your TrueNAS box, they can just destroy everything on both sides. (I say both sides, because they have de facto access via the stored SSH keys on the server.)

So really, any "solution" is only meant to mitigate accidents, mistakes, and unintentional quirks of replications and rollbacks. Using "hold" (or any feature, really) won't help much if someone gains root access.


Then remember to release them from time to time so they don’t run out of space?
This is feasible with the command-line. I use it myself. It's not complex to "once in a while" use hold and release to at least offer some sort of "safeguard" against accidents or unintentional snapshot destruction.
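
For anyone following along at home, the commands themselves are short (the snapshot name and tag below are just examples):

zfs hold keep tank/data@manual-2024-01-01      # pin it; zfs destroy now fails with "dataset is busy"
zfs holds tank/data@manual-2024-01-01          # list the tags currently holding it
zfs release keep tank/data@manual-2024-01-01   # drop the hold when it is no longer needed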

Still wish it was integrated into the GUI.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
To be honest, I don't think any of this matters. If someone gets root access to your TrueNAS box, they can just destroy everything on both sides. (I say both sides, because they have de facto access via the stored SSH keys on the server.)
Yeah, that's fair enough. Hence my point about 2FA. Allowing someone to easily get root access to your box is a non-starter FOR ANY backup software...

To use Veeam as an example again, let's assume you put your Veeam server on your domain. Now let's assume a domain admin user got compromised. What do you think one of the first actions from the bad actor might be?

In my sysadmin life I handled this by backing up critical services, including my Veeam server, to a Synology with Active Backup for Business. Different passwords, different OS, etc...

Insert obligatory Macklemore reference. There's layers to this shit playa, tiramisu.
https://www.youtube.com/watch?v=JGhoLcsr8GA

Also FWIW you probably should not run your TrueNAS with SSH enabled outside of troubleshooting. It's an attack vector you shouldn't leave open unless needed. It will need to be enabled on the destination server because of how replication works on TN, but it should not be on the source.
 
Joined
Oct 22, 2019
Messages
3,641
Also FWIW you probably should not run your TrueNAS with SSH enabled outside of troubleshooting.
No SSH? So no command-line from a client terminal. No rsync over SSH. You lose quite a bit from disabling SSH on TrueNAS.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
No SSH? So no command-line from a client terminal. No rsync over SSH. You lose quite a bit from disabling SSH on TrueNAS.
I edited my post to clarify source/destination logic.
 

tnuser9999

Dabbler
Joined
Jun 29, 2023
Messages
40
To be honest, I don't think any of this matters. If someone gets root access to your TrueNAS box, they can just destroy everything on both sides. (I say both sides, because they have de facto access via the stored SSH keys on the server.)

So really, any "solution" is only meant to mitigate accidents, mistakes, and unintentional quirks of replications and rollbacks. Using "hold" (or any feature, really) won't help much if someone gains root access.



This is feasible with the command-line. I use it myself. It's not complex to "once in a while" use hold and release to at least offer some sort of "safeguard" against accidents or unintentional snapshot destruction.

Still wish it was integrated into the GUI.
Please help me understand what you mean? If you are replicating via the pull option, the only key that you must enter on your source TrueNAS box would be the public key. The source box doesn't have to have SSH access to the destination box. I wouldn't ever use a push option if I cared about the data being protected on the destination. Now, if you do not rely on ZFS replication for backup (that is handled by another solution) and you are just using it for replication, i.e. quick access to the data in specific failure situations, PUSH is probably fine.
 
Last edited:

tnuser9999

Dabbler
Joined
Jun 29, 2023
Messages
40
That is fundamentally incompatible with the very concept of a snapshot. The entire purpose of ZFS send is to ensure snapshots exactly replicate data between two pools or systems.

Can you elaborate on the failure mode a bit? I want to make sure I am understanding the ask here.
If you are using the destination as a backup solution, I would say you would not want what happens to the source ZFS snapshots to affect the destination ZFS snapshots. That is a dangerous situation to be in; you had better hope you have another backup solution as well.

During my testing, destination snapshot retention is not affected by deletion of snapshots on the source, which is great if you choose custom retention on the destination server! I even tested deleting all common snapshots: the replication then fails as expected, but the destination snapshots are intact, and in a disaster recovery scenario the destination would become the active server, or you wipe the source and replicate fully back to it.

However, where I found the failure was a rollback performed on the source: when a replication then happens, the destination snapshots are destroyed as well and things continue to work. That is awesome if pure replication is what you want, but don't rely on it for a backup.

What you would ideally want to happen, if you rely on replication as a backup, is that after a bad actor decides to destroy your data with a rollback on the source, a replication takes place and it fails. You get an alert: oh boy! Batten down the hatches, at least your data is safe. If you intentionally attempt a serious operation like a rollback on the source, I would hope that while you are manually doing that you would also manually roll back your destination server to the same snapshot so things continue to work.
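
To make that failure mode concrete: my understanding (the exact mechanism TrueNAS/zettarepl uses internally is an assumption on my part) is that the behavior I observed is roughly equivalent to this sequence at the raw ZFS level, with made-up pool and snapshot names:

# what the source (or an attacker on it) does: destroy everything after @old
zfs rollback -r tank/data@old
zfs snapshot tank/data@new

# what the pull replication on the backup box then effectively does: it must
# get back to the last common snapshot before the incremental can apply, so
# every destination snapshot newer than @old is destroyed too
zfs rollback -r backuppool/data@old
ssh source zfs send -I tank/data@old tank/data@new | zfs recv backuppool/data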
 
Joined
Oct 22, 2019
Messages
3,641
The source box doesn't have to have SSH access to the destination box.
That's true. Double-dyslexia? I associate references to "source" as local, and "destination" as remote.

When it's specifically pull replication, the terms can be mixed up.
 
Joined
Jun 15, 2022
Messages
674
I fundamentally and completely disagree with you. Whether we are talking about ZFS replication or simply keeping a second copy of your data with rsync or some other methodology, that is literally the definition of a good backup strategy. Also, ZFS replication is one of the few solutions to this problem that preserves ACLs and xattrs... which is critical for some workloads.

For the particulars of ZFS replication, you can literally use the TN wizard in a "set-and-forget" way using the defaults, or you can tune it to do all sorts of advanced things like we have been talking about in this thread.

Pray tell, what "archival" software on a remote system offers better features, performance, reliability, simplicity, etc.? You talked about VirtualBox and why its differential snapshots are great above. This is exactly what ZFS snapshots do. Since replication literally relies on (requires) snapshots, it's not a dissimilar situation.
Taking things out of context for the sake of being argumentative is not constructive.

VBox snapshots are useful, though certainly not backups and do not function well as such, which often applies to snapshots in general.

Rather than writing a manual on how to use ZFS snapshots and replication to make backups for a specific use case which is likely to change in the future and break the process (which is at the core of much of this thread) it seems more productive to use archival software designed for the task. Archival software simplifies otherwise manual processes-- why reinvent the wheel?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I must admit that I have not yet thought about "pull" replication much, but as an old school network security engineer I can see how it could raise the bar for a hacker by quite some margin. Classical firewall and DMZ concepts all revolve around the question of who initiates a connection to where.

I currently use push everywhere I use TrueNAS. It's nice to control all aspects, including different retention periods, from the production system. But if your threat scenario is "the production system is hacked - how can I protect the backup?" then pull just makes so much more sense.

As for some numbers concerning the frequency - on our production hypervisors we use hourly snapshots with 1 week of retention. These machines run in a pair, each serving half of the production VMs and replicating them to the partner system. The assumption is that the greatest risk is one of the machines breaking. In that case we can just boot the VMs from the replicated images on the other node.

For my private long term storage needs I also snapshot VMs and jails hourly - so I can roll back a failed upgrade or other config changes quickly. These are replicated daily to a local RAIDZ2 spinning rust pool with 2 weeks retention. All of this plus the file sharing data goes to a remote backup system, daily snapshots, 3 months retention, and I'm thinking of increasing that to 6 months.

In our hosting environment we keep 2 weeks of daily snapshots on a remote system unless otherwise agreed upon with the customer.

What I can say from our experience with our hosting environment is that tens of thousands of snapshots in a ZFS pool is not a problem. Not at all.

Just my 2 ct. on some aspects of this discussion. Keep it going :wink:
 
Joined
Jul 3, 2015
Messages
926
How about a zpool checkpoint on the backup system to protect against this? You could schedule a cron job to create a checkpoint, say every week, then discard it and repeat the following week. Naturally you could do daily or monthly also. That way, in the event of any issue like a rollback or dataset deletion, you would be covered.
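
A rough sketch of what that cron job could look like on the backup host (the pool name and schedule are placeholders; note a pool can only carry one checkpoint at a time, so the previous one is discarded right before taking the next):

#!/bin/sh
# weekly checkpoint rotation for the backup pool, run from cron
POOL="backuppool"

# discard last week's checkpoint if one exists (the error is harmless if not)
zpool checkpoint --discard "$POOL" 2>/dev/null

# take a fresh checkpoint; a destructive rollback or dataset deletion on this
# pool could then be undone by exporting the pool and running
# "zpool import --rewind-to-checkpoint backuppool"
zpool checkpoint "$POOL"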
 

tnuser9999

Dabbler
Joined
Jun 29, 2023
Messages
40
I might investigate implementing the approach you mentioned, utilizing checkpoints to safeguard against this behavior. I still find ZFS replication intriguing, and TrueNAS's implementation simplifies management. Alternatively, I might choose to accept the limitations while appreciating its convenience for quick daily replication. For weekly backups, I would employ a solution like Restic, Kopia, or another backup tool to address the shortcomings of TrueNAS replication, following the 3-2-1 backup strategy.

Another consideration is delving deeper into FreeBSD and constructing a NAS. This would enable me to leverage tools like Syncoid for replication, which I believe incorporates certain safeguard features. Naturally, before delving into this extensively, I would need to achieve the same level of familiarity with Syncoid as I have with TrueNAS replication. Additionally, I would need to ascertain whether I can configure it in a manner that prevents snapshots on the destination from being deleted due to changes on the source.
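
For what it's worth, my understanding (worth verifying against the Syncoid documentation for whatever version I end up on) is that a pull run on the backup host would look something like the line below. The hostnames and dataset names are made up, and --no-rollback is the option I believe stops the target from being rolled back to match the source:

syncoid --no-rollback root@sourcehost:tank/data backuppool/data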
 

tnuser9999

Dabbler
Joined
Jun 29, 2023
Messages
40
I am also thinking about using zfs diff as a "file integrity monitoring" solution. It seems like a quick way to see what has changed daily in the datasets you care about. I will have to think about how I would script that to automate it and provide the reports daily.
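
A first stab at that script could be as simple as diffing the two newest snapshots of a dataset from cron and mailing the output. The dataset name, recipient, and mail command are all placeholders:

#!/bin/sh
# daily "what changed" report built from zfs diff
DATASET="tank/data"

# the two most recent snapshots of the dataset (oldest of the pair first)
OLD=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DATASET" | tail -n 2 | head -n 1)
NEW=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DATASET" | tail -n 1)

# zfs diff marks entries as + (added), - (removed), M (modified) or R (renamed)
zfs diff "$OLD" "$NEW" | mail -s "zfs diff report: $DATASET" admin@example.com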
 
tnuser9999

Dabbler
Joined
Jun 29, 2023
Messages
40
I thought I would pop back in to see if anything has changed with TrueNAS SCALE. Unfortunately, I tested, and SCALE is susceptible to the same vulnerability in a pull backup scenario. Snapshot rollbacks on the source still affect the snapshots on the destination. I tested by rolling back on the source and then creating a new snapshot. I then initiated a pull replication from the destination, which removed any snapshots that did not exist on the source since the rollback and added the newly created snapshot.

Great for two systems trying to stay identical, but not suitable for a situation where one is being used as a backup. A bad actor or a mistake by an admin on the source could wipe out data both there and where you were hoping it would be in case of data loss on the source. I was hoping there was an option not to do this and to just fail, requiring manual intervention.
 