Refactor Pool and Remote Replication

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
In the next couple of years, I expect to hit 80% utilization on my pool. I'm mostly out of bays, so my plan is to expand the pool by replacing the oldest drives with higher-capacity models. My concern is that once I do, the new disks will be large enough to meaningfully increase the risk of a dual failure during a resilver. Therefore I want to refactor both vdevs comprising the pool from Z1 to Z2.

Here is the background:
  • I am using SCALE
  • I have 30 containers running
  • Pool is comprised of 2 vdevs
  • Each vdev is a Z1 configuration: one is 5x 2 TB drives, the other 5x 4 TB
  • I want to refactor each vdev from Z1 to Z2
  • A friend has plenty of spare capacity on his Z2 pool on a CORE server
  • I have the important data already backed up on GCS and S3, but it's object storage with high retrieval costs (intended for worst-case recovery)
  • I also have the largest datasets replicated to a single extra disk in my system (normally used as large scratch space), hoping to avoid pulling the entire set back from the friend if possible, but see the question about this below
  • rsync.net would also be a great option for this procedure, but with the willing friend I save a few hundred dollars; I have one concern that applies to both this option and the friend option, which I've added to the questions below
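Before starting, the current layout and utilization are easy to confirm from the shell; a minimal check, with `tank` as a placeholder for my pool name:

```
# Per-vdev layout, capacity, and allocation (confirms the two 5-disk Z1 vdevs)
zpool list -v tank

# Health and per-device read/write/cksum counters
zpool status -v tank

# Per-dataset usage, to see roughly how much has to move
zfs list -o name,used,avail -r tank
```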
Here are my steps:
  • Friend creates a non-superuser account on his CORE server, and I give him my SSH pubkey
  • Friend creates a dataset for me and assigns full permissions to my user
  • Friend runs `zfs allow <user> create,destroy,diff,mount,readonly,receive,release,send,userprop <pool/dataset>`
  • I add this as an SSH Connection on my SCALE server
  • I recursively snapshot my pool
  • I create a replication task to push all of my datasets to his server, including hundreds of child snapshots and using destination encryption (the raw-ZFS equivalent is sketched just after this list)
  • I have ~11 TiB to push over via my 500 Mbps upstream to his 1 Gbps downstream; best case this will take ~3 days
  • When the replication finishes, stop all services and containers
  • Recursively snapshot again
  • Run the replication task again to get any stragglers
  • Back up the SCALE configuration database and the SSH keys
  • Remove one of the boot pool mirror drives to preserve the running configuration
  • Reset the SCALE configuration to defaults
  • Wipe the pool disks
  • Create new vdevs from the same disks, this time as Z2 instead of Z1, keeping the original pool name
  • Set up the replication tasks again
  • Replicate available datasets from the single disk to the new pool (see question about this below)
  • Replicate the remaining datasets back from the friend's pool to the new pool
  • Swap the boot pool mirror drives to restore the original configuration
  • Restore the SCALE database
  • Reboot and hope that all of the services and containers come into operation like nothing happened
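For reference, the raw-ZFS equivalent of the key steps looks roughly like the sketch below. Host, user, pool, dataset, snapshot, and device names are all placeholders, and the TrueNAS GUI replication task and pool wizard are what I'll actually use; the GUI task also handles the destination-encryption detail that this sketch omits. On the bandwidth estimate: 11 TiB is roughly 9.7 × 10^13 bits, which at a flat 500 Mbps is about 54 hours, so ~3 days is a plausible best case once protocol overhead and real-world throughput are factored in.

```
# On the friend's CORE server (as root): delegate just enough for receive-only use
zfs allow backupuser create,destroy,diff,mount,readonly,receive,release,send,userprop bigpool/yottabit
zfs allow bigpool/yottabit    # verify the delegation took

# On my SCALE server: first recursive snapshot of the whole pool
zfs snapshot -r tank@refactor-1

# Full recursive send of every dataset and its snapshots over SSH
zfs send -R tank@refactor-1 | ssh backupuser@friend.example.com zfs receive -s -u bigpool/yottabit/tank

# After stopping services/containers: second snapshot, then an incremental send of the stragglers
zfs snapshot -r tank@refactor-2
zfs send -R -I tank@refactor-1 tank@refactor-2 | ssh backupuser@friend.example.com zfs receive -s -u bigpool/yottabit/tank

# After the config backup and reset: destroy the old pool and re-create both vdevs
# as RAID-Z2 with the original pool name (the GUI pool wizard is the supported way)
zpool destroy tank
zpool create tank \
    raidz2 sda sdb sdc sdd sde \
    raidz2 sdf sdg sdh sdi sdj
```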
Questions:
  • Does ZFS checksum single-disk pools? It seems so, since the pool status shows read/write/cksum counters, so I should be able to scrub after replicating to it, and also watch for errors while replicating back to the refactored pool (see the sketch after this list). (I understand, of course, that if there is an error that data is lost; I only want this disk to speed up recovery after the refactor, and if I hit any error I will fall back to pulling from the remote server.)
  • Assume I am running the latest pool version in SCALE and he is running the latest pool version in CORE; isn't SCALE still ahead on the ZFS version? Can I send datasets to a pool running an older version? Or does the version conflict only cause problems when attempting to import? This could also affect the rsync.net option.
  • CORE/FreeBSD has a sysctl to allow non-root to mount ZFS datasets, but I shouldn't need to use this if I'm just replicating and not trying to access the data directly on his server, right?
  • If he were to upgrade to SCALE (relevant because of the possible ZFS version mismatch above): SCALE/Linux does not allow non-root users to mount datasets, but would I still be able to replicate as a non-root user, given that I don't need to mount the datasets on his server?
  • If my new pool and datasets are exactly the same names after the refactor, all my services and containers should work after restoring the database, right?
  • Any more gotchas I didn't think of?
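A hedged sketch of how I'd check the first three questions myself rather than guess (pool, host, and user names are placeholders):

```
# Q1: scrub the single-disk scratch pool after replicating to it; any checksum
# errors show up under CKSUM in the status output (detectable, not repairable)
zpool scrub scratch
zpool status -v scratch

# Q2: compare enabled feature flags between my SCALE pool and his CORE pool;
# a receive only fails if the stream needs a feature the destination lacks
zpool get all tank | grep feature@
ssh backupuser@friend.example.com "zpool get all bigpool | grep feature@"

# Q3: the FreeBSD sysctl for non-root mounts, which should stay irrelevant
# for a receive-only, never-mounted destination
ssh backupuser@friend.example.com sysctl vfs.usermount
```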
Thanks in advance!

Edit: added another idea for the configuration restore by just swapping the boot pool mirror drives, instead of restoring the database.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
Be careful if you're using SCALE Apps as well: you shouldn't snapshot the ix-applications dataset, and you cannot restore it on a new system from just the snapshot anyway.

We have a guide for backup and restore of the Apps system here:
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Be careful if you're using SCALE Apps as well: you shouldn't snapshot the ix-applications dataset, and you cannot restore it on a new system from just the snapshot anyway.

We have a guide for backup and restore of the Apps system here:
Thank you! I already snapped it once for a test, but I haven't done any upgrades or anything, so I'm probably (hopefully) still safe. I'll delete that snapshot now. Thanks again! I'll read the guide.
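For the record, removing that test snapshot recursively should be something like this (the snapshot name is a placeholder; the dry-run flags list what would be destroyed first):

```
# Dry run: list every child snapshot the recursive destroy would remove
zfs destroy -rnv tank/ix-applications@manual-test

# Actually destroy the snapshot and all child snapshots of the same name
zfs destroy -r tank/ix-applications@manual-test
```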
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Argh. @truecharts, I noticed that my recursive snapshot of ix-applications created 653 snapshots for all the child datasets.

Should I delete those snapshots? Or will they cause no harm to leave in place?

And why can't I send the entire ix-applications parent dataset, and restore it, before swapping my boot pool mirrored drives back to the original configuration? As long as the apps configuration uses the pool name, and not some uuid/serial of the pool and/or root dataset, it should never know anything happened...
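Part of what I'm unsure about is whether the apps layer keys on the pool name (which I will preserve) or on GUIDs (which a destroy-and-recreate cannot preserve). The values that will change are easy to check; `tank` is a placeholder for my pool:

```
# The pool GUID is regenerated when the pool is destroyed and re-created, even with the same name
zpool get guid tank

# Dataset GUIDs change too, including ix-applications and its children
zfs get guid tank/ix-applications
```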

I will still perform the backup as shown in the guide you linked, but I will also test my method in a VM and see what happens.

Please advise if there's something I'm missing here. Thank you!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If the eventual source and target systems are the same one (not two distinct systems, even if identical), I don't think it's a problem (as long as you don't mind having the apps all stopped while you do the replication out and in). ZFS replication is done at the block level and produces pool contents that should be indistinguishable from an application's perspective, making it no different from taking the pool offline and putting it back with the same data on it... more or less what happens in a reboot.

The replication to a "backup system" with the intent to bring that system up to replace the primary in a disaster is the scenario that would be problematic/impossible in my view.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788
Argh. @truecharts, I noticed that my recursive snapshot of ix-applications created 653 snapshots for all the child datasets.

Should I delete those snapshots? Or will they cause no harm to leave in place?

Currently, the backup tool also makes those... It's a bug: it's basically making snapshots of each Docker layer.
It should be fixed in the nightly and in the next release.

And why can't I send the entire ix-applications parent dataset, and restore it, before swapping my boot pool mirrored drives back to the original configuration? As long as the apps configuration uses the pool name, and not some uuid/serial of the pool and/or root dataset, it should never know anything happened...

There is more to Apps than just the ix-applications dataset. There is a whole configuration layer besides that.
That layer is NOT designed to deal with moving the ix-applications dataset.

If you move it to another system, or even to a reinstall on the same system, this WILL break your Apps if you did not correctly use a backup and restore.
If it is truly the same system and the pool name stays the same, you can theoretically do what you said, but ANY change to the boot or applications pool will (start to) break things.

(And yes, we have already researched your method; there is no need to test it again.)
We've verified with iX developers that our method works correctly with their middleware. Other methods (such as the one proposed) are not guaranteed to be safe for your data.

Also, a word of warning: restore is currently broken; it's fixed in the nightly.
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
tl;dr: with the exception of restoring the boot pool mirror, everything worked perfectly, and my main pool vdevs are now refactored from 2x Z1 to 2x Z2. All apps/containers/VMs/shares/tasks were preserved and everything is running great. Added bonus: 0% fragmentation on the pool after clearing out 10 years of cruft!

Everything went well except rebuilding the original boot pool into a mirror.

Before reinstalling the original-version boot disk from the original mirror, while still running from the temporary boot disk, I used dd to write zeros to the first 20 MB of that temporary disk, hoping to prevent the original OS from being confused about which of the two boot disks held the correct version and resilvering in the wrong direction.
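Roughly what that looked like; the device name is a placeholder and obviously needs to be triple-checked before zeroing anything:

```
# Identify the temporary boot disk by size/serial before touching anything
lsblk -o NAME,SIZE,SERIAL,MODEL

# Zero the first 20 MB, wiping the primary partition table and boot area so the
# old member is no longer recognized as a current boot-pool device
dd if=/dev/zero of=/dev/sdX bs=1M count=20 status=progress
sync
```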

This worked, and I was able to perform the replace operation, and I saw the resilver complete successfully. One last reboot to test, and unfortunately the boot halted with a kernel panic. I then tried each of the mirrored boot drives separately, and had the same result.

I was able to boot into maintenance mode on an older release without the kernel panic, checked zpool status, and saw unrecoverable corruption reported. I'm not sure how that happened, since the original boot disk booted just fine after reinstalling it and the resilver onto the temporary drive completed successfully afterward, but somehow the corruption was there.
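For reference, the check from maintenance mode was essentially just this (SCALE names its boot pool `boot-pool`):

```
# Permanent errors and per-device CKSUM counters; this is where the
# unrecoverable corruption was reported
zpool status -v boot-pool
```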

I ended up booting the TrueNAS SCALE install media and performing a fresh format and install onto the boot pool mirror. I imported the backed-up config database, and voilà, everything is running perfectly again.

I do wonder now whether I should move the system dataset from the boot pool mirror to the refactored 2x Z2 pool. But as long as I keep regular database backups (always after major changes and before any upgrades), I suppose the system dataset isn't so important, since I don't need it as the place to retrieve the backup database from. As far as I can tell, nothing else important lives in the system dataset that can't be regenerated after importing the config database; my 30 containers and Kubernetes configurations were all preserved.
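A minimal way to see where the system dataset currently lives before deciding, assuming the usual `.system` child-dataset naming; the actual move is done from the SCALE UI (under System Settings, if I recall correctly):

```
# The system dataset is a child dataset named ".system" on whichever pool holds it
zfs list -o name,used,mountpoint | grep '/\.system'
```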
 