RAIDZ expansion, it's happening ... someday!

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ever been to a theme park? Ever actually read the receipt you got when you paid? You basically have no recourse for ANYTHING that may happen to you or anyone you are responsible for. In a lot of states, failing to read and follow the directions on the signs puts you personally at legal risk; in some, even verbal instructions from an operator have to be followed. So I could basically tell you to strip to ride a ride and, legally, you would be required to do it. Not to mention that the rides can easily kill; most are just industrial machines with a pretty facade built around them. I ended up quitting because my supervisor was totally clueless about safety, among other things, and management didn't have the balls to fix the issues. I had no desire to be there when someone did get killed.

The reason I mention this is that I worked at one for a couple of years, and no matter how many times you post something, say something, or play a message, people will STILL do the opposite. I could post signs that say "once you enter this park you are transferring your home and all monies or other worldly possessions to the park owners," and most people would not blink at it until after it had happened.

Another more crude way to put it is, you can't fix stupid.
Your points are valid, but there is a difference between passively ignoring a warning notice and actively seeking a path around the barrier that is covering the open manhole with the express intent of standing on top of the open manhole.
 
Joined
Apr 9, 2015
Messages
1,258
I agree. I have watched people jump fences to go grab lost items as well; the signs are there, the barriers are there, and yet they still just can't wait.

It's just a fact of life you sometimes have to deal with, and then you tell them, "Hey, you're now screwed; sorry, you did it to yourself."
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Yes, it solves 99% of the problem.

I really don't understand it. The GUI throws warnings at people and they just ignore them. Where have they been desensitized to warning messages?
EULA

Click, click, click, click...oh crap!
 
Joined
Apr 9, 2015
Messages
1,258
EULA

Click, click, click, click...oh crap!


Yeah, I wonder how many EULAs for some obscure piece of software out there end up signing someone into slavery just for the fun of it.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, it's linked from the OpenZFS project site, somewhere on YouTube.

Shakycam warning, by the way.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Shakycam wasn't too bad. Points from the video:
  • It will work with RAIDZ1/2/3
  • Sounds like you can expand the vdev any number of times
  • It will not work if your vdev has any missing devices
  • The parity:data ratio in place at the time data was written doesn't change. So, with an n-disk RAIDZp pool, that ratio will be p:n-p. If you add a disk to that pool, the existing data will have that same ratio, but newly-written data will be at p:n-p+1. I don't think my mind has worked its way around the implications of this yet.
  • The work was, at least in part, sponsored by iX (thanks, iX!)
  • No code has yet been written, they're hoping to have a live demo next year. No ETA.
  • Increasing RAID level isn't in the scope of this project, but is "a natural extension" of it.
  • Adding multiple disks at one time is also out of the scope, but a natural extension.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Raidz expansion will be a good feature, just because it works very naturally with how people think or wish ZFS already worked.

I am hoping to see the scrub improvements soon, as the sequential scrub now has improved prefetch behavior.
 
Joined
Apr 9, 2015
Messages
1,258
The one thing I would like to see changed is how they are going to "reflow" the data. I would like to see something that can empty the data out of the first X blocks to open space on another part of the pool and then rewrite the data for the new drive configuration. From what I gather, the stripe width of the old data will stay the same as it was before. By moving the first, let's say, 5 to 10% to open space on the original disks, then reading it back and writing it in a stripe as if it were new data across the entire pool, it would easily allow changes to a different RAIDZ level as well as let the reflow defragment the pool.

The way they are proposing to do it will be less intensive, but it will basically just move the fragmentation around without gaining any benefit from the larger stripe width. Honestly, it sounds as if, in a way, it will make fragmentation even worse. Or maybe I am understanding it wrong?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It will not work if your vdev has any missing devices
To clarify: If a drive needs replacement during this expansion process, it is interrupted until the resilver is finished.

The parity:data ratio in place at the time data was written doesn't change. So, with an n-disk RAIDZp pool, that ratio will be p:n-p. If you add a disk to that pool, the existing data will have that same ratio, but newly-written data will be at p:n-p+1. I don't think my mind has worked its way around the implications of this yet.
Not many, really. It just means that old data will consume the same space as it did before. Say you have a six-wide RAIDZ2 vdev with not-tiny data. Not-tiny data will be cut up into units of four data chunks plus two of parity, for a storage overhead of 33%. After expanding it to seven-wide, new writes will be cut up into five data chunks plus two parity, for a storage overhead of 29%. So, if zpool list says you're using 12TB before (8TB of data, 4TB of parity), you'll still be using 12TB after the expansion, because only new data benefits (well, it's a trade-off, not a straight win) from the reduced relative amount of parity. The next 8TB of data will not require 4TB of parity, but only about 3.2TB, for a total of roughly 11.2TB.
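To make the arithmetic above concrete, here's a rough sketch in Python. This is just a model of the large-block case (a RAIDZp stripe of n disks holds n-p data chunks plus p parity), not actual ZFS code, and it ignores padding and small blocks:

```python
def parity_bytes(data_tb, n_disks, parity):
    """Parity consumed by `data_tb` of large-block data on an
    n_disks-wide RAIDZ`parity` vdev (simplified: full-width stripes,
    no allocation padding)."""
    data_chunks = n_disks - parity
    return data_tb * parity / data_chunks

# Six-wide RAIDZ2: 8 TB of data costs 4 TB of parity (12 TB total).
old = parity_bytes(8, n_disks=6, parity=2)   # 4.0

# Seven-wide after expansion: the *next* 8 TB costs only 3.2 TB of
# parity, so it consumes about 11.2 TB instead of 12 TB.
new = parity_bytes(8, n_disks=7, parity=2)   # 3.2

print(old, new)
```

Already-written data keeps the old 4:2 geometry, so the old 12TB stays 12TB; only new writes get the 5:2 layout.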

The real implication of this is that a straight switch to a higher parity level (e.g. RAIDZ1 to RAIDZ2) would keep all existing data at the old parity level. So, if you had 8TB of data on a RAIDZ1 vdev, after turning it into a RAIDZ2 vdev, you'd still have all 8TB of data with only simple parity instead of double parity, with only new data being written with dual parity. So, the natural extension of intra-level expansion is probably bad overall, as it lulls the user into a false sense of security.

So, why not just re-write the data in the equivalent spots? I see two big problems with that:
  • It's massively slow because you don't get the exponentially-growing free space that allows the process to speed up rather quickly
  • It doesn't always work, particularly when the pool is very full
Let me elaborate on that second one. If you want to write a very small amount of data, it doesn't necessarily get striped over n-p disks. If it fits on a single disk's stripe, ZFS just writes the one stripe plus p parity chunks (similarly for intermediate amounts). You'll notice that this requires much more storage for parity than a big file would, especially as the vdev grows wider. Adding a single disk may not be enough in all cases to fit the additional parity. I think this can work in at least most cases if you add two disks at once to increase the parity level by one, but I have to properly examine the issue.
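The small-write effect described above can be sketched with a simplified allocation model (my simplification, not the exact ZFS allocator, which also pads allocations): each stripe row holds up to n-p data sectors plus p parity sectors, so a block that spans fewer data sectors still pays full parity per row.

```python
import math

def raidz_alloc_sectors(data_sectors, n_disks, parity):
    """Sectors consumed by a block on an n_disks-wide RAIDZ vdev
    (simplified model: per-row parity, no padding sectors)."""
    rows = math.ceil(data_sectors / (n_disks - parity))
    return data_sectors + rows * parity

# Six-wide RAIDZ2: a single-sector block needs 1 data + 2 parity
# sectors (200% parity overhead) ...
tiny = raidz_alloc_sectors(1, n_disks=6, parity=2)   # 3

# ... while a full-width block needs 4 data + 2 parity (50% overhead).
full = raidz_alloc_sectors(4, n_disks=6, parity=2)   # 6

print(tiny, full)
```

This is why a vdev full of small blocks can carry far more parity than the nominal p:n-p ratio suggests, and why adding a single disk might not leave room for the extra parity a higher level would need.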

Disclaimer: My math has not been checked.

The one thing that I would like to see change is how they are going to "reflow" the data. I would like to see something that can empty the data out of the first X number of blocks to an open space on another part of the pool and then rewrite the data for the new drive config.
That would require Block Pointer Rewrite, which is a very hard problem to solve in a safe, CoW, atomic way on an online pool.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
That would require Block Pointer Rewrite, which is a very hard problem to solve in a safe, CoW, atomic way on an online pool.
Which is why it's pretty much the holy grail for ZFS. Implementing it would allow for vdev expansion/shrinking, vdev addition/removal, RAIDZ level increase/decrease, and a partridge in a pear tree.
 
Joined
Apr 9, 2015
Messages
1,258
But if you were to delete data from the first X blocks, it would basically be doing the same thing, wouldn't it? The data is still there in reality, but the block pointers are removed once the data is marked for deletion. Under the same idea, the pool itself copies the data to another space, and once the data is safely stored, the original blocks are marked for deletion, which makes them available for reflow, and the process can begin.

There would have to be a minimum amount of free space, but in a well-thought-out, properly created pool that should not be a problem. That, or use a sort of "scratch" drive for use only during the rebuild: a pair of mirrored drives of a minimum size such that no single file is larger than the mirror.

I know my ideas of how things are set up are probably way off and have been gone over hundreds of times, but I do want to understand it all better.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
But if you were to delete data from the first X number of blocks it would basically be doing the same wouldn't it?
No. This work operates below block pointers. The data has to be in the same relative position (which is implicitly defined by the structure of the RAIDZ vdev). Block pointers are never touched during these operations.

BPR is a very difficult problem because the tree is not designed to be traversed from leaf to root. Since any number of block pointers can point at the same block, you have to find them all. That's nasty, as it means going through the whole pool. Several times, because you won't be able to fit all the metadata you need in RAM. And then, you have to update them all atomically (well, this part is easily fixed with a short log, so it's not the real problem). And you simultaneously have to figure out which block pointers point at the block pointers you changed and change them too, all the way up to the überblock.
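A toy illustration of the last point, that a change must propagate all the way up: in a Merkle-tree-like structure where each pointer embeds its child's checksum, relocating one leaf invalidates every ancestor pointer up to the root. This is illustrative Python, not the actual ZFS on-disk format:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:8]

class Node:
    def __init__(self, payload, children=()):
        self.payload = payload
        self.children = list(children)

    def blkptr(self):
        # A toy "block pointer": checksum over this node's payload
        # plus the pointers of all its children.
        child_ptrs = "".join(c.blkptr() for c in self.children)
        return checksum((self.payload + child_ptrs).encode())

leaf = Node("data@offset100")
indirect = Node("indirect", [leaf])
uber = Node("uberblock", [indirect])

before = uber.blkptr()
leaf.payload = "data@offset200"   # "move" the leaf to a new location
after = uber.blkptr()

# The root pointer changed, so every node on the path to the leaf
# would have to be rewritten too.
assert before != after
```

And that's only the downward path for one referrer; finding *all* block pointers that reference a given block still requires walking the whole pool, since the tree has no back-references.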

As a programmer, this is hell. As an engineer, this is nope.avi.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
As a programmer, this is hell. As an engineer, this is nope.avi.
But as a user, it is
[image]
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
You mean I just have to find the Castle of Arghhhhhhhh and convince the French to hand it over?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Might be easier than coding it, from what I'm hearing... But be careful you don't get confused by the Castle Anthrax.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I've always understood block pointer rewrite not to be a 'holy-grail' or feature. I've understood it to be "you are asking for block pointer rewrite, which when you think about it, is never going to be done because it is impossible, breaks everything, and would be useless on a pool larger than a floppy disk."
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Offline BPR ;)
 
Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I've always understood block pointer rewrite not to be a 'holy-grail' or feature. I've understood it to be "you are asking for block pointer rewrite, which when you think about it, is never going to be done because it is impossible, breaks everything, and would be useless on a pool larger than a floppy disk."
Well, BPR that works in quasi-linear time is the holy grail.
 