ZFS device removal - what about pools with RAIDZ vdevs?

Status
Not open for further replies.

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Perhaps not worthless, but a first step.
I hope so, but after some further thought about the last presentation I watched, I'm suspecting that they'd need to use a completely different technique to make it work at all. The current technique, as I understand it, is pretty much:
  • Use the spacemap to determine which blocks on the disk(s) to remove are used
  • Copy those blocks to remaining devices in the pool
  • Update a reference table pointing the old locations to the new
This is relatively fast (since the spacemap is small) and easy. It also ignores checksums and doesn't touch the block pointers (which contain the checksums). But since it doesn't actually process the data, there's no way to account for needing to compute and write parity. And, without going through the block pointers, I don't think there would be any way to know where the logical stripes begin and end.
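If you want to watch that mapping mechanism on a layout where removal is actually supported, something like the following should do it in a throwaway VM (device names are placeholders; treat it as a sketch, not a recipe):
Code:
# throwaway pool made of two single-disk top-level vdevs (placeholder devices)
zpool create testpool da1 da2
# allowed here, since no top-level vdev is raidz and the sector sizes match
zpool remove testpool da2
# the removed vdev should now show up as an "indirect-N" entry, backed by the
# old-location -> new-location mapping described above
zpool status -v testpool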

I don't know anywhere near enough about ZFS internals to say with any certainty, but it sounds like a very different task.
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
So I set up a VM with 11.2-BETA2 and created a RAIDZ1 pool with three disks:
Code:
root@freenas:~ # zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

   NAME											STATE	 READ WRITE CKSUM
   tank											ONLINE	   0	 0	 0
	 raidz1-0									  ONLINE	   0	 0	 0
	   gptid/7bfd1e52-a075-11e8-b077-6173bfff8520  ONLINE	   0	 0	 0
	   gptid/7f616dbe-a075-11e8-b077-6173bfff8520  ONLINE	   0	 0	 0
	   gptid/8383057d-a075-11e8-b077-6173bfff8520  ONLINE	   0	 0	 0

errors: No known data errors
root@freenas:~ # zpool list
NAME		   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  15.9G  2.24G  13.6G		-		 -	  -	14%  1.00x  ONLINE  -
tank		  89.5G   551M  89.0G		-		 -	 0%	 0%  1.00x  ONLINE  /mnt
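
For anyone following along at home, building that pool boils down to roughly the following from the CLI (the gptid paths are just this VM's partition labels; yours will differ, and the FreeNAS GUI adds a few options of its own):
Code:
zpool create tank raidz1 \
    gptid/7bfd1e52-a075-11e8-b077-6173bfff8520 \
    gptid/7f616dbe-a075-11e8-b077-6173bfff8520 \
    gptid/8383057d-a075-11e8-b077-6173bfff8520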


Decided to hate my data and stripe in a fourth disk, bypassing the warning I posted above:
Code:
root@freenas:~ # zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

   NAME											STATE	 READ WRITE CKSUM
   tank											ONLINE	   0	 0	 0
	 raidz1-0									  ONLINE	   0	 0	 0
	   gptid/7bfd1e52-a075-11e8-b077-6173bfff8520  ONLINE	   0	 0	 0
	   gptid/7f616dbe-a075-11e8-b077-6173bfff8520  ONLINE	   0	 0	 0
	   gptid/8383057d-a075-11e8-b077-6173bfff8520  ONLINE	   0	 0	 0
	 gptid/fc0246fa-a075-11e8-b077-6173bfff8520	ONLINE	   0	 0	 0

errors: No known data errors
root@freenas:~ #
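
For the record, that step is just a forced zpool add of the new disk as its own top-level vdev, along these lines (-f is what pushes past the mismatched-replication-level warning):
Code:
zpool add -f tank gptid/fc0246fa-a075-11e8-b077-6173bfff8520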

Oh noes, there's data on that striped disk!
Code:
root@freenas:~ # zpool list -v
NAME									 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
freenas-boot							15.9G  2.24G  13.6G		-		 -	  -	14%  1.00x  ONLINE  -
  da0p2								 15.9G  2.24G  13.6G		-		 -	  -	14%
tank									 119G  14.3G   105G		-		 -	 0%	12%  1.00x  ONLINE  /mnt
  raidz1								89.5G  13.4G  76.1G		-		 -	 0%	14%
	gptid/7bfd1e52-a075-11e8-b077-6173bfff8520	  -	  -	  -		-		 -	  -	  -
	gptid/7f616dbe-a075-11e8-b077-6173bfff8520	  -	  -	  -		-		 -	  -	  -
	gptid/8383057d-a075-11e8-b077-6173bfff8520	  -	  -	  -		-		 -	  -	  -
  gptid/fc0246fa-a075-11e8-b077-6173bfff8520  29.5G   981M  28.5G		-		 -	 0%	 3%
root@freenas:~ #

So, the moment of truth:
Code:
root@freenas:~ # zpool remove tank gptid/fc0246fa-a075-11e8-b077-6173bfff8520
cannot remove gptid/fc0246fa-a075-11e8-b077-6173bfff8520: invalid config; all top-level vdevs must have the same sector size and not be raidz.


So, confirmed: device removal is pretty much worthless.
Did you try rebooting before kicking off the device removal? I ran into the same issue until I rebooted my test VM.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Did you try rebooting before kicking off the device removal?
I didn't, but a reboot doesn't change the outcome--I get the same error message.
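If anyone wants to rule out a missing feature flag while testing, something like this should confirm the pool supports removal at all, so the complaint really is just the raidz restriction:
Code:
# "enabled" or "active" here means the pool has the device_removal feature;
# the error above is then purely about the raidz / sector-size restrictions
zpool get feature@device_removal tank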
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The weird thing about this device removal is that once all the data has been migrated and is being accessed through indirect pointers, it can be remapped.
If I understand it correctly, the command is zpool remap POOL. That causes the indirect pointer table (which is static and kept in memory at all times) to shrink, and the blocks become "concrete" blocks again (normal, non-indirected blocks). Don't quote me on that; I don't clearly understand it and am just parroting what I vaguely remember.
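Something like the following is what I have in mind, though double-check the man pages on whatever build you're running; I may well be mixing up whether it hangs off zpool or zfs, and the dataset name is just a placeholder:
Code:
# sketch only -- verify against your build; the subcommand may be spelled
# 'zfs remap <dataset>' rather than 'zpool remap <pool>'
zpool status tank        # removed vdevs show up as indirect-N entries
zfs remap tank/mydata    # rewrite references so they stop pointing at the
                         # indirect mapping (dataset name is a placeholder)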

As for removing a single disk from a RAID-Zx pool, I can see that it would have to re-write the data using whatever parity level is in the remaining vDevs.

However, I just thought of something. (Oh, I know, I'm not supposed to think!) Sun Microsystems added a ZFS Pool feature called RAID-Z/mirror hybrid allocator in pool version 29. What if we added a similar feature, (maybe not pool version 29 compatible...), and used it for removing the single disks?

Meaning for RAID-Z1 we would have to keep 2 copies of the blocks that used to live on the single disk, 3 copies for RAID-Z2, and 4 copies for RAID-Z3. That would meet the redundancy requirements, and we could still remove single disks. It's not perfect, but it could be considered phase 2. Plus, we would get the neat RAID-Z/mirror hybrid allocator feature out of the work; we would simply be abusing it for single-disk removals.

Now that I have written that, I am wondering if I should open a (two-part) OpenZFS feature request on the issue.
Concurrence?
Or am I completely off my rocker? (And I don't even own a rocker anymore!)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Perhaps not worthless, but a first step.

This coming weekend I may play with ZFS on Linux master again. The list below says device removal, pool checkpoint, and encryption are all in ZoL master (available from Git):

https://soluble.zgrep.org/zfs.html

I'd hoped that D-RAID would have made it, but the pull request is still working its way through testing and approvals. It does seem to be getting closer. One issue was that they had to fix a bunch of quirks and bugs to make a new top-level vDev type possible. Another issue was the pool / vDev creation method (not requiring an external program or file was highly desired). But D-RAID can be another thread :).

Oh, I thought DRAID had stagnated; good to see work on it continuing.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The weird thing about this device removal is that once all the data has been migrated and is being accessed through indirect pointers, it can be remapped.
....
Meaning for RAID-Z1 we would have to keep 2 copies of the blocks that used to live on the single disk, 3 copies for RAID-Z2, and 4 copies for RAID-Z3.
....
Or am I completely off my rocker? (And I don't even own a rocker anymore!)
Removing RAIDZ disks seems of limited usefulness. It'd be better to work on remapping mirrors to RAIDZ vdevs. Just a matter of priorities.

Device removal is basically old Delphix internal work that finally got upstreamed. Their major driver was "our customers want to move from n mirrors to m mirrors to downsize their storage or because disks have grown, with n > m", which is a weird thing that few people actually ended up using.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Removing RAIDZ disks seems of limited usefulness. It'd be better to work on remapping mirrors to RAIDZ vdevs. Just a matter of priorities.
....
No, what I meant was removing a single-disk vDev (aka a striped disk) from a pool consisting of RAID-Zx vDev(s). The exact issue we were discussing.

And what I suggested would also work for remapping mirror vDev(s) to RAID-Zx vDevs: just ignore one side of the mirror (unless there's a read error) until it's done, then drop the mirror vDev. We would not want to keep only a 2-disk mirror's level of redundancy on a RAID-Z2 or RAID-Z3 vDev, so the data would get 3 or 4 copies.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
No, what I meant was removing single disk vDev, (aka striped disk), from a pool consisting of RAID-Zx vDev(s).
Ah, misunderstood you. Yeah, that'd be awesome just to shut up the "btrfs is so much better because it supposedly does this" crowd.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Ah, misunderstood you. Yeah, that'd be awesome just to shut up the "btrfs is so much better because it supposedly does this" crowd.
Oh, it's easy to shut the BTRFS people up (and I used to like and use it!). Simply remind them about RAID-5/6. There are so many problems:
  • BTRFS RAID-5/6 code is still not usable (after more than 4 years of work)
  • BTRFS RAID-5/6 purposely does not checksum the parity. Not a problem if the parity is good, but when you have bad parity and need it, bye-bye data, and you won't even know it. (It's rare as *ell, but ZFS won't make that mistake!)
  • The RAID-5/6 write hole is still not completely resolved
  • No real thought on how to deal with pools of large numbers of disks. ZFS has the concept of vDevs... BTRFS nope
  • RAID-7 / triple parity isn't even on the roadmap yet. (Someone suggested a new scheme for 1 to 6 disks of parity; insane!)
And last, the BTRFS developers seem to be wandering off into a wonderland of features, like being able to compress and/or de-duplicate on a file-by-file basis. Not to mention the potential for data loss on file renames, which are not done copy-on-write, so a crash at the wrong time may make your file go bye-bye.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The wiki now sounds less dire than it used to, but I expect that to be 20% real improvement, 80% marketing.
BTRFS nope
Wow, btrfs is the gift that keeps on giving. I've never learned a new thing about btrfs that is good, only bad.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Simply remind them about RAID-5/6.
Even RAID1 is only stable when both disks are online--which kind of defeats the purpose of RAID1.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
Wow, btrfs is the gift that keeps on giving. I've never learned a new thing about btrfs that is good, only bad.
Well, here are two good things:
  1. When specifying a root FS on BTRFS, you used to have to refer to it by its volume number, which was annoying as hell. For updates, I would create an alternate boot environment from a root FS snapshot, then have to look up the new snapshot's volume number, then modify BOTH the old and the new /etc/fstab (and the GRUB configuration) before I could boot off the new ABE's root file system. After a year or so of that, BTRFS gained the ability to refer to the snapshot by name instead. That let me decide its name BEFORE I took the snapshot, so I only had to modify /etc/fstab once, for the ABE section.
  2. Originally I used single disks in my laptop, miniature media server, and desktop. BTRFS metadata is duplicated even on a single disk, but it took the developers several years to offer the same duplication for data. They eventually did, and BTRFS even lets you convert data from single to duplicate online, or back to single. It's similar to ZFS's copies=2, but changeable on-line. (Rough command sketches below.)
By the time this last feature came out I had multiple backup schemes (NAS, tape, and removable portable drives), so I was less worried, and not yet trusting of the new feature. Within 2 years I had started using ZFS on Linux.
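Anyway, roughly what those two things look like in practice; every name, device, and path below is a made-up placeholder:
Code:
# 1. mount a root snapshot/subvolume by name instead of by numeric subvolid
mount -o subvol=abe-root-20180815 /dev/sda2 /mnt/newroot
# the matching /etc/fstab line uses the same option, e.g.:
#   UUID=<fs-uuid>  /  btrfs  subvol=abe-root-20180815  0  0

# 2. convert existing data on a single-disk filesystem to the duplicated
#    profile, online (or back to single again)
btrfs balance start -dconvert=dup /mnt/newroot
btrfs balance start -dconvert=single /mnt/newroot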
Even RAID1 is only stable when both disks are online--which kind of defeats the purpose of RAID1.
Yes, it's absolutely asinine that you can't reliably boot a RAID-1 Mirror with 1 bad disk. Exactly, that defeats the purpose.
 