How do you deal with the RAID-Z tax?

Status
Not open for further replies.

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
Hey folks,

I'm going to run through a quick story about our issue with the RAID-Z tax and how we happened upon it. I don't actually need help with the issue, but what I'm curious about is how I could have been dealing with ZFS for almost a decade now and never have run across this. I'm wondering if people just don't use RAID-Z for deployments anymore. Does everyone just use mirrors these days?

If you don't know what the RAID-Z tax is, read the story below. If you know what the RAID-Z tax is, I'm curious to know how you're dealing with it. Did you opt to go with mirrors? Are you just eating the additional storage cost? Are you just able to find a reliable source of 512-byte-sector hard drives?

###Story time!###

We're having an issue with our setup. We run VMs in zvols on ZFS on Linux. We then replicate them over to a FreeNAS box for backup. Since the FreeNAS box isn't all about performance, it's rocking an 8 disk RAID-Z2. The problem we're running into is that the space used on FreeNAS is more than twice that on the Linux hypervisor. So, I did a little digging and found this gem.

https://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol

Apparently there is some kind of tax associated with RAID-Z, which gets worse as you add disks, and depending upon whether you have standard format (512-byte sector) or advanced format (4096-byte sector) drives.
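For anyone who wants to see where the tax comes from, here's a rough sketch of the per-block math (my own reimplementation of the commonly cited RAID-Z allocation formula, not the actual allocator): data sectors, plus parity sectors per stripe, with the total rounded up to a multiple of parity+1 so the allocator never strands an unusably small gap. That round-up is the padding.

```python
import math

def raidz_alloc(block_bytes, width, parity, ashift=12):
    """Sectors allocated for one block on a RAID-Z vdev (a sketch of
    the allocator's arithmetic, not the allocator itself).

    Each stripe holds (width - parity) data sectors plus `parity`
    parity sectors; the total is rounded up to a multiple of
    (parity + 1) -- that round-up is the padding.
    """
    sector = 1 << ashift                            # 4K when ashift=12
    data = math.ceil(block_bytes / sector)          # data sectors
    stripes = math.ceil(data / (width - parity))
    total = data + parity * stripes                 # data + parity
    mult = parity + 1
    return math.ceil(total / mult) * mult           # + padding

# 8-wide RAID-Z2 with 4K sectors and 8K zvol blocks: each block is
# 2 data sectors but occupies 6 sectors on disk (3x the data size).
print(raidz_alloc(8 * 1024, width=8, parity=2))   # -> 6
```

With those numbers, an 8K zvol on an 8-wide RAID-Z2 eats three times its logical size, which lines up with what we're seeing on the backup box.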

###End Story Time!###

So Dan, the person who wrote that wonderful answer on Server Fault, suggests the following:

1. Don't use 4k drives
2. Use zvols with the volblocksize >=32k
3. Prefer stripes of mirrors over RAID-Z

Personally, I'm not aware of a way to reliably purchase 512 byte drives, so option 1 is out for me. Also, as I understand it, we will eventually be rid of 512 byte drives.

Option 2 is definitely worth investigating.

Option 3 doesn't buy us a whole lot. We're using an 8 disk RAID-Z2, which has us losing just over half the space to the tax. If we went with a pool of mirrors, we'd still be losing about half the space to the mirror copies, and we'd be giving up some redundancy.

So yeah, that's the long long version of me just trying to figure out how people around here are dealing with this RAID-Z tax.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Yeah. #3 is the only option if you're running VMware; you're limited to 512-byte sectors (may have changed with 6.5). You mentioned running your hypervisor on Linux, what is it? Just curious.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
This is silly alarmism.

RAIDZ requires padding in some cases. Unless you're dealing with a pathological workload, it's not particularly significant. If you have a very well-defined workload with a static block size and don't use compression for some mysterious reason, size your vdevs to avoid the padding.

We're using an 8 disk RAID-Z2, which puts us losing just over half the space due to the tax.
What? Have you actually tried it? 50% of storage lost to overhead is beyond insane and really demands more proof than some random numbers on Server Fault.
 

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
kdragon. I'm using Proxmox as the hypervisor. Running ZFS on Linux so I can just replicate snapshots over the interwebs.

Eric. Stand by for more proof. If this were just a random answer I stumbled across in Stack, I wouldn't make a whole new post. I'm seeing this problem reliably across multiple systems, and the numbers I'm seeing line up pretty well with the numbers in that Stack answer. I'll get you some numbers as soon as I can. Might be tomorrow though.

How do you size your vdevs to avoid padding, though? I thought FreeNAS did this automatically, and that's part of why there are additional partitions when a disk becomes a member of a vdev.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I thought FreeNAS did this automatically, and that's part of why there are additional partitions when a disk becomes a member of a vdev.
It's not something you automate, it's a function of the vdev itself. For typical power of two block sizes, you want vdevs that are 2^n+p wide, where n is an integer and p is the parity level.
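To see what that width rule does in practice, here's a hypothetical sweep over RAID-Z2 widths for 128K blocks on 4K-sector disks, using the usual back-of-the-envelope allocation formula (data sectors, plus parity per stripe, rounded up to a multiple of parity+1):

```python
import math

def raidz_alloc(block_bytes, width, parity, ashift=12):
    # Sketch of RAID-Z allocation: data sectors, parity per stripe,
    # total rounded up to a multiple of (parity + 1) -- the padding.
    sector = 1 << ashift
    data = math.ceil(block_bytes / sector)
    total = data + parity * math.ceil(data / (width - parity))
    mult = parity + 1
    return math.ceil(total / mult) * mult

# 128K blocks on RAID-Z2 (p=2) with ashift=12: 32 data sectors.
for width in range(4, 13):
    alloc = raidz_alloc(128 * 1024, width, 2)
    print(f"width {width:2}: {alloc} sectors for 32 of data "
          f"({32 / alloc:.0%} usable)")
```

A 6-wide RAID-Z2 (2^2 + 2) lands exactly on the ideal 4/6 data-to-total ratio with zero padding for 128K blocks, while an 8-wide vdev allocates 45 sectors where the parity-only ideal would be about 43.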
 

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
Do you have a link for further reading on that? The only thing I've read on avoiding padding has been in regards to how each individual disk is partitioned. I'd like to read more on this to find out if it's something that we're running into. I think I understand what you're talking about, but I want to be sure.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
Ah yes, I've read that before. Fun stuff. We're basically following the advice in the last paragraph. We're using small block sizes, 8k with lz4 compression. On the source zvols, the data is on a 4 disk RAIDZ2. It replicates to our backup FreeNAS boxen, an 8 disk RAIDZ2, and takes up significantly more space. More than double what's actually there. We can replicate it to a 14 disk RAIDZ2, and it takes up yet more space. Replicate it to a single drive, and suddenly it's back to representing the actual size of the data. In our most extreme example, we're seeing a 450GB virtual disk grow to something like 1.8TB.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Oh yeah, 8KB blocks are going to be a disaster with 4k sectors. If you cannot avoid them, you're better off going with triple mirrors (or two-way mirrors, if the extra reliability isn't needed).

My train of thought was that any workload with such a block size would be way too slow on RAIDZ and need mirrors anyway. I forgot about the backup to RAIDZ use case...
 

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
Drat, I was hoping I might have missed something. On the one hand, having an accurate reading of storage space is important. On the other hand, with RAIDZ2 we're getting somewhere around 45ish% of our actual storage space. Going to mirrors we'd get like 38%. I was hoping there'd be a better option. I can't imagine that people building larger scale storage systems with ZFS would find this space loss acceptable.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What's feeding you 8KB blocks, anyway?
 

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
Virtual machines. We've got a couple of FreeNAS boxen that operate as pretty typical NAS devices with file shares, but they don't occupy a whole heck of a lot of space one way or the other. They're also 128k datasets, rather than 8k zvols.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
All things considered, you might end up gaining space with larger blocks. That's something to look at. Even 16KB blocks would reduce the overhead (parity+padding) from 67% to 33%.
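Plugging 8K vs 16K into the same sort of allocation arithmetic (a sketch, assuming an 8-wide RAID-Z2 at ashift=12: data sectors, plus 2 parity sectors per stripe, rounded up to a multiple of 3) bears those figures out:

```python
import math

def overhead(block_bytes, width=8, parity=2, ashift=12):
    # Fraction of a block's allocation lost to parity + padding
    # on RAID-Z (back-of-the-envelope, not the real allocator).
    sector = 1 << ashift
    data = math.ceil(block_bytes / sector)
    total = data + parity * math.ceil(data / (width - parity))
    total = math.ceil(total / (parity + 1)) * (parity + 1)
    return (total - data) / total

print(overhead(8 * 1024))    # 8K blocks:  2 of 6 sectors are data -> ~0.67 lost
print(overhead(16 * 1024))   # 16K blocks: 4 of 6 sectors are data -> ~0.33 lost
```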
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Do you have a link for further reading on that? The only thing I've read on avoiding padding has been in regards to how each individual disk is partitioned. I'd like to read more on this to find out if it's something that we're running into. I think I understand what you're talking about, but I want to be sure.

@Ericloewe already answered but something you may find interesting: https://forums.freenas.org/index.php?threads/misaligned-pools-and-lost-space.40288/
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Personally, I'm not aware of a way to reliably purchase 512 byte drives, so option 1 is out for me. Also, as I understand it, we will eventually be rid of 512 byte drives.
You have to order drives that are marked as 512n (512 native) like these from Seagate:
https://www.newegg.com/Product/Product.aspx?item=N82E16822179049
It may be true that they go away in the future, but they are readily available now. Many applications like yours exist.

The 'tax' you are talking about is not because of RAIDz, it is because of the structure of your data. I have a storage server where I work that allows me to cram 265TB of data into 110TB of physical drive space using RAIDz2 vdevs that are 15 drives wide.
The problem is the kind of data you are storing and the way you are storing it.
We're using small block sizes, 8k with lz4 compression.
Are you sure that you need to do that?
 

praecorloth

Contributor
Joined
Jun 2, 2011
Messages
159
Are you sure that you need to do that?

Well, we definitely need to do that on the hypervisor side for performance. I suppose I could try and catch the replication when it starts for the very first time, immediately after it's created the zvol on the receiving side, and set the block size to something larger. Or were you talking about the compression?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Well, we definitely need to do that on the hypervisor side for performance. I suppose I could try and catch the replication when it starts for the very first time, immediately after it's created the zvol on the receiving side, and set the block size to something larger. Or were you talking about the compression?
Block size.
 