The problem with RAIDZ or why you probably won't get the storage efficiency you think you will get.
As a ZFS rookie, I struggled a fair bit to find out what settings I should use for my Proxmox hypervisor. To learn more about ZFS and to help other rookies, I wrote down this wall of text. Although it is more about using ZFS as a filesystem for Proxmox, I posted it here because I know that some people here in the forum are very outspoken and extremely knowledgeable when it comes to ZFS (@jgreco). I will write this as if it were established fact, when in reality it is more of a draft that I hope someone will proofread.
Before we start, we have to cover some ZFS vocabulary. These terms are important for understanding the examples later on.
sector size:
Older HDDs used a sector size of 512b, while newer HDDs have 4k sectors. SSDs can have even bigger internal pages, but their firmware controllers are mostly tuned for 4k sectors. There are also enterprise HDDs sold as 512e, where the "e" stands for emulation: these are not 512b drives but 4k drives that only emulate 512b sectors towards the host. For this whole text, I assume that we have drives with 4k sectors.
ashift:
ashift sets the sector size ZFS should use, as a power of 2: ashift=12 means 2^12 = 4096 bytes = 4k. ashift should match your drive's physical sector size. In most cases this will be 12, and ZFS usually detects it automatically.
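Just as a sketch (the pool name "tank" and the disk paths are placeholders, adjust them to your system), you can set ashift explicitly when creating a pool and verify it afterwards:

    # create a 3-disk RAIDZ1 pool with 4k sectors
    zpool create -o ashift=12 tank raidz1 /dev/sda /dev/sdb /dev/sdc
    # check which ashift the pool actually uses
    zpool get ashift tank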
dataset:
A dataset lives inside a pool and behaves like a file system. There can be multiple datasets in the same pool, and each dataset has its own settings like compression, dedup, quota, and many more. Datasets can also have child datasets that by default inherit the parent's settings. Datasets are useful for network shares or as mount points for local files. In Proxmox, datasets are mostly used locally for ISO images, container templates, and VZdump backup files.
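As a sketch (the dataset name and mountpoint are made up), creating a dataset with its own compression setting and mountpoint looks like this:

    zfs create -o compression=lz4 -o mountpoint=/mnt/iso tank/iso
    zfs get compression,mountpoint tank/iso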
zvol:
zvols, or ZFS volumes, also live inside a pool. Rather than being mounted as a file system, a zvol exposes a block device under /dev/zvol/poolname/dataset. This can be used to back the disks of virtual machines or to export block storage to other hosts via iSCSI. In Proxmox, zvols are mostly used for VM disk images and containers.
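A sketch of creating a zvol by hand (the 32G size, the name, and the 16k volblocksize are just example values; Proxmox normally creates these for you when you add a VM disk):

    zfs create -V 32G -o volblocksize=16k tank/vm-100-disk-0
    ls -l /dev/zvol/tank/vm-100-disk-0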
recordsize:
Recordsize applies to datasets. ZFS datasets use a recordsize of 128k by default. It can be set anywhere from 512b up to 16MB (the upper limit was 1MB before OpenZFS 2.2).
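Recordsize can be changed at any time, but it only affects newly written files. A sketch (the dataset name is a placeholder; 1M is a typical choice for large media files):

    zfs set recordsize=1M tank/media
    zfs get recordsize tank/media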
volblocksize:
Zvols have a volblocksize property that is analogous to recordsize.
Since OpenZFS 2.2 the default value is 16k, while Proxmox 8.1 still uses 8k as its default.
Now with that technical stuff out of the way, let's look at real-life examples.
First, let us look at datasets and their recordsize. Datasets are very different from zvols because recordsize only sets the biggest possible block size, while for a zvol the volblocksize sets every single block to exactly that size. So recordsize is dynamic while volblocksize is fixed.
The default recordsize is 128k. Bigger files are split up into 128k records; smaller files are stored as a single, smaller record.
Let's look at an example of a dataset with the default recordsize of 128k and how that would work. We assume that we write a file 128k in size (after compression).
For a 3-disk wide RAIDZ1, the total stripe width is 3.
One stripe has 2 data blocks and 1 parity block, each 4k in size.
So one stripe holds 8k of data and a 4k parity block.
To store a 128k file, we need 128k / 4k = 32 data blocks.
Because each stripe has two data blocks, we need 16 stripes for our 32 data blocks.
Each of these stripes has two 4k data blocks and a 4k parity block.
In total, we store 128k of data blocks (16 stripes * 8k of data)
and 64k of parity blocks (16 stripes * 4k of parity).
Data blocks + parity blocks = total blocks
128k + 64k = 192k.
That means we write 192k in blocks to store a 128k file.
128k / 192k = 66.66% storage efficiency.
This is a best-case scenario. Just like one would expect from a 3-wide RAID5 or RAIDZ1, you "lose" a third of storage.
Now, what happens if the file is smaller than the recordsize of 128k? A 20k file?
We calculate the same thing for our 20k file.
20k divided by 8k of data per stripe (2 data blocks, each 4k) = 2.5 stripes. Half stripes are impossible, so we need 3 stripes to store our data.
The first stripe holds 8k of data and a 4k parity block.
The second stripe holds 8k of data and a 4k parity block.
The third stripe is special.
The first two stripes already hold 16k of data, so we only have 4k of data left to save.
That is why the third stripe has just one 4k data block and a 4k parity block.
Now the efficiency has changed. Adding it all up, we wrote 20k of data blocks and 12k of parity blocks. (RAIDZ1 also pads every block's allocation up to a multiple of 2 sectors, but 5 data sectors + 3 parity sectors = 8 sectors is already a multiple of 2, so no padding block is needed here.)
We wrote 32k to store a 20k file.
20k / 32k = 62.5% storage efficiency.
This is not what you would intuitively expect. What happens if the situation gets even worse and we want to save a 4k file?
We calculate the same thing for a 4k file.
We simply store one 4k data block on one disk and one 4k parity block on another disk.
We wrote 8k in blocks to store a 4k file.
4k / 8k = 50% storage efficiency.
This is the same storage efficiency we would expect from a mirror.
That should explain the subtitle: "Why you probably won't get the storage efficiency you think you will get".
It doesn't apply to you if you have a 3-wide RAIDZ1 and only write files whose size is a multiple of 8k. And for huge files like pictures, movies, and songs, the loss from not being an exact multiple of 8k becomes smaller and smaller until it is negligible.
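If you want to check these numbers yourself, here is a small Python sketch of the allocation rule used in the examples above (data sectors, plus one parity sector per started row, padded up to a multiple of parity + 1). This is my own simplification with made-up names, not anything ZFS ships, and it only models RAIDZ, not mirrors:

    import math

    SECTOR = 4096  # ashift=12

    def raidz_alloc(psize, width, nparity=1):
        """Approximate on-disk allocation (in bytes) of one block on a RAIDZ vdev."""
        data = math.ceil(psize / SECTOR)              # data sectors
        parity = math.ceil(data / (width - nparity))  # one parity sector per row
        total = data + parity
        total += -total % (nparity + 1)               # RAIDZ padding
        return total * SECTOR

    for size in (128 * 1024, 20 * 1024, 4 * 1024):
        alloc = raidz_alloc(size, width=3)
        print(f"{size // 1024}k record -> {alloc // 1024}k on disk, "
              f"{size / alloc:.1%} efficiency")

This prints 192k / 66.7%, 32k / 62.5%, and 8k / 50.0%, matching the three examples.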
In Proxmox we mostly don't use datasets, though. We use VMs with raw disks that are stored on zvols.
For zvols and their fixed volblocksize, it gets more complicated.
As far as I understand it, in the early days the default volblocksize was 8k and it was recommended to turn off compression; this goes back to Solaris using 8k. Nowadays it is recommended to enable compression, and the default has been 16k since OpenZFS 2.2. Proxmox 8.1 still doesn't use that default, though. Some people in the forum recommend going as high as 64k on SSDs.
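In Proxmox, the volblocksize used for newly created zvols can be set per ZFS storage. If I read the storage configuration correctly, this is the blocksize option in /etc/pve/storage.cfg (storage name and pool are placeholders here), or the "Block Size" field in the GUI; it only affects disks created after the change:

    zfspool: local-zfs
            pool rpool/data
            content images,rootdir
            blocksize 16k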
In theory, you want writes that exactly match your volblocksize.
If the write is bigger than your volblocksize, you lose out on compression gains.
If the write is smaller than your volblocksize, you will have IO amplification, waste space (especially on RAIDZ), and will produce more fragmentation.
For the first part, let's assume that your volblocksize is the Proxmox default of 8k. Let us look at the different file sizes we want to write and how they behave on different pools.
First, we want to write a 4k file.
mirror: one 8k data block and one 8k mirrored copy. 16k total write to store 4k. 25% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 8k block is split into two 4k data sectors, plus one 4k parity sector and one 4k padding sector (RAIDZ1 rounds every allocation up to a multiple of 2 sectors). 16k total to store 4k. 25% storage efficiency (expected 66%).
RAIDZ1 4-wide: the same: two 4k data sectors, one 4k parity sector, one 4k padding sector. 16k total to store 4k. 25% storage efficiency (expected 75%).
Conclusion: This is not great and shows what happens if the volblocksize is bigger than the filesize we write. Storage efficiency is very bad, we get IO amplification and fragmentation.
Next, we want to write an 8k file.
mirror: one 8k data block and one 8k mirrored copy. 16k total write to store 8k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: two 4k data sectors, one 4k parity sector, and one 4k padding sector. 16k total to store 8k. 50% storage efficiency (expected 66%).
RAIDZ1 4-wide: again two 4k data sectors, one 4k parity sector, and one 4k padding sector. 16k total to store 8k. 50% storage efficiency (expected 75%).
Conclusion: For mirrors, this works perfectly. For RAIDZ1, even though the write matches the volblocksize, storage efficiency is still poor: a block this small never spans enough disks to amortize the parity, and the padding sector makes it worse.
Next, we want to write a 16k file.
mirror: 16k of data blocks and 16k of mirrored copies. 32k total write to store 16k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: because the volblocksize is 8k, the 16k write becomes two separate 8k blocks. Each block is allocated as two 4k data sectors, one 4k parity sector, and one 4k padding sector, i.e. 16k.
32k total to store 16k. 50% storage efficiency (expected 66.66%).
RAIDZ1 4-wide: exactly the same two 8k blocks; a block with only two data sectors still needs one parity sector and one padding sector, so again 16k per block.
32k total to store 16k. 50% storage efficiency (expected 75%).
Conclusion: Only the mirror behaves as expected. On RAIDZ1, parity and padding are paid per block, and 8k blocks lock you in at 50% no matter how wide the vdev is.
Next, we want to write a 128k file.
mirror: 128k of data blocks and 128k of mirrored copies. 256k total write to store 128k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 128k write becomes sixteen separate 8k blocks. Each block is again two 4k data sectors, one 4k parity sector, and one 4k padding sector, i.e. 16k.
16 blocks * 16k = 256k.
256k total to store 128k. 50% storage efficiency (expected 66.66%).
RAIDZ1 4-wide: the same sixteen 8k blocks, and each still carries its own parity and padding sector, so 16k per block.
256k total to store 128k. 50% storage efficiency (expected 75%).
Conclusion: Mirrors deliver exactly what you would expect. With an 8k volblocksize, RAIDZ1 is stuck at mirror-level efficiency (50%) no matter how large the write is, because every single 8k block carries its own parity and padding. This is the real problem with the small Proxmox default on RAIDZ.
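The same allocation rule as before, applied to a single 8k volblock, confirms this (again my own sketch, assuming ashift=12; width counts all disks in the RAIDZ1 vdev):

    import math

    SECTOR = 4096  # ashift=12

    for width in (3, 4):
        data = 8192 // SECTOR                    # 2 data sectors per 8k block
        parity = math.ceil(data / (width - 1))   # 1 parity sector
        total = data + parity
        total += -total % 2                      # RAIDZ1 pads to a multiple of 2 sectors
        print(f"{width}-wide RAIDZ1: one 8k block occupies {total * SECTOR // 1024}k, "
              f"{8192 / (total * SECTOR):.0%} efficiency")

Both widths print 16k and 50%.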
For the second part, let's assume that we change the Proxmox default volblocksize. Let's use a volblocksize of 64k, which some people (mostly SSD users) recommend in the forum.
First, we want to write a 4k file.
mirror: one 64k data block and one 64k mirrored copy. 128k written to store 4k of useful data. 3.1% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 64k block is split into sixteen 4k data sectors plus eight 4k parity sectors (one per row of two data sectors). 96k written to store 4k of useful data. 4.2% storage efficiency (expected 66%).
RAIDZ1 4-wide: sixteen 4k data sectors plus six 4k parity sectors. 88k written to store 4k of useful data. 4.5% storage efficiency (expected 75%).
Conclusion: This is bad and shows what happens if the volblocksize is bigger than the filesize we write. Storage efficiency is very bad, we get extreme IO amplification and fragmentation.
We skip the other sizes and use a 1024k file.
mirror: 1024k of data blocks and 1024k of mirrored copies. 2048k total write to store 1024k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 1024k write becomes sixteen 64k blocks. Each 64k block is split into sixteen 4k data sectors plus eight 4k parity sectors, i.e. 96k.
16 blocks * 96k = 1536k for a 1024k file.
66.66% storage efficiency (expected 66.66%).
RAIDZ1 4-wide: again sixteen 64k blocks. The sixteen data sectors of each block are spread across three data disks, which takes six rows (five full rows of three sectors plus one row with a single sector), so each block needs six 4k parity sectors: 64k of data + 24k of parity = 88k.
16 blocks * 88k = 1408k for a 1024k file.
72.7% storage efficiency (expected 75%).
Conclusion: For mirrors and a 3-wide RAIDZ1, this works perfectly. The 4-wide RAIDZ1 geometry is still not optimal: 16 data sectors don't divide evenly across 3 data disks, so the last, nearly empty row still needs a full parity sector. The larger the volblocksize, the smaller this rounding loss becomes relative to the block, which is why big blocks bring RAIDZ close to (but rarely exactly at) the efficiency you expect.
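Again as a sketch with the same simplified rule, for a single 64k volblock:

    import math

    SECTOR = 4096  # ashift=12

    for width in (3, 4):
        data = 64 * 1024 // SECTOR               # 16 data sectors per 64k block
        parity = math.ceil(data / (width - 1))   # 8 or 6 parity sectors
        total = data + parity
        total += -total % 2                      # RAIDZ1 padding (none needed here)
        print(f"{width}-wide RAIDZ1: one 64k block occupies {total * SECTOR // 1024}k, "
              f"{64 * 1024 / (total * SECTOR):.1%} efficiency")

This prints 96k / 66.7% for 3-wide and 88k / 72.7% for 4-wide.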
Overall conclusion:
RAIDZ will almost never give you the storage efficiency you expect if you deal with block storage or smaller files. This is why Proxmox simply recommends using mirrors: you get better performance and will probably not lose that much more storage anyway. If you use any kind of block storage, go with mirrors.
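As a final sketch (device paths are placeholders), a pool of striped mirrors, which is roughly the ZFS equivalent of RAID10 and the usual recommendation for VM storage:

    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd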
It is already late and English is not my native language. I hope there are not too many errors in this.