The problem with RAIDZ or why you probably won't get the storage efficiency you think you will get.
As a ZFS rookie, I struggled a fair bit to find out what settings I should use for my Proxmox hypervisor. To learn more about ZFS and to help other rookies, I wrote down this wall of text. Although it is more about using ZFS as a filesystem for Proxmox, I posted it here because I know that some people here in the forum are very outspoken and extremely knowledgeable when it comes to ZFS (@jgreco). I will write this as if it were established fact, when in reality it is more of a draft that I hope someone will proofread.
Before we start, we have to cover some ZFS vocabulary. These terms are important for understanding the examples later on.
sector size:
Older HDDs used a sector size of 512b, while newer HDDs have 4k sectors. SSDs can have even bigger internal pages, but their firmware controllers are mostly tuned for 4k sectors. There are also enterprise HDDs sold as 512e, where the "e" stands for emulation: these are not 512b drives but 4k drives that only emulate 512b sectors towards the host. For this whole text, I assume that we have drives with 4k sectors.
ashift:
ashift sets the sector size ZFS should use, as a power of 2: ashift=12 means 2^12 = 4096 bytes = 4k. ashift should match your drive's physical sector size. In most cases this will be 12, and ZFS usually detects it automatically.
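Just as a sketch (the pool name "tank" and the disk paths are placeholders, adjust them to your system), you can set ashift explicitly when creating a pool and verify it afterwards:

    # create a 3-disk RAIDZ1 pool with 4k sectors
    zpool create -o ashift=12 tank raidz1 /dev/sda /dev/sdb /dev/sdc
    # check which ashift the pool actually uses
    zpool get ashift tank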
dataset:
A dataset lives inside a pool and behaves like a file system. There can be multiple datasets in the same pool, and each dataset has its own settings like compression, dedup, quota, and many more. Datasets can also have child datasets that by default inherit the parent's settings. Datasets are useful for network shares or as mount points for local files. In Proxmox, datasets are mostly used locally for ISO images, container templates, and VZdump backup files.
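As a sketch (the dataset name and mountpoint are made up), creating a dataset with its own compression setting and mountpoint looks like this:

    zfs create -o compression=lz4 -o mountpoint=/mnt/iso tank/iso
    zfs get compression,mountpoint tank/iso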
zvol:
zvols, or ZFS volumes, also live inside a pool. Rather than being mounted as a file system, a zvol exposes a block device under /dev/zvol/poolname/dataset. This can be used to back the disks of virtual machines or to export block storage to other hosts via iSCSI. In Proxmox, zvols are mostly used for VM disk images and containers.
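A sketch of creating a zvol by hand (the 32G size, the name, and the 16k volblocksize are just example values; Proxmox normally creates these for you when you add a VM disk):

    zfs create -V 32G -o volblocksize=16k tank/vm-100-disk-0
    ls -l /dev/zvol/tank/vm-100-disk-0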
recordsize:
Recordsize applies to datasets. ZFS datasets use a recordsize of 128k by default. It can be set anywhere from 512b up to 16MB (the upper limit was 1MB before OpenZFS 2.2).
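Recordsize can be changed at any time, but it only affects newly written files. A sketch (the dataset name is a placeholder; 1M is a typical choice for large media files):

    zfs set recordsize=1M tank/media
    zfs get recordsize tank/media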
volblocksize:
Zvols have a volblocksize property that is analogous to recordsize.
Since OpenZFS 2.2 the default value is 16k, while Proxmox 8.1 still uses 8k as its default.
Now with that technical stuff out of the way, let's look at real-life examples.
First, let us look at datasets and their recordsize. Datasets are very different from zvols because recordsize only sets the biggest possible block size, while for a zvol the volblocksize sets every single block to exactly that size. So recordsize is dynamic while volblocksize is fixed.
The default recordsize is 128k. Bigger files are split up into 128k records; smaller files are stored as a single, smaller record.
Let's look at an example of a dataset with the default recordsize of 128k and how that would work. We assume that we write a file 128k in size (after compression).
For a 3-disk wide RAIDZ1, the total stripe width is 3.
One stripe has 2 data blocks and 1 parity block, each 4k in size.
So one stripe holds 8k of data and a 4k parity block.
To store a 128k file, we need 128k / 4k = 32 data blocks.
Because each stripe has two data blocks, we need 16 stripes for our 32 data blocks.
Each of these stripes has two 4k data blocks and a 4k parity block.
In total, we store 128k of data blocks (16 stripes * 8k of data)
and 64k of parity blocks (16 stripes * 4k of parity).
Data blocks + parity blocks = total blocks
128k + 64k = 192k.
That means we write 192k in blocks to store a 128k file.
128k / 192k = 66.66% storage efficiency.
This is a best-case scenario. Just like one would expect from a 3-wide RAID5 or RAIDZ1, you "lose" a third of storage.
Now, what happens if the file is smaller than the recordsize of 128k? A 20k file?
We calculate the same thing for our 20k file.
20k divided by 8k of data per stripe (2 data blocks, each 4k) = 2.5 stripes. Half stripes are impossible, so we need 3 stripes to store our data.
The first stripe holds 8k of data and a 4k parity block.
The second stripe holds 8k of data and a 4k parity block.
The third stripe is special.
The first two stripes already hold 16k of data, so we only have 4k of data left to save.
That is why the third stripe has just one 4k data block and a 4k parity block.
Now the efficiency has changed. Adding it all up, we wrote 20k of data blocks and 12k of parity blocks. (RAIDZ1 also pads every block's allocation up to a multiple of 2 sectors, but 5 data sectors + 3 parity sectors = 8 sectors is already a multiple of 2, so no padding block is needed here.)
We wrote 32k to store a 20k file.
20k / 32k = 62.5% storage efficiency.
This is not what you would intuitively expect. What happens if the situation gets even worse and we want to save a 4k file?
We calculate the same thing for a 4k file.
We simply store one 4k data block on one disk and one 4k parity block on another disk.
We wrote 8k in blocks to store a 4k file.
4k / 8k = 50% storage efficiency.
This is the same storage efficiency we would expect from a mirror.
That should explain the subtitle: "Why you probably won't get the storage efficiency you think you will get".
It doesn't apply to you if you have a 3-wide RAIDZ1 and only write files whose size is a multiple of 8k. And for huge files like pictures, movies, and songs, the loss from not being an exact multiple of 8k becomes smaller and smaller until it is negligible.
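If you want to check these numbers yourself, here is a small Python sketch of the allocation rule used in the examples above (data sectors, plus one parity sector per started row, padded up to a multiple of parity + 1). This is my own simplification with made-up names, not anything ZFS ships, and it only models RAIDZ, not mirrors:

    import math

    SECTOR = 4096  # ashift=12

    def raidz_alloc(psize, width, nparity=1):
        """Approximate on-disk allocation (in bytes) of one block on a RAIDZ vdev."""
        data = math.ceil(psize / SECTOR)              # data sectors
        parity = math.ceil(data / (width - nparity))  # one parity sector per row
        total = data + parity
        total += -total % (nparity + 1)               # RAIDZ padding
        return total * SECTOR

    for size in (128 * 1024, 20 * 1024, 4 * 1024):
        alloc = raidz_alloc(size, width=3)
        print(f"{size // 1024}k record -> {alloc // 1024}k on disk, "
              f"{size / alloc:.1%} efficiency")

This prints 192k / 66.7%, 32k / 62.5%, and 8k / 50.0%, matching the three examples.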
In Proxmox we mostly don't use datasets, though. We use VMs with raw disks that are stored on zvols.
For zvols and their fixed volblocksize, it gets more complicated.
As far as I understand it, in the early days the default volblocksize was 8k and it was recommended to turn off compression; this goes back to Solaris using 8k. Nowadays it is recommended to enable compression, and the default has been 16k since OpenZFS 2.2. Proxmox 8.1 still doesn't use that default, though. Some people in the forum recommend going as high as 64k on SSDs.
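In Proxmox, the volblocksize used for newly created zvols can be set per ZFS storage. If I read the storage configuration correctly, this is the blocksize option in /etc/pve/storage.cfg (storage name and pool are placeholders here), or the "Block Size" field in the GUI; it only affects disks created after the change:

    zfspool: local-zfs
            pool rpool/data
            content images,rootdir
            blocksize 16k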
In theory, you want writes that exactly match your volblocksize.
If the write is bigger than your volblocksize, you lose out on compression gains.
If the write is smaller than your volblocksize, you will have IO amplification, waste space (especially on RAIDZ), and will produce more fragmentation.
For the first part, let's assume that your volblocksize is the Proxmox default of 8k. Let us look at the different file sizes we want to write and how they behave on different pools.
First, we want to write a 4k file.
mirror: one 8k data block and one 8k mirrored copy. 16k total write to store 4k. 25% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 8k block is split into two 4k data sectors, plus one 4k parity sector and one 4k padding sector (RAIDZ1 rounds every allocation up to a multiple of 2 sectors). 16k total to store 4k. 25% storage efficiency (expected 66%).
RAIDZ1 4-wide: the same: two 4k data sectors, one 4k parity sector, one 4k padding sector. 16k total to store 4k. 25% storage efficiency (expected 75%).
Conclusion: This is not great and shows what happens if the volblocksize is bigger than the filesize we write. Storage efficiency is very bad, we get IO amplification and fragmentation.
Next, we want to write an 8k file.
mirror: one 8k data block and one 8k mirrored copy. 16k total write to store 8k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: two 4k data sectors, one 4k parity sector, and one 4k padding sector. 16k total to store 8k. 50% storage efficiency (expected 66%).
RAIDZ1 4-wide: again two 4k data sectors, one 4k parity sector, and one 4k padding sector. 16k total to store 8k. 50% storage efficiency (expected 75%).
Conclusion: For mirrors, this works perfectly. For RAIDZ1, even though the write matches the volblocksize, storage efficiency is still poor: a block this small never spans enough disks to amortize the parity, and the padding sector makes it worse.
Next, we want to write a 16k file.
mirror: 16k of data blocks and 16k of mirrored copies. 32k total write to store 16k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: because the volblocksize is 8k, the 16k write becomes two separate 8k blocks. Each block is allocated as two 4k data sectors, one 4k parity sector, and one 4k padding sector, i.e. 16k.
32k total to store 16k. 50% storage efficiency (expected 66.66%).
RAIDZ1 4-wide: exactly the same two 8k blocks; a block with only two data sectors still needs one parity sector and one padding sector, so again 16k per block.
32k total to store 16k. 50% storage efficiency (expected 75%).
Conclusion: Only the mirror behaves as expected. On RAIDZ1, parity and padding are paid per block, and 8k blocks lock you in at 50% no matter how wide the vdev is.
Next, we want to write a 128k file.
mirror: 128k of data blocks and 128k of mirrored copies. 256k total write to store 128k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 128k write becomes sixteen separate 8k blocks. Each block is again two 4k data sectors, one 4k parity sector, and one 4k padding sector, i.e. 16k.
16 blocks * 16k = 256k.
256k total to store 128k. 50% storage efficiency (expected 66.66%).
RAIDZ1 4-wide: the same sixteen 8k blocks, and each still carries its own parity and padding sector, so 16k per block.
256k total to store 128k. 50% storage efficiency (expected 75%).
Conclusion: Mirrors deliver exactly what you would expect. With an 8k volblocksize, RAIDZ1 is stuck at mirror-level efficiency (50%) no matter how large the write is, because every single 8k block carries its own parity and padding. This is the real problem with the small Proxmox default on RAIDZ.
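The same allocation rule as before, applied to a single 8k volblock, confirms this (again my own sketch, assuming ashift=12; width counts all disks in the RAIDZ1 vdev):

    import math

    SECTOR = 4096  # ashift=12

    for width in (3, 4):
        data = 8192 // SECTOR                    # 2 data sectors per 8k block
        parity = math.ceil(data / (width - 1))   # 1 parity sector
        total = data + parity
        total += -total % 2                      # RAIDZ1 pads to a multiple of 2 sectors
        print(f"{width}-wide RAIDZ1: one 8k block occupies {total * SECTOR // 1024}k, "
              f"{8192 / (total * SECTOR):.0%} efficiency")

Both widths print 16k and 50%.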
For the second part, let's assume that we change the Proxmox default volblocksize. Let's use a volblocksize of 64k, which some people (mostly SSD users) recommend in the forum.
First, we want to write a 4k file.
mirror: one 64k data block and one 64k mirrored copy. 128k written to store 4k of useful data. 3.1% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 64k block is split into sixteen 4k data sectors plus eight 4k parity sectors (one per row of two data sectors). 96k written to store 4k of useful data. 4.2% storage efficiency (expected 66%).
RAIDZ1 4-wide: sixteen 4k data sectors plus six 4k parity sectors. 88k written to store 4k of useful data. 4.5% storage efficiency (expected 75%).
Conclusion: This is bad and shows what happens if the volblocksize is bigger than the filesize we write. Storage efficiency is very bad, we get extreme IO amplification and fragmentation.
We skip the other sizes and use a 1024k file.
mirror: 1024k of data blocks and 1024k of mirrored copies. 2048k total write to store 1024k. 50% storage efficiency (expected 50%).
RAIDZ1 3-wide: the 1024k write becomes sixteen 64k blocks. Each 64k block is split into sixteen 4k data sectors plus eight 4k parity sectors, i.e. 96k.
16 blocks * 96k = 1536k for a 1024k file.
66.66% storage efficiency (expected 66.66%).
RAIDZ1 4-wide: again sixteen 64k blocks. The sixteen data sectors of each block are spread across three data disks, which takes six rows (five full rows of three sectors plus one row with a single sector), so each block needs six 4k parity sectors: 64k of data + 24k of parity = 88k.
16 blocks * 88k = 1408k for a 1024k file.
72.7% storage efficiency (expected 75%).
Conclusion: For mirrors and a 3-wide RAIDZ1, this works perfectly. The 4-wide RAIDZ1 geometry is still not optimal: 16 data sectors don't divide evenly across 3 data disks, so the last, nearly empty row still needs a full parity sector. The larger the volblocksize, the smaller this rounding loss becomes relative to the block, which is why big blocks bring RAIDZ close to (but rarely exactly at) the efficiency you expect.
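Again as a sketch with the same simplified rule, for a single 64k volblock:

    import math

    SECTOR = 4096  # ashift=12

    for width in (3, 4):
        data = 64 * 1024 // SECTOR               # 16 data sectors per 64k block
        parity = math.ceil(data / (width - 1))   # 8 or 6 parity sectors
        total = data + parity
        total += -total % 2                      # RAIDZ1 padding (none needed here)
        print(f"{width}-wide RAIDZ1: one 64k block occupies {total * SECTOR // 1024}k, "
              f"{64 * 1024 / (total * SECTOR):.1%} efficiency")

This prints 96k / 66.7% for 3-wide and 88k / 72.7% for 4-wide.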
Overall conclusion:
RAIDZ will almost never give you the storage efficiency you expect if you deal with block storage or smaller files. This is why Proxmox simply recommends using mirrors: you get better performance and will probably not lose that much more storage anyway. If you use any kind of block storage, go with mirrors.
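As a final sketch (device paths are placeholders), a pool of striped mirrors, which is roughly the ZFS equivalent of RAID10 and the usual recommendation for VM storage:

    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd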
It is already late and English is not my native language. I hope there are not too many errors in this.