RAIDz (single vdev) write speeds are NOT limited to single drive write speed!

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
Good morning.

This really shouldn't be new information to anyone, but I see this incorrectly claimed here on a daily basis. Streaming writes are not limited to single-drive performance. I believe the misunderstanding stems from users mixing up streaming read/write with IOPS, which (for a single vdev) are limited to roughly single-drive performance.

Streaming read/write throughput is limited to roughly (N - p) times single-drive throughput, where N = total number of drives in the vdev and p = number of parity drives.

Example:
I have a RAIDz2 pool consisting of one vdev with 10x 10TB WD100EMAZ drives. Assuming a performance of 250 IOPS and ~180 MB/s per drive, the theoretical maximum performance is:

Stream read: 1440 MB/s
Stream write: 1440 MB/s
IOPS: 250
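
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of that arithmetic (the per-drive figures are the assumed values above, not measurements, and real-world results will be lower because of the bottlenecks mentioned below):

Code:
# Back-of-the-envelope RAIDZ vdev estimate. Purely the arithmetic from the
# pool performance blog post linked below; not a benchmark.

def raidz_theoretical(n_drives, parity, drive_mbps, drive_iops):
    """Streaming throughput scales with the data drives (N - p);
    random IOPS of a single vdev stay roughly at single-drive level."""
    data_drives = n_drives - parity
    return {
        "stream_read_MBps": data_drives * drive_mbps,
        "stream_write_MBps": data_drives * drive_mbps,
        "iops": drive_iops,
    }

# 10x WD100EMAZ in RAIDZ2, assuming ~180 MB/s and 250 IOPS per drive:
print(raidz_theoretical(n_drives=10, parity=2, drive_mbps=180, drive_iops=250))
# {'stream_read_MBps': 1440, 'stream_write_MBps': 1440, 'iops': 250}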

There are of course other bottlenecks, but on the system in question, a Xeon E5-2650v2 (2.6 GHz base, 3.3 GHz boost) with 64 GB RAM, no cache drive, and both encryption and compression enabled on the pool, I see sustained writes of several hundred gigabytes at over 600 MB/s over SMB (about 260 GB spread across 13 large video files).

Again, this is not new information, but a massive number of users still incorrectly claim otherwise. Please stop spreading misinformation.

Source: https://www.ixsystems.com/blog/zfs-pool-performance-2/
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Nice try but if you're going to say "stop spreading misinformation", then so am I.

Here are some facts.

ZFS write speeds will generally be fast as long as there are large contiguous ranges of free space. You also need sufficient memory to hold the metadata ZFS needs to find the free space. It doesn't matter whether you are writing sequential data or random data; ZFS will do a good job of writing at speeds that can easily approach what the sum of the underlying devices can handle. So you are partially correct.

The problem comes about when there's fragmentation. Then, something interesting happens. The maximum possible write speed falls. When ZFS has to start working to find free space regions, and the drives have to seek in order to reach them, this is a performance killer. In fact, you can create pathological situations where the maximum write speed falls to a very small fraction of what your claimed "stream" numbers are; I'd put good money on being able to get to 1/100th of those speeds. And this happens whether you are writing sequential or random data, because at the end of the day, ZFS doesn't care too much what is in a transaction group.

A RAIDZ vdev tends to adopt the IOPS characteristics of its component devices, and more specifically, the slowest component device.

Don't make the mistake of judging ZFS performance on a fresh pool. The real torture test is on a well-used aged pool, and then it gets really interesting to see how performance is affected.
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
Thank you for the extra insight. I have previously read this post from you (among others). But we can at least put to rest the claim that ZFS write speeds will never go beyond a single drive's performance. :)

That being said, I am curious about the testing methodology in the post where you go into detail about how write speeds are affected by this. You wrote: "Our friends at Delphix did an analysis of steady state performance of ZFS write speeds vs pool occupancy on a single-drive pool and came up with this graph:" (I am unable to link to the image from this computer). There we can see that performance drops from 6 MB/s at 10% pool occupancy down to around 1 MB/s at 50%, or 6 times slower!

Again, my pool is fairly fresh (built around 6 months ago), and is almost exactly 50% filled, and I still see sustained write speeds of 600+MB/s, or close to half of my theoretical maximum speed (N-p). This test was done, as stated in my original post, with 13 large media files totaling 260GB, far beyond my RAM size (64GB), where, I believe, only 1/8th gets used for write cache as standard.

Why do you think that is? Is it possible that Delphix's testing with only a single-drive pool, with extremely poor performance to begin with (6 MB/s?), doesn't give accurate results compared to real-world setups?

Edit: I am on a crappy work client running an old version of Internet Explorer. Is this the article you are referring to?: https://www.delphix.com/blog/delphix-engineering/zfs-write-performance-impact-fragmentation

I am unable to get the graphs to load, but if it is, that article is 7 years old; is it possible that ZFS performance has improved since then? It would be interesting to do some testing of my own, but I don't have a spare hard drive lying around.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
No, this is a compsci thing, not a Delphix-made-an-error thing, not a ZFS-performance-could-have-been-fixed thing.

ZFS is essentially trading resources for speed.

The Delphix graph was generated using a single-drive pool. This really has nothing to do with much of anything except that it made it possible to produce the graph in a reasonable amount of time. It was intended to measure a thing that is already known about ZFS and other CoW filesystems.

Your pool is fast because it is only lightly fragmented and has large amounts of contiguous free space.

The ultimate test of ZFS is not this condition. It is the far side. When the pool is old. When the pool is as fragmented as it is likely to get. Once performance has degraded as far as it is going to, you reach something we call "steady state", the state where performance has settled at a level where it is no longer degrading, and only varying by a little bit over time.

The Delphix graph shows steady state performance at various pool capacities.

ZFS goes fast when it can, for example, allocate a 1MB block and write it into contiguous space.

However, if you have a pool that is 90% full and has been seeing lots of updates for years, you run into a situation where you have little blocks of free space spread out sorta-evenly throughout the pool. Writing a large file then requires ZFS to allocate lots of little noncontiguous blocks all over the place and seek to them.

Those are the extremes.

ZFS can attain SSD-like write speeds for a HDD pool when the pool is relatively empty. The reason for this is that large contiguous runs of free space are allocated for the transaction group, so when the txg is laid down, it's a largely sequential write. It doesn't matter if you *think* you are writing sequential data (like a media file) or random data (like a database or VM block store). ZFS is a copy-on-write filesystem so any filesystem write is never overwriting existing data in its existing location, and is always allocating free space somewhere else. Note that this implies fragmentation increasing, AND that reads of data that you might think would be sequentially stored on disk often aren't. ZFS relies on ARC and L2ARC to sort out the fragmentation performance issues.
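
To make the seek-bound failure mode concrete, here is a toy Python model with made-up numbers; it is nothing like ZFS's real allocator, but it shows how, once free space only exists as small scattered holes, a single 1 MiB write turns into dozens of separate extent writes, and on a HDD each extra extent costs roughly a seek.

Code:
# Toy model: how many separate free-space extents (~ head seeks on a HDD)
# does it take to place one 1 MiB write? The "pool" is just a list of
# free-extent sizes; this is an illustration, not ZFS's actual allocator.
import random

def extents_needed(free_extents, write_size):
    """Greedily consume free extents until the write fits."""
    count, remaining = 0, write_size
    for ext in free_extents:
        if remaining <= 0:
            break
        remaining -= ext
        count += 1
    return count

KIB = 1024
fresh_pool = [1024 * KIB] * 100                                    # big contiguous runs
aged_pool = [random.randint(4, 64) * KIB for _ in range(100_000)]  # small scattered holes

print("fresh pool, extents per 1 MiB write:", extents_needed(fresh_pool, 1024 * KIB))
print("aged pool,  extents per 1 MiB write:", extents_needed(aged_pool, 1024 * KIB))

With a disk managing on the order of a couple hundred seeks per second, that is how a vdev capable of streaming at hundreds of MB/s can end up writing at a few MB/s.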

If you keep pool occupancy really low, ZFS will remain able to write fast regardless of pool age or fragmentation.

If pool occupancy is high and fragmentation is heavy, ZFS won't be able to write anything fast. This is typically a scenario pools slowly age into. The Delphix guy just caused it to happen quickly.

There's lots of posts about ZFS admins running stuff like databases where performance degrades to the point where things are very bad, and they schedule a downtime to move all the data off the pool and back on. This is due to fragmentation and not having sufficient free space and ARC.

So this is really a practical application of compsci. To remediate a known problematic resource (seek IOPS), we can give ZFS lots of pool free space, and lots of ARC/L2ARC, and the problem "goes away".

This is what the Delphix graph is showing.
 

ChrisReeve

Explorer
Joined
Feb 21, 2019
Messages
91
I realize all this, but I still feel it could be misleading and an oversimplification to reference a graph showing a huge performance degradation at 50% full, when this just isn't necessarily the case. As you are saying, the performance loss will come over a long period of time, and (I assume) several cycles of filling up and then removing files, leading to more fragmentation.

It is a factor of both fill level and, for lack of a better term, write cycles (which lead to more fragmentation), right?

Do you agree that a single pool that has been filled to 50% once will not see this performance loss? That the performance loss will slowly become a factor as huge amounts of data (meaning a sizeable fraction of the total pool size) are overwritten/deleted/added?

Also, this is interesting and absolutely relevant, but I just want to reiterate: this thread was made as a reaction to several users claiming that a ZFS pool will never be able to deliver write speeds faster than single-drive write speed, no matter what. Which is wrong.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Do you agree that a single pool that has been filled to 50% once will not see this performance loss? That the performance loss will slowly become a factor as huge amounts of data (meaning a sizeable fraction of the total pool size) are overwritten/deleted/added?

That's the issue being addressed, obviously. Pool age and reaching a steady state level of performance.

Also, this is interesting and absolutely relevant, but I just want to reiterate: this thread was made as a reaction to several users claiming that a ZFS pool will never be able to deliver write speeds faster than single-drive write speed, no matter what. Which is wrong.

Equally wrong is to spread misinformation about how everybody is wrong just because you got fast speeds this one time on a fresh pool. I can break your pool and make it perform terribly. I can actually create a pathological situation that is worse than the Delphix steady state, I'm pretty sure, because the Delphix test is predicated on relatively normal pool usage whereas I can actually design a worst-case scenario.

See, the thing is, most people put together a ZFS system and run some trite benchmarks and OH MY GOD I CAN WRITE A ZILLION GIGABYTES PER SECOND IT IS THE BEST THING EVER. Then the practical reality hits them in the face a month, a year, five years later, whatever, after the pool has been heavily used and has become much slower.

I'm not interested in how fast ZFS can go when it's brand new. What I want is to know how fast I can rely on it to go, in production, years later.

So all your handwringing about memory and sequential performance on a new pool means very little to me.
 

HolyK

Ninja Turtle
Moderator
Joined
May 26, 2011
Messages
653
I can confirm what jgreco said from my own experience. I have an encrypted pool of 6x3TB WD drives in RAIDZ2 (so nothing huge) running for ~7 years or so. Part of it is an "archive" holding both big and small files. The other part is a quite busy area which is continuously changing (regular backups of a few PCs, temp data, etc...). I am very close to the 80% space soft-cap, with ZFS (free-space) fragmentation at ~70% or something (can't check now). The speeds are still OK for CIFS over a 1G network, BUT the raw/local performance of the pool is quite different than it was 7 years ago.
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
Hi, sorry for reviving an old thread. I want to switch to ZFS and something puzzles me. Fragmentation issues are as old as spinning disk platters. Fragmentation is a fact of life for any storage device with ever-changing data. On spinning disks this is dealt with by defragmentation, which is an extremely simple process: you move data around to try to get the data as contiguous as possible and the holes as small as possible.

I'm flabbergasted that ZFS has no such mechanism, so I keep assuming I'm wrong. My question is: am I wrong, and is there a way to deal with it?

I was BTW also led to believe that write speeds to a RAIDZ array are limited by the speed of the slowest disk in the vdev.

I'm not interested in how fast ZFS can go when it's brand new. What I want is to know how fast I can rely on it to go, in production, years later.
Exactly, so what can I expect?

So all your handwringing about memory and sequential performance on a new pool means very little to me.
I'd disagree; it's good to know what it's capable of and why it would degrade. Only then can you find a way to mitigate the issue.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,112
The above explains that performance degrades as the pool is used (new data written) because the pool is used.
Solutions: A/ Fill the pool in one go and don't touch it. B/ Backup, destroy the pool, recreate the pool and restore.
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
The above explains that performance degrades as the pool is used (new data written) because the pool is used.
Solutions: A/ Fill the pool in one go and don't touch it. B/ Backup, destroy the pool, recreate the pool and restore.
Well, I'd call them workarounds rather than solutions, but yeah, that seems to be the gist of it.

So I was wondering (I don't have the hardware to test on yet, though it should arrive in a few days): if I have a zpool, is there a way to reserve a specific block of space for a ZFS dataset on that zpool? This way you could do this in smaller chunks. Let's say I have a 12TB zpool and create 3x 3TB datasets, each with a reservation and quota of 3TB. Once one of them gets fragmented, I move ALL the data off that dataset and then move it back. Moving around blocks of, say, 3TB is much more feasible than blocks of 12TB, not to mention the 40TB pool I wish to create.

Also, this could easily be automated, couldn't it? Or am I oversimplifying things?
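
For context, the dataset-carving part of that idea would look roughly like this; pool and dataset names are made up, and this only reserves the space, it says nothing about whether shuffling data within the same pool actually helps:

Code:
# Hypothetical sketch: split a 12TB pool into three 3TB datasets, each pinned
# with both a reservation and a quota. Pool/dataset names are made up;
# needs root and an existing pool called "tank".
import subprocess

POOL = "tank"
for name in ("bucket1", "bucket2", "bucket3"):
    ds = f"{POOL}/{name}"
    subprocess.run(["zfs", "create", ds], check=True)
    subprocess.run(["zfs", "set", "reservation=3T", ds], check=True)
    subprocess.run(["zfs", "set", "quota=3T", ds], check=True)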
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,112
You're just missing the point. Performance degrades as a pool gets fragmented, so the solution (fixup, workaround, whatever—solution A/ was meant to be ironic) is to move to another pool. Moving between datasets in the same (fragmented) pool is useless at best.
ZFS is NOT a backup, so you're supposed to have a backup—ideally a second ZFS system to use automated replication (and then moving around a 40 TB snapshot over 10 GbE with zfs send / zfs receive is very feasible).
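
For completeness, a minimal sketch of that send/receive round trip, assuming a second pool named backuppool attached to the same box (over a network you would pipe the send through ssh instead); all names are hypothetical:

Code:
# Hypothetical sketch: snapshot the source dataset and replicate it in full
# to a second pool. Dataset, snapshot and pool names are made up.
import subprocess

SNAP = "tank/data@migrate"
subprocess.run(["zfs", "snapshot", "-r", SNAP], check=True)

# Full recursive send piped straight into receive on the second pool.
send = subprocess.Popen(["zfs", "send", "-R", SNAP], stdout=subprocess.PIPE)
subprocess.run(["zfs", "receive", "-F", "backuppool/data"], stdin=send.stdout, check=True)
send.stdout.close()
if send.wait() != 0:
    raise RuntimeError("zfs send failed")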
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
A/ was meant to be ironic
Hehe, didn't get that. This is a really good use case for ZFS though: archive storage.

You're just missing the point. Performance degrades as a pool gets fragmented, so the solution (fixup, workaround, whatever—solution A/ was meant to be ironic) is to move to another pool. Moving between datasets in the same (fragmented) pool is useless at best.
I'm sorry, but ZFS's purpose is to provide enterprise-level storage solutions. Also, somewhere in the documentation it says that the purpose is to provide a long-lived storage solution.

ZFS is NOT a backup, so you're supposed to have a backup—ideally a second ZFS system to use automated replication (and then moving around a 40 TB snapshot over 10 GbE with zfs send / zfs receive is very feasible).
Correct. Repairing fragmentation has little to do with backup. Also keep in mind that replication in and of itself may not be considered a backup either.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,906
On spinning disks this is dealt with by defragmentation, which is an extremely simple process: you move data around to try to get the data as contiguous as possible and the holes as small as possible.
This approach works relatively easily on file systems that are not Copy-on-Write (CoW). My guess is that it is not impossible on CoW file systems, but a lot more complicated. And so far nobody has spent the money on adding it to ZFS, which indicates that there is no obvious economic justification. At the end of the day the money for ZFS comes from enterprises where the requirements are, at least partially, fundamentally different from us home users. In such a context spending 100 k$ on a spare system as a means to get rid of fragmentation is not a big deal. In fact it will likely be a bargain compared to what proprietary storage systems cost.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
So there are basically several issues here to ponder. Here are a few deal-breakers:

ZFS has a variable block size. Any write that causes a large block, like a 1MB block, to be broken up into smaller blocks, with their own parity overhead, could make it physically impossible to fit the rewritten data back into the same place (and that is before we even consider that ZFS is a CoW filesystem).

ZFS *is* a CoW filesystem. So think about what happens. You have a metadata block that lists the blocks which make up a file. When you modify a byte in the middle of that file, you have to free a block that's in the middle of that file, allocate a new block for the file data and write it, PLUS, you have to rewrite the metadata block, which means that you actually allocate a new metadata block, update THAT, and free the old one.

Now you're sayin' to yourself, "well they mastered that in DOS years ago" and you'd sorta be right, but the problem is that ZFS introduces functionality like snapshots and clones, so any "defrag" functionality needs to also hunt down every consumer of a block and make sure that's updated, and it becomes a challenging process to analyze the structure of what could be a very large file, all of its snapshots and clones, take into account compression, and then make sure you have created an equivalent structure. And because a 1MB freed block of space that was previously in the middle of file data is big enough that it will likely attract smaller new blocks like metadata, you can't really just "put things back where they were".... It is probably impractical to do without rewriting the entire file and all its clones/snaps/etc, and that assumes a decent amount of free space being available.
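
As a rough illustration of that CoW update chain (a toy model, nothing like the real on-disk structures), here is how much new allocation a single-block overwrite drags along with it:

Code:
# Toy CoW model: overwriting one data block never touches it in place. It
# allocates a new data block, allocates a new copy of the metadata block that
# points at it, and only then can the old copies be freed, and only if no
# snapshot or clone still references them.
next_addr = 0

def allocate(payload):
    """Pretend allocator: every write lands at a fresh address."""
    global next_addr
    next_addr += 1
    return {"addr": next_addr, "payload": payload}

# A file: one metadata block pointing at three data blocks.
data = [allocate(b"A"), allocate(b"B"), allocate(b"C")]
meta = allocate([blk["addr"] for blk in data])

# Modify "a byte in the middle of the file":
new_data = allocate(b"B2")                        # 1. new data block, somewhere else
pointers = [blk["addr"] for blk in data]
pointers[1] = new_data["addr"]
new_meta = allocate(pointers)                     # 2. a new metadata block as well
freeable = [data[1]["addr"], meta["addr"]]        # 3. old copies, IF nothing still holds them

print("old metadata:", meta)
print("new metadata:", new_meta)
print("freeable only if unreferenced by snapshots/clones:", freeable)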
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
This approach works relatively easily on file systems that are not Copy-on-Write (CoW). My guess is that it is not impossible on CoW file systems, but a lot more complicated.
Yes, of course it's not as easy as I made it out to be. There are many things to be aware of and deal with, as @jgreco unsurprisingly described.

And so far nobody has spent the money on adding it to ZFS, which indicates that there is no obvious economic justification.
Well, not necessarily. Often it is a matter of it being too difficult to make someone who pushes numbers in Excel sheets understand the benefits (or rather, the risks of not doing it). Not being able to make a business case, or it being difficult to make one, does not mean there isn't one. It also doesn't mean there is one. I can only speculate.

At the end of the day the money for ZFS comes from enterprises where the requirements are, at least partially, fundamentally different from us home users. In such a context spending 100 k$ on a spare system as a means to get rid of fragmentation is not a big deal. In fact it will likely be a bargain compared to what proprietary storage systems cost.
Yes, the requirements are definitely different from those of home users and also vary very much per use case. A storage system for archiving has fundamentally different requirements compared to one which will host databases.

The problem with getting rid of said fragmentation is not just spending 100 k$. It also increases the complexity of the solution for cases where this might not be required, which in turn drives up the setup/support/maintenance costs. Also, you need networking and bandwidth to support it, which is another 100 k$ (and 100 k$ is nothing in these cases).

Now you have replication set up, but to mitigate the fragmentation you need to either plan downtime or have all consumers switch to your replication destination, which then temporarily becomes the master (and which will be just as fragmented).

Then you need to remove the data from the fragmented storage and repropagate it, preferably including snapshots. Doing this means you need at least a third system holding the data, because you NEED a backup, and probably also a fourth, because you need a failover.

So you set it up as a round robin and automate it...
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
@jgreco thank you for the very clear and well-thought-out reply. Just to be clear, I'm not trying to contradict you, just verifying that I understand and maybe adding a fresh perspective. Although I doubt I have anything to contribute that you did not see coming a [enter a large value in your preferred unit of measure for distance] away ;)

ZFS has a variable block size. Any write that causes a large block, like a 1MB block, to be broken up into smaller blocks, with their own parity overhead, could make it physically impossible to fit the rewritten data back into the same place (and that is before we even consider that ZFS is a CoW filesystem).
This is true for just about any file system, isn't it? Or have I misinterpreted or misunderstood something?

ZFS *is* a CoW filesystem. So think about what happens. You have a metadata block that lists the blocks which make up a file. When you modify a byte in the middle of that file, you have to free a block that's in the middle of that file, allocate a new block for the file data and write it, PLUS, you have to rewrite the metadata block, which means that you actually allocate a new metadata block, update THAT, and free the old one.
So as I understand it, CoW is (simplified):
- On creation, write the data and create references.
- On update, write the new/modified data, update the references, and remove the old, now-deprecated data.

As with any file system, you have the metadata, which contains all the information relevant to the file and its structure (pointers, timestamps, and whatever else the file system requires to interpret and manage the file). Because of CoW you do indeed have to create both anew (mutated data blocks and metadata) and then remove the old. I'm not sure why the update step is there, since I thought the newly written block as well as the newly written metadata already contain the modification.

I also thought the data and metadata could be kept on independent vdevs, in what is called a fusion pool.

As an aside, SQL Server for instance uses CoW and provides defragmentation facilities. Not only that, but if you have many mutations on your DB, it's a requirement to do this unless you want to take a sabbatical every time you wish to query data from it :P

Now you're sayin' to yourself, "well they mastered that in DOS years ago" and you'd sorta be right,
Well, yes and no. Yes, as in I view it as a basic necessity for a filesystem, let alone an enterprise-grade one. No, as in being able to make a paper plane that flies (or actually glides) does not suddenly make you capable of designing and building a Jumbo Jet that flies.

but the problem is that ZFS introduces functionality like snapshots and clones, so any "defrag" functionality needs to also hunt down every consumer of a block and make sure that's updated
Well, not necessarily. There are ways around this. For instance, by keeping a record of all active consumers, you can notify each consumer of said change; once there are no more active consumers, the old data and metadata are freed. Now this has quite a few implications of course, but it's far from impossible.
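
A toy sketch of the bookkeeping I am proposing (to be clear, this is my own speculation, not anything ZFS actually does):

Code:
# Toy sketch: every block tracks its active consumers (live filesystem,
# snapshots, clones); a block is only freed once its last consumer lets go.
class Block:
    def __init__(self, addr):
        self.addr = addr
        self.consumers = set()

    def add_consumer(self, name):
        self.consumers.add(name)

    def drop_consumer(self, name, free_list):
        self.consumers.discard(name)
        if not self.consumers:            # nothing references it any more
            free_list.append(self.addr)

free_list = []
blk = Block(addr=42)
blk.add_consumer("live-filesystem")
blk.add_consumer("snapshot-2021-03")

blk.drop_consumer("live-filesystem", free_list)   # still held by the snapshot
blk.drop_consumer("snapshot-2021-03", free_list)  # now it can be freed
print(free_list)                                  # [42]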

, and it becomes a challenging process to analyze the structure of what could be a very large file, all of its snapshots and clones, take into account compression, and then make sure you have created an equivalent structure.
Agreed! Very resource heavy!

And because a 1MB freed block of space that was previously in the middle of file data is big enough that it will likely attract smaller new blocks like metadata, you can't really just "put things back where they were"....
True, isn't it a best practice to have the metadata on a Fusion pool though?

It is probably impractical to do without rewriting the entire file and all its clones/snaps/etc,
Definitely true and completely acceptable. I actually think you'll have to do this twice: once to clean up a big chunk of space (preferably keeping a reservation on it so it will not be filled by something else in the meantime), and then to put the data back in a nice and tidy manner. Preferably in an intelligent manner, placing cold data at the front and getting hotter the farther back you go.

Maybe even add intelligence to differentiate between clones/snaps and current data, since chances are that fewer modifications happen on snaps; basically the only modification possible is removing them (unless they get elevated to current, if that is possible). Or, even better, have different algorithms for different use cases.

and that assumes a decent amount of free space being available.
Obviously, as with any form of defragmentation. However, for ZFS this might be quite a bit higher than for its brethren, which is acceptable. Luckily most users are accustomed to leaving 20% of a zpool empty, something that then becomes much more logical and acceptable in my opinion. I'd actually say give users the option to set a reservation/soft limit on a zpool just for this purpose.
 