
RAIDz (single vdev) write speeds are NOT limited to single drive write speed!


ChrisReeve

Member
Joined
Feb 21, 2019
Messages
89
Good morning.

This really shouldn't be news to anyone, but I see the opposite incorrectly claimed here on a daily basis: stream writes are not limited to single-drive performance. I believe the misunderstanding stems from users mixing up stream read/write with IOPS, which are limited to single-drive performance.

Stream read/write throughput scales up to (N - p) × single-drive throughput, where N = total number of drives in the vdev and p = number of parity drives.

Example:
I have a RAIDz2 pool consisting of one vdev with 10x 10TB WD100EMAZ drives. Assuming ~250 IOPS and ~180 MB/s per drive, the theoretical maximum performance is:

Stream read: 1440 MB/s
Stream write: 1440 MB/s
IOPS: 250
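
For anyone who wants to play with the numbers, here is the same back-of-the-envelope math as a small Python sketch. The per-drive figures are assumptions, and real-world bottlenecks (CPU, network, ZFS overhead) are ignored:

```python
# Rough theoretical ceilings for a single RAIDZ vdev (a sketch, not a benchmark).
# Assumes every data drive streams at the same rate.

def raidz_theoretical(n_drives, parity, mbps_per_drive, iops_per_drive):
    data_drives = n_drives - parity
    return {
        "stream_read_MBps": data_drives * mbps_per_drive,
        "stream_write_MBps": data_drives * mbps_per_drive,
        "iops": iops_per_drive,  # a RAIDZ vdev behaves like a single drive for IOPS
    }

# 10x WD100EMAZ in RAIDZ2, assuming ~180 MB/s and ~250 IOPS per drive
print(raidz_theoretical(10, 2, 180, 250))
# {'stream_read_MBps': 1440, 'stream_write_MBps': 1440, 'iops': 250}
```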

There are of course other bottlenecks. For the system in question, running a Xeon E5-2650v2 (2.6 GHz base, 3.3 GHz boost) with 64 GB RAM, no cache drive, and a pool that is both encrypted and compressed, I see sustained writes in excess of 600 MB/s over SMB for several hundred gigabytes (about 260 GB divided between 13 large video files).

Again, this is not new information, but there are still a massive number of users incorrectly claiming otherwise. Please stop spreading misinformation.

Source: https://www.ixsystems.com/blog/zfs-pool-performance-2/
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
12,828
Nice try but if you're going to say "stop spreading misinformation", then so am I.

Here are some facts.

ZFS write speeds will generally be fast as long as there are large contiguous ranges of free space. You also need sufficient memory to hold the metadata ZFS needs to find the free space. It doesn't matter whether you are writing sequential data or random data; ZFS will do a good job of writing at speeds that can easily approach the sum of what the underlying devices can handle. So you are partially correct.

The problem comes about when there's fragmentation. Then, something interesting happens. The maximum possible write speed falls. When ZFS has to start working to find free space regions, and the drives have to seek in order to reach them, this is a performance killer. In fact, you can create pathological situations where the maximum write speed falls to a very small fraction of what your claimed "stream" numbers are; I'd put good money on being able to get to 1/100th of those speeds. And this happens whether you are writing sequential or random data, because at the end of the day, ZFS doesn't care too much what is in a transaction group.
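
To put a toy number on that, here is a cartoon of seek cost, not a model of the ZFS allocator: assume writing into each free-space extent costs one seek plus the time to stream the extent, and watch what happens as the extents shrink. The drive figures are illustrative assumptions.

```python
# Toy seek-cost model: effective write throughput when free space is chopped
# into small extents and every extent costs a seek. Illustrative numbers only.

SEEK_S = 0.010        # ~10 ms average seek + rotational latency
STREAM_MBPS = 180.0   # sequential streaming rate of one drive

def effective_mbps(extent_mb):
    """MB/s achieved when writing into free extents of the given size."""
    return extent_mb / (SEEK_S + extent_mb / STREAM_MBPS)

for extent_mb in (1024, 128, 16, 1, 0.128, 0.016):
    print(f"{extent_mb:>8} MB extents -> {effective_mbps(extent_mb):6.1f} MB/s")
# Big extents stay near 180 MB/s; 128 KB extents manage only about a dozen MB/s,
# and 16 KB extents land under 2 MB/s -- the 1/100th territory mentioned above.
```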

A RAIDZ vdev tends to adopt the IOPS characteristics of its component devices, and more specifically, the slowest component device.

Don't make the mistake of judging ZFS performance on a fresh pool. The real torture test is on a well-used aged pool, and then it gets really interesting to see how performance is affected.
 

ChrisReeve

Member
Joined
Feb 21, 2019
Messages
89
Thank you for the extra insight. I have previously read this post of yours (among others). But we can at least put to rest the claim that ZFS write speeds will never go beyond the performance of a single drive. :)

That being said, I am curious about the testing methodology in the post where you go into detail about how write speeds are affected by this. You wrote: "Our friends at Delphix did an analysis of steady state performance of ZFS write speeds vs pool occupancy on a single-drive pool and came up with this graph:" (I am unable to link to the image from this computer). The graph then shows performance dropping from about 6 MB/s at 10% pool occupancy to around 1 MB/s at 50% occupancy, roughly six times slower!

Again, my pool is fairly fresh (built around six months ago) and almost exactly 50% full, and I still see sustained write speeds of 600+ MB/s, close to half of my theoretical maximum ((N - p) × per-drive speed). The test was done, as stated in my original post, with 13 large media files totaling 260 GB, far beyond my 64 GB of RAM, of which, I believe, only 1/8 is used as write cache by default.
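
For the sake of argument, here are rough numbers on why RAM caching cannot explain that result. This is only a sketch; the 1/8 write-cache figure is my assumption from above, and newer ZFS versions cap in-flight dirty data differently:

```python
# Back-of-the-envelope: how much of the 260 GB copy could RAM have absorbed?
# Assumes ~1/8 of RAM usable as in-flight write cache (an assumption, see above).

ram_gb = 64
copy_gb = 260
observed_mbps = 600

cache_gb = ram_gb / 8                       # ~8 GB of dirty data in flight
cached_fraction = cache_gb / copy_gb        # ~3% of the transfer
duration_min = copy_gb * 1024 / observed_mbps / 60

print(f"cache covers {cached_fraction:.1%} of the copy, which ran ~{duration_min:.0f} minutes")
# The remaining ~97% had to hit the disks at the observed rate, so 600+ MB/s
# reflects sustained pool throughput rather than RAM.
```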

Why do you believe this is? Is it possible that Delphix's testing on a single-drive pool, with extremely poor performance to begin with (6 MB/s?), doesn't translate accurately to real-world setups?

Edit: I am on a crappy work client running an old version of Internet Explorer. Is this the article you are referring to? https://www.delphix.com/blog/delphix-engineering/zfs-write-performance-impact-fragmentation

I am unable to get the graphs to load, but if that is the one, it is seven years old. If so, is it possible that ZFS performance has improved since then? It would be interesting to do some testing of my own, but I don't have a spare hard drive lying around.
 
Last edited:

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
12,828
No, this is a compsci thing, not a Delphix-made-an-error thing, not a ZFS-performance-could-have-been-fixed thing.

ZFS is essentially trading resources for speed.

The Delphix graph was generated using a single-drive pool. That really has nothing to do with anything except that it made it possible to produce the graph in a reasonable amount of time. It was intended to measure a behavior that is known about ZFS and other CoW filesystems.

Your pool is fast because it is only lightly fragmented and still has large amounts of contiguous free space.

The ultimate test of ZFS is not this condition. It is the far side. When the pool is old. When the pool is as fragmented as it is likely to get. Once performance has degraded as far as it is going to, you reach something we call "steady state", the state where performance has settled at a level where it is no longer degrading, and only varying by a little bit over time.

The Delphix graph shows steady state performance at various pool capacities.

ZFS goes fast when it can, for example, allocate a 1MB block and write it into contiguous free space.

However, if you have a pool that is 90% full and has been seeing lots of updates for years, you run into a situation where you have little blocks of free space spread out sorta-evenly throughout the pool. Writing a large file then requires ZFS to allocate lots of little noncontiguous blocks all over the place and seek to them.

Those are the extremes.

ZFS can attain SSD-like write speeds on an HDD pool when the pool is relatively empty. The reason for this is that large contiguous runs of free space are allocated for the transaction group, so when the txg is laid down, it's a largely sequential write. It doesn't matter if you *think* you are writing sequential data (like a media file) or random data (like a database or VM block store). ZFS is a copy-on-write filesystem, so a write never overwrites existing data in its existing location; it always allocates free space somewhere else. Note that this implies fragmentation increases over time, AND that reads of data you might expect to be stored sequentially on disk often aren't sequential at all. ZFS relies on ARC and L2ARC to sort out the fragmentation performance issues on the read side.
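
If it helps, here is a toy illustration of why copy-on-write chews up contiguous free space even when the application thinks it is "rewriting in place". This is a deliberate cartoon, nothing like the real allocator:

```python
import random

# Toy CoW sketch: a "pool" of 10,000 blocks, half of them holding one contiguous
# file. Every "overwrite" of a record writes a new copy into free space and
# frees the old block -- nothing is ever updated in place.

random.seed(1)
POOL = 10_000
file_blocks = list(range(POOL // 2))                    # file: blocks 0..4999
free_blocks = list(range(POOL - 1, POOL // 2 - 1, -1))  # free: blocks 9999..5000

def runs(blocks):
    """Count contiguous runs in a collection of block addresses."""
    count, prev = 0, None
    for b in sorted(blocks):
        count += (prev is None or b != prev + 1)
        prev = b
    return count

print("free-space runs before:", runs(free_blocks))  # 1 big contiguous region
for _ in range(2000):
    i = random.randrange(len(file_blocks))  # application rewrites record i
    free_blocks.append(file_blocks[i])      # old copy becomes a free hole...
    file_blocks[i] = free_blocks.pop(0)     # ...new copy lands in fresh space
print("free-space runs after :", runs(free_blocks))  # hundreds of scattered holes
print("file runs after       :", runs(file_blocks))  # the "sequential" file is scattered too
```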

If you keep pool occupancy really low, ZFS will remain able to write fast regardless of pool age or fragmentation.

If pool occupancy is high and fragmentation is heavy, ZFS won't be able to write anything fast. This is typically a scenario pools slowly age into. The Delphix guy just caused it to happen quickly.

There are lots of posts about ZFS admins running stuff like databases where performance degrades to the point where things are very bad, and they schedule downtime to move all the data off the pool and back on. This is due to fragmentation and not having sufficient free space and ARC.

So this is really a practical application of compsci. To remediate a known problematic resource (seek IOPS), we can give ZFS lots of pool free space, and lots of ARC/L2ARC, and the problem "goes away".

This is what the Delphix graph is showing.
 

ChrisReeve

Member
Joined
Feb 21, 2019
Messages
89
I realize all this, but I still feel it is misleading and an oversimplification to reference a graph showing huge performance degradation at 50% full, when that just isn't necessarily the case. As you are saying, the performance loss comes over a long period of time, and (I assume) over several cycles of filling up and then removing files, leading to more fragmentation.

It is a factor of both fill level and, for lack of a better term, write cycles (which lead to more fragmentation), right?

Do you agree that a single pool that has been filled to 50% once will not see this performance loss? That the performance loss will slowly become a factor as huge amounts of data (meaning a sizeable fraction of the total pool size) has been overwritten/deleted/added?
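
To put rough numbers on my own question, here is a quick extension of the toy copy-on-write sketch above, varying both fill level and the amount of rewrite churn. Purely illustrative; occupancy stays constant while the churn chops up the free space:

```python
import random

# Count how many pieces the free space ends up in for a given fill level and
# amount of rewrite churn (same cartoon allocator as the sketch above).

def free_runs(pool=10_000, occupancy=0.5, rewrites=0, seed=1):
    rng = random.Random(seed)
    used = int(pool * occupancy)
    live = list(range(used))                    # one contiguous "file"
    free = list(range(pool - 1, used - 1, -1))  # one contiguous free region
    for _ in range(rewrites):
        i = rng.randrange(used)
        free.append(live[i])                    # CoW: old block freed...
        live[i] = free.pop(0)                   # ...new copy written elsewhere
    count, prev = 0, None
    for b in sorted(free):
        count += (prev is None or b != prev + 1)
        prev = b
    return count

for occ in (0.3, 0.5, 0.8):
    for churn in (0, 1000, 4000):
        print(f"{occ:.0%} full, {churn:>4} rewrites -> "
              f"{free_runs(occupancy=occ, rewrites=churn)} free-space runs")
```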

Also, this is interesting and absolutely relevant, but I just want to reiterate: This thread was made as a reaction to several users claiming that a ZFS pool will never be able to deliver write speeds faster than single-drive write speed, no matter what. Which is wrong.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
12,828
Do you agree that a single pool that has been filled to 50% once will not see this performance loss? That the performance loss will slowly become a factor as huge amounts of data (meaning a sizeable fraction of the total pool size) has been overwritten/deleted/added?
That's the issue being addressed, obviously. Pool age and reaching a steady state level of performance.

Also, this is interesting and absolutely relevant, but I just want to reiterate: This thread was made as a reaction to several users claiming that a ZFS pool will never be able to deliver write speeds faster than single-drive write speed, no matter what. Which is wrong.
Equally wrong is to spread misinformation about how everybody is wrong just because you got fast speeds this one time on a fresh pool. I can break your pool and make it perform terribly. I'm pretty sure I can create a pathological situation that is worse than the Delphix steady state, because the Delphix test is predicated on relatively normal pool usage, whereas I can design a worst-case scenario.

See, the thing is, most people put together a ZFS system and run some trite benchmarks and OH MY GOD I CAN WRITE A ZILLION GIGABYTES PER SECOND IT IS THE BEST THING EVER. Then the practical reality hits them in the face a month, a year, five years later, whatever, after the pool has been heavily used and has become much slower.

I'm not interested in how fast ZFS can go when it's brand new. What I want is to know how fast I can rely on it to go, in production, years later.

So all your handwringing about memory and sequential performance on a new pool means very little to me.
 

HolyK

Ninja Turtle
Moderator
Joined
May 26, 2011
Messages
587
I can confirm what jgreco said from my own experience. I have an encrypted pool of 6x 3TB WD drives in RAIDZ2 (so nothing huge) that has been running for ~7 years. Part of it is an "archive" holding both big and small files. The other part is a quite busy area that changes continuously (regular backups of a few PCs, temp data, etc.). I am very close to the 80% space soft cap, with ZFS free-space fragmentation at ~70% or so (can't check right now). The speeds are still OK for CIFS over a 1G network, BUT the raw/local performance of the pool is quite different from what it was 7 years ago.
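
For reference, used capacity and free-space fragmentation are plain zpool properties, so a quick scripted check looks roughly like this ("tank" is a placeholder pool name):

```python
import subprocess

# Read pool capacity and free-space fragmentation from zpool (sketch).
def pool_stats(pool="tank"):
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,capacity,fragmentation", pool],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    name, cap, frag = out.split("\t")
    return {"pool": name, "capacity": cap, "fragmentation": frag}

print(pool_stats())  # e.g. {'pool': 'tank', 'capacity': '78%', 'fragmentation': '65%'}
```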
 