ZFS fragmentation issues

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,445
Thanks
2,773
#1
I've said for some time now that ZFS and iSCSI may not be a match made in heaven, due to the inevitable fragmentation caused by copy-on-write for block updates.

This article shows real world results from someone who did similar work with a database, and ended up with severe fragmentation.

I just thought I'd post a pointer to this experience for anyone else who is interested in the topic. As usual, this doesn't mean that you cannot use ZFS for iSCSI applications, but it does suggest that both the iSCSI clients and the storage system ought to be designed carefully with the issue in mind.
 

paleoN

FreeNAS Guru
Joined
Apr 22, 2012
Messages
1,403
Thanks
21
#2
This article shows real world results from someone who did similar work with a database, and ended up with severe fragmentation.
Or "real world" results. ;) I'm not sure what the minimum number of metaslabs ZFS creates, but on a 1GB disk mine were 8MB apiece. Being so small, they will cause even greater "artificial" fragmentation, as each becomes full rather quickly. Still interesting, though.
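For what it's worth, the 8MB figure can be reproduced with a rough sketch of the historical sizing rule (roughly 200 metaslabs per top-level vdev, power-of-two sized). This is an illustration of the rule of thumb, not the actual ZFS source:

```python
# Illustrative sketch (not the actual ZFS code): historically, ZFS picks
# the smallest power-of-two metaslab size that yields at most ~200
# metaslabs per top-level vdev. That reproduces 8MB metaslabs on a
# 1GB disk.

def metaslab_size(vdev_bytes, target_count=200):
    """Approximate power-of-two metaslab size for a vdev of vdev_bytes."""
    size = 512  # start at the minimum sector size
    while vdev_bytes // size > target_count:
        size *= 2
    return size

GiB = 1 << 30
MiB = 1 << 20

ms = metaslab_size(1 * GiB)
print(ms // MiB, "MiB per metaslab")   # 8 MiB
print((1 * GiB) // ms, "metaslabs")    # 128
```

A 16GB disk works out to 128MB metaslabs by the same rule, which is why tiny test pools exaggerate the per-metaslab effects.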
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,445
Thanks
2,773
#3
Being "so small" is largely irrelevant; it's really the percentage full that matters most. That's one of the hard lessons for administrators to wrap their heads around: having a terabyte free is great if your storage system is two terabytes, but woefully insufficient at 80 terabytes.

It seems pretty clear to me at this point that iSCSI is kind-of workable on ZFS, but sequential read operations on an iSCSI device will present problems where those blocks have been updated, so the ideal case is to have SSD L2ARC (or ARC!) large enough to store all changed blocks that are likely to be read. For writes, it is necessary to have a large bucket of free space so that writes for a transaction group end up being generally contiguous; if there's a high write volume and low free space, pool I/O skyrockets as the drives have to work harder doing more seeks. And by "large bucket" I do mean something like 40% or more free on your pool. ZFS setups meeting both of those conditions, along with the tuning issues I talk about in 1531, seem very pleasant and responsive for any workload I've tried.
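A quick way to keep an eye on that headroom is to parse `zpool list` in scripted/parsable mode. This is a hedged sketch; the pool name is hypothetical:

```python
# Sketch: report whether a pool keeps the ~40%-free headroom discussed
# above. Uses `zpool list -Hp` (scripted mode, exact byte counts).
import subprocess

def parse_free_fraction(line):
    """Parse a 'size free' pair of byte counts into a free fraction."""
    size, free = (int(x) for x in line.split())
    return free / size

def pool_free_fraction(pool):
    out = subprocess.run(
        ["zpool", "list", "-Hp", "-o", "size,free", pool],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_fraction(out)

# Example on a live system ("tank" is a hypothetical pool name):
# frac = pool_free_fraction("tank")
# print(f"{frac:.0%} free -", "OK" if frac >= 0.40 else "too full")
```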
 

paleoN

FreeNAS Guru
Joined
Apr 22, 2012
Messages
1,403
Thanks
21
#4
Being "so small" is largely irrelevant, it's really the percentage full that matters most.
Yes, I don't disagree. My point was that a lot fewer 8K blocks fit in an 8MB metaslab than in a 16GB one, which means they will fill quicker, and ZFS is going to be changing them much more often, which will cause further fragmentation.

Then again, perhaps 8K isn't significant vs 8MB and I haven't thought it through all the way? Definitely not: in my thinking I wasn't scaling this back up properly, and it's not like the metaslabs are 32K.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,445
Thanks
2,773
#6
Yes, I don't disagree. My point was that a lot fewer 8K blocks fit in an 8MB metaslab than in a 16GB one, which means they will fill quicker, and ZFS is going to be changing them much more often, which will cause further fragmentation.

Then again, perhaps 8K isn't significant vs 8MB and I haven't thought it through all the way? Definitely not: in my thinking I wasn't scaling this back up properly, and it's not like the metaslabs are 32K.
Well, I see what you're thinking, but I'm pretty sure you figured out a way to look at it that helps you see why it doesn't really have the impact you were expecting. 8KB is more significant against 8MB than against a larger metaslab, but since some effort is made to distribute allocations into different metaslabs (see metaslab_weight for the picking algorithm), and 8KB is only 0.1% of 8MB, we're not really talking a significant percentage. Given that I'm advocating free space of 40% or more on a pool to maintain best performance, it is unlikely that any given metaslab would fall significantly short on free space. You'll still get lots of fragmentation, of course! There's also a danger that you could design something that's totally pathological (VERY possible), and it's quite likely that it would turn pathological at a lower threshold with the larger ratio of 8KB-to-8MB. If you can break the filesystem free map up enough that efficient allocations cannot happen even at 40% free space, I'm betting there'll be some tears, heh. It'd be interesting to try, and I think a simple way to play with it would be to randomly write blocks within a file and see if speed degrades over time.
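A quick-and-dirty version of that experiment might look like this. Sizes are deliberately tiny so the sketch runs anywhere; for a real run, point the file at the pool under test and scale FILE_SIZE up to many gigabytes:

```python
# Repeatedly write 8 KiB blocks at random offsets inside a preallocated
# file and watch whether throughput degrades across passes. On a
# copy-on-write filesystem each overwrite lands somewhere new, so on a
# full pool the write (and later sequential read) speed should sag over
# time. The temp file and sizes here are placeholders for a real test.
import os, random, tempfile, time

FILE_SIZE = 16 * 1024 * 1024   # 16 MiB demo size; use many GiB for real
BLOCK = 8192                   # 8 KiB, a typical zvol volblocksize
WRITES_PER_PASS = 512

def run_pass(fd):
    buf = os.urandom(BLOCK)
    start = time.monotonic()
    for _ in range(WRITES_PER_PASS):
        offset = random.randrange(FILE_SIZE // BLOCK) * BLOCK
        os.pwrite(fd, buf, offset)
    os.fsync(fd)
    elapsed = max(time.monotonic() - start, 1e-9)
    return WRITES_PER_PASS * BLOCK / elapsed  # bytes/sec

fd, path = tempfile.mkstemp()  # swap for a path on the pool under test
os.truncate(fd, FILE_SIZE)
rates = [run_pass(fd) for _ in range(3)]
for i, r in enumerate(rates):
    print(f"pass {i}: {r / 1e6:.1f} MB/s")
os.close(fd)
os.unlink(path)
```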

- - - Updated - - -

Perhaps I am being dumb but what is the 1530 tuning link??

Thx
No, I was being coffee--, 1531.

- - - Updated - - -

This aggregation of posts thing sucks.
 

toadman

FreeNAS Guru
Joined
Jun 4, 2013
Messages
583
Thanks
103
#8
It seems pretty clear to me at this point that iSCSI is kind-of workable on ZFS, but sequential read operations on an iSCSI device will present problems where those blocks have been updated, so the ideal case is to have SSD L2ARC (or ARC!) large enough to store all changed blocks that are likely to be read. For writes, it is necessary to have a large bucket of free space so that writes for a transaction group end up being generally contiguous; if there's a high write volume and low free space, pool I/O skyrockets as the drives have to work harder doing more seeks. And by "large bucket" I do mean something like 40% or more free on your pool. ZFS setups meeting both of those conditions, along with the tuning issues I talk about in 1531, seem very pleasant and responsive for any workload I've tried.
At the risk of resuscitating an old thread... (but it seems appropriate for the topic vs. a new one)

There have been a couple of recent threads about fragmentation. It got me thinking a bit more about flash. I submit that in 2016 we're entering what could be termed the golden age of flash: all-flash arrays, and flash inexpensive enough that at minimum a caching tier (ARC/L2ARC/SLOG) should exist in most decent arrays.

Obviously in an all flash array the fragmentation issue is largely irrelevant. (Though optimizations can occur if one manages the flash at the system level vs. the device level.) The question is more about how much flash would be required to "effectively" hide the performance issues caused by fragmentation on disk.

jgreco gives a pretty good argument for leaving 40% of a pool unused. It would be an interesting study to come up with a way to size ARC and L2ARC for a given workload that would make disk fragmentation effectively irrelevant. I haven't thought in depth about it much. Perhaps it would be a good thesis study or something. :)

Nonetheless, an interesting topic, I think. Maybe jgreco's proposed 40% could be reduced (significantly).
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,445
Thanks
2,773
#9
I'm not sure I want to revive a three year old thread. However, it is good to point out that I've basically been saying the same things for years.

The main problem is that most people are not willing to throw down for an amount of RAM that would be appropriate to the task. You really need to be at a minimum of 64GB to play in the iSCSI game, and potentially more (or a lot more).

Creating an all-SSD array seems like an attractive "fix", but the truth is that fragmentation affects SSDs as well. SSDs work with blocks and pages, which are some smallish multiple of the 4K advanced-format sector size. You're just trading one issue for a more complex one. Especially if you have some hope that RAIDZ on SSD will somehow be totally awesome ... it won't be, not for block storage.

The 40% free is, if anything, a bit of an underestimate. The Delphix blog shows that the eventual steady-state performance at 60% full is pretty miserable:

[graph from the Delphix blog: steady-state write throughput vs. pool occupancy]

You probably want to be more around 30-40% occupancy, or less if practical.

Write speeds can be maintained by keeping occupancy percentage low. Read speeds can be mitigated by sizing the ARC plus L2ARC to hold the working set (or more). Nothing new here.
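As a back-of-the-envelope example of that sizing, where every number is hypothetical and the per-record L2ARC header cost is only an order-of-magnitude figure:

```python
# Rough working-set sizing sketch: ARC + L2ARC should cover the blocks
# that actually get re-read, not the whole pool. All numbers below are
# hypothetical examples; the ~70-byte L2ARC header cost is only an
# order-of-magnitude figure and varies by ZFS version.
GiB = 1 << 30

working_set = 600 * GiB            # estimated hot data
ram = 128 * GiB
arc = int(0.8 * ram)               # rough usable ARC out of system RAM

l2arc_needed = max(0, working_set - arc)
recordsize = 8192                  # matching an 8K zvol volblocksize
header_cost = (l2arc_needed // recordsize) * 70  # ARC spent indexing L2ARC

print(f"L2ARC to cover working set: {l2arc_needed / GiB:.0f} GiB")
print(f"ARC consumed by L2ARC headers: {header_cost / GiB:.1f} GiB")
```

Note the second number: oversizing L2ARC eats into the ARC itself, which is one reason "just add more L2ARC" stops paying off.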
 

toadman

FreeNAS Guru
Joined
Jun 4, 2013
Messages
583
Thanks
103
#10
So for sustained throughput workloads, that is ugly. ~15% performance at 50% full. (Would be interesting to see the same graph for an all flash array.)

Only in the case of bursty workloads can the flash caching layer (or ARC) help. Makes sense. Clearly.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,445
Thanks
2,773
#11
Well, it's helpful to be really clear on what you're seeing here. This is the ability of the pool to sustain a given level of random throughput at a particular occupancy level. This is pretty much "worst case" behaviour, and not what you should actually expect, unless you're running databases or virtual machines, doing lots of writes, and you've been doing it long enough for things to reach a steady state.

What's more important is what the chart implies.

For a random I/O workload on a standard hard disk capable of, say, 100 IOPS, 4K times 100 IOPS is only 400KB/sec. Because ZFS is taking random writes and aggregating them into a transaction group, that translates to FIFTEEN TIMES FASTER at 10% occupancy - the ZFS device can sustain what appears to the application to be 1500 IOPS, or 6000KB/sec. As you fill the pool, performance falls back toward the number of IOPS the underlying disk can support, because fundamentally, if ZFS can't aggregate into sequential writes, it has to seek for small blocks.
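Spelling out that arithmetic (the 15x factor is the approximate low-occupancy speedup from the measurements discussed above):

```python
# A disk doing ~100 random IOPS at 4 KiB per I/O moves about 400 KiB/sec
# on its own; with ZFS aggregating those random writes into
# mostly-sequential transaction-group writes, the low-occupancy
# effective rate is ~15x that.
disk_iops = 100
block = 4 * 1024                       # 4 KiB per I/O

raw_rate = disk_iops * block           # bytes/sec without aggregation
speedup = 15                           # approximate factor at ~10% occupancy
effective_rate = raw_rate * speedup

print(raw_rate // 1024, "KiB/s raw")               # 400
print(effective_rate // 1024, "KiB/s effective")   # 6000
print(effective_rate // block, "apparent IOPS")    # 1500
```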

The other easily-missed thing here is that this also applies to sequential writes. Once there's a huge amount of fragmentation on the pool, even what we'd normally think of as sequential writes are also slowed down, which might really tick you off if you're used to your hard drive pulling 200MB/sec write speeds and suddenly you're getting 1MB/sec because your pool is fairly full and fairly fragmented.

If a light bulb goes off in your head and you suddenly realize that ZFS write speeds aren't really expected to be substantially different for random vs sequential workloads, ... good. It's all tied closely to fragmentation and pool occupancy - not the randomness of the workload, at least not all that much.

Reads, on the other hand, that's another matter. The biggest thing is that the tendency of a CoW filesystem like ZFS to deposit data wherever is convenient means that things we might expect to be sequential (large files, VM disk files) might or might not be. If I lay down an ISO on a relatively empty pool and never touch it, obviously it remains generally sequential. A VM disk file will also be laid down sequentially, but every update will involve shuffled blocks. That's where the ARC and the L2ARC can come in to help win the day. Your big VM disk file may be as scrambled up as the pieces of a puzzle still unassembled in a box.

Memory prices have now fallen to the point where 128GB of RAM is only ~$800, and a pair of competent NVMe L2ARC devices is about ~$700, so for $1500 you can outfit ZFS with a terabyte of cache on a Xeon E5. This will make sense for some people to do.
 

cyberjock

Moderator
Moderator
Joined
Mar 25, 2012
Messages
19,156
Thanks
1,835
#12
So for sustained throughput workloads, that is ugly. ~15% performance at 50% full. (Would be interesting to see the same graph for an all flash array.)
I don't have any cool graphs. I can tell you that a few customers have TrueNAS Z50s (all-flash storage) and they have been amazing beyond words. I won't give names or anything, but I've seen an all-flash array for VMs as well as an all-flash zpool for video rendering (think Toy Story) and in both scenarios the real-world workload doesn't even tax the zpool compared to what it can actually do after many months of usage in production.

So my opinion would be that the zpool curve of performance versus %full would be nearly horizontal, except perhaps once you get to 90% full (or wherever the SSDs can no longer keep themselves well-groomed and trimmed), at which point performance probably drops to some really painful value. Also, the performance curve we see on hard disks exists because physical head motion is required; those kinds of problems simply do not exist in SSDs.

But to me that says less about the %full and more about how busy the zpool is and how aggressively the SSDs trim themselves.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,445
Thanks
2,773
#13
Basically at that scale you're taxing the host system more than you are the SSD's. You have the potential to run into all new fun issues. :)
 