Baffling performance issues with large ZFS pool

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Since you've done so much work on this use case: What are your go-to recommendations for SSD models for a special alloc mirror vdev?
I'm sure the answer will be Optane :smile:
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Since you've done so much work on this use case: What are your go-to recommendations for SSD models for a special alloc mirror vdev?
If you can afford it, Optane 900p or 905p range for special vdevs. Expensive but read my resource on choosing SSDs to know why.

If not, then any Intel data-centre NVMe drive, or Samsung Pro SSDs (ideally not EVO, and never QVO or the PM series and other OEM models, as those have great read rates but very poor mixed/write rates, or use cache to mask serious QLC NAND drawbacks). Both are very good, and widely regarded as the most dependable SSDs out there for serious sustained use. Second-hand eBay is fine for those.

Aim for SSD models with good sustained IO rates - for example, SSDs that describe themselves as NVMe, U.2, or M.2 "M" key (only!). It doesn't matter if they are tiny SSD cards, PCIe add-in devices or 2.5-inch form factors; as long as they are one of those, they will work at NVMe speeds. Avoid SATA, M.2 "B" key, and M.2 SATA if possible, but that said, at a pinch a good SATA SSD from the above model ranges will still make a hell of a difference over HDD and might be enough. I didn't bother trying those, however.

PCIe and M.2 SSDs and adapters come in both NVMe and SATA variants. To be sure which you have, check whether it says NVMe/"M" key/U.2 (it's NVMe) or SATA/"B" key (it's SATA). Also, if it can push more than about 125k IOPS or 800 MB/sec, it's probably NVMe, since those figures are beyond what SATA can do.
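If a drive is already installed and you're still not sure which bus it's on, FreeBSD/TrueNAS will tell you directly from the console. A quick sketch - device names will differ on your box:

nvmecontrol devlist    # NVMe devices (nvme0 controllers, nvd0/nda0 namespaces) list here
camcontrol devlist     # SATA/SAS devices (ada0, da0, etc.) list here instead

A drive that only shows up in camcontrol devlist is on the SATA/SAS stack, however it's packaged.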

U.2, M.2 "M" key, and anything that explicitly says it's NVMe all use PCIe x4 for connectivity, so an adapter is literally just routing the same electrical signals between different contact layouts - it's all interchangeable electrically, and adapters exist. M.2 "B" key and SATA disks use SATA signalling (over the M.2 slot and standalone SATA buses, respectively). They aren't the same thing.

As of August 2020, I wouldn't consider anything else, and you shouldn't need to. These SSDs will have to handle long-term sustained mixed 4k read/write loads and heavy levels of data writes, and most SSDs won't deliver their headline figures on that kind of activity for long periods, or they have less dependable controllers. The SSDs you use need to be fast enough and low enough latency that they almost don't need internal DRAM caching, or at least don't suffer too badly when it's fully used; otherwise, when the SSD cache fills, performance can plummet (look for reviews with sustained read/write/mixed IO charts).

Which is basically: Optane 900p/905p first by a dozen miles, then Intel data-centre NVMe (P37xx, or at a pinch P36xx and the 750 NVMe, all cheap second-hand), then Samsung 850-970 Pro, only because they are built like tanks for reliability and efficiency among enthusiast SSDs, then Intel DC S37xx/S36xx SATA, and finally, if desperate, Samsung 850-970 SATA or EVO as a last-ditch resort. Among all those: NVMe, M.2 "M" key/PCIe if at all able. And nothing else.

(OK, one exception: Mayyyyyybe Intel P48* Optane if you need datacentre guarantees, have thousands to spend, and 900p/905p isn't good enough!)

Read the resource I linked and the couple of backing articles linked in it, to understand why, if you (or anyone) is unfamiliar with bathtub curves.

As for power loss, honestly, unless you're mission-critical the worst that will happen is a small rollback of the last little bit of data. But that's what HDDs traditionally do on power loss anyway. Graceful handling (what's already been written staying good and readable on reboot, even in bad power-off situations) matters a lot more, and most modern controllers do that. Optane is so fast it almost doesn't need any power-loss mitigation anyway, so that's another plus.

SPECIAL VDEV SIZING AND STRUCTURE:

As of August 2020, assume you may need them bigger than you think - I'm still trying to understand why. My pool has 190M DDT entries at ~750 bytes each according to ZFS (zpool status -Dv), and regular metadata is said to typically be 0.1-0.2% of pool size, so at 40 TB I expected it to take maybe 200 GB or so, but it's well over 300 GB - I don't know by how much, or why, yet. And that's just at this point in time, newly rewritten and unfragmented, not including future data and fragmentation. So err with a lot of caution on size - but you can always add a second special vdev later. The giveaway will be a special vdev sitting at about 76% or 86% full in the zpool list -v capacity column. That means ZFS has stopped trying to fill it - there's a tunable for how full a vdev will be filled if the others aren't that full, usually 75% I think; I increased mine to 85%. After that it will begin dumping metadata back on the HDDs instead.
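If it helps, here's that arithmetic as commands, using my pool's numbers purely as an illustration. The tunable at the end is named from memory (I believe it's the special-class metadata reserve, default 25%, i.e. fill to roughly 75%) - treat that as an assumption and verify the exact name and meaning on your build before setting anything:

# DDT entry count and per-entry sizes (the "dedup:" summary line):
zpool status -D POOLNAME | grep "DDT entries"

# Rough special vdev budget:
#   190e6 entries x ~750 B on disk ~= 140 GB for the DDT alone,
#   plus regular metadata at 0.1-0.2% of 40 TB ~= 40-80 GB -> the ~200 GB I expected.

# How full each vdev actually is - watch the CAP column on the special mirrors:
zpool list -v POOLNAME

# Spill threshold (assumed name, check yours): lowering the reserve raises the fill point.
sysctl vfs.zfs.special_class_metadata_reserve_pct=15    # ~85% fill instead of ~75%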

The problem then is that even if you buy another SSD vdev, the metadata/DDT that has *already* spilled onto HDD won't be moved unless it becomes redundant and unneeded, i.e. effectively never - or until you SSD-ise or replicate/rewrite your entire HDD pool (device removal won't work, because once it no longer fits on the special vdevs it's striped across ALL the HDDs). So you really don't want to get close to special vdev capacity if avoidable, because you won't know it's happening until it's already happening.

As this is crucial data: mirrors, always, for redundancy and efficiency (and resilver speed if needed). And only mirrors. You do *not* want a special vdev on parity raidz. If the webUI says it wants them as raidz, override it.
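For reference, adding one from the command line looks roughly like this - a sketch with placeholder device names, so triple-check them, because a special vdev generally can't be removed again from a pool with raidz data vdevs:

# Add a mirrored special (metadata) vdev - always a mirror, never raidz:
zpool add POOLNAME special mirror /dev/nvd0 /dev/nvd1

# Optionally also steer small file blocks (not just metadata/DDT) onto it:
zfs set special_small_blocks=16K POOLNAME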


UPDATE: ADDED THIS TO THE ABOVE RESOURCE LINK AS WELL :)


I'm sure the answer will be Optane :smile:
Yes, if able :) You know me well on that!

Strictly, I don't know if it's "needed". Maybe a 1TB Samsung 970 PRO NVMe card would be fine and a fifth the price of a 905p 1TB SSD. Maybe good enough to avoid the OP scenario?

The problem is that those special vdevs will be doing sustained mixed IO. Scrubbing while in use over SMB as well? iSCSI mixed read/write? Multi-client mixed tasks? Who knows. I don't want a pool that stalls horribly because of a mixed workload, and I know it may well need a quarter-million IOPS even under a mixed workload for some tasks, plus good write-during-read latency. Most SSDs, even data-centre models, just can't do that (see the resource again). No other SSD technology can really deliver that, at those levels.

That, and not being a fanboi, is why... I want my pool dependably fast, not surprise-stalling depending on the task.
 
Last edited:

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
You do *not* want a special vdev on parity raidz

Indeed I don't think you even can, and the WebUI should stop suggesting raidz already. I consider that a bug. The UI needs to insist on mirrors.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Indeed I don't think you even can, and the WebUI should stop suggesting raidz already. I consider that a bug. The UI needs to insist on mirrors.
You're looking for NAS-106762 ("Improvements to 'Create Pool' UI").

I already reported it. Bullet #10, 1st comment. Maybe you'd like to emphasise it? I didn't put it quite that strongly, but I broadly agree - or a serious warning at least.
 
Last edited:

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
While there are many valid points in the above answers, the basic diagnosis so far is incorrect. The correct explanation is obscure and very different.

Once I saw the very clear description of symptoms in your first post I looked for dedup being on in your screenshots/outputs, and knew I'd see it. This collection of baffling behaviours is exactly what I had in the past, and this is what's actually happening.

The good news is, if your other hardware is good enough (sufficient RAM etc), your issues will probably be 100% resolved if you switch to special metadata SSD vdevs in TrueNAS-12 and replicate your pool to move metadata there (zfs send -R | zfs recv). That will fix it.

THE ISSUE:

In brief, deduplication places an immense demand on the system. Everyone knows it demands a lot of your RAM capacity. What's far less well known is the incredible levels of demand that dedup *also* places on 4k random IO and (not so relevant here) CPU for hashing.

With dedup enabled, *every* block read or written will *also* need multiple dedup table (DDT) entries read or written. That's inherent in dedup: all blocks are potentially deduped, not just the file data. That's what dedup is - you're reducing pool space requirements at the cost of very high 4k random IO and CPU usage (for dedup hashing). All those blocks need their DDT entries looked up, on any read or write.

To give an idea of scale, my pool has 40 TB of actual data deduped down to 13.9 TB (about 3x), and it needs almost 200 *million* DDT entries to do it. Yours is much less heavily deduped (~1.7x). You can see how many entries are in your dedup table using zpool status -Dv.

Without an all-SSD pool or (v12 only) SSD special vdevs, that 4k random IO is what's ultimately destroying your pool's responsiveness and triggering the hangs. But it's doing it in a nasty, indirect way, and taking the networking buffers and network IO capability down with it.

You can check this conclusively using gstat, but I know what you'll see.

Reboot (to clear any dedup/metadata/cached data from RAM/ARC/L2ARC). After reboot, run gstat on the console so it's not working across a network (something like gstat -acsp -I 2s), and with the network idle, do a nice big file transfer of the kind that usually falls over badly after a while. Writing a single 30GB to 100 GB file from a client to the server is a good way to make this behaviour show up very clearly.​

THE SOLUTION:

Either un-dedup and stop using dedup, or move to 12 and migrate your data to force a rewrite onto a set of disks that includes special SSD vdevs - and make sure they are big enough, good ones, and (of course!) redundant. At this point I'd say wait for 12-RELEASE, or at least 12-RC1, which is due out in a couple of weeks and fixes some important things. But even 12-BETA1 was rock solid for data safety AFAIK, and BETA2 is nice.

That's the real fix here.
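If you go the migration route, the replication itself is just the usual snapshot-and-send sequence - a sketch only, with placeholder pool names, assuming the new pool was created with its special SSD mirrors from the start:

zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | zfs recv -F newpool

# Everything is rewritten on receive, so metadata/DDT lands on the special vdevs.
# Afterwards, check where it ended up:
zpool list -v newpool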

You can also try TrueNAS 12 tunables to load (and not evict) spacemap metadata, preload (and not evict) the DDT at boot, keep the L2ARC warm if you have one, and maybe reserve a certain amount of RAM/ARC for non-evictable ZFS metadata ("vfs.zfs.arc.meta_min" as a loader tunable, I think?), if you have plenty of RAM.

Note that tunables alone didn't help much on 11.x for me, because the all-important tunables are missing - a way to tell it to preload the entire DDT so it never has to do that scale of 4k DDT reads "on the spot" during a live file transfer, or, having loaded it the slow way over time, to at least keep it "warm" in L2ARC and not lose fast access to it.
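For what it's worth, the kind of tunables I mean look roughly like this on 12. Names are quoted from memory, so treat them as assumptions and check them against your build (System > Tunables, or sysctl -d) before setting anything:

# Reserve ARC space for metadata that shouldn't be evicted (bytes; size it to your RAM):
sysctl vfs.zfs.arc.meta_min=17179869184      # 16 GiB, purely an example value

# Persistent L2ARC across reboots (new in OpenZFS 2.0 / TrueNAS 12):
sysctl vfs.zfs.l2arc.rebuild_enabled=1

# Allow prefetched (streaming) blocks into L2ARC as well, so it warms up faster:
sysctl vfs.zfs.l2arc.noprefetch=0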



You just solved a problem that plagued me for THREE FREAKING YEARS!!!

What's your paypal address...?
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
You just solved a problem that plagued me for THREE FREAKING YEARS!!!

What's your paypal address...?
If serious, please redirect anything you feel inclined to give to any of the following - no need to say more, but thank you deeply, it will mean a lot: the OpenZFS foundation, Reporters Without Borders, Médecins Sans Frontières, and (because it's dear to my partner) any bona fide group doing positive work for the broader trans community/individuals.

If not serious, don't think twice, ignore the above.

Also, yay!! Glad!!
 
Joined
Jan 4, 2022
Messages
6
While there are many valid points in the above answers, the basic diagnosis so far is incorrect. The correct explanation is obscure and very different.

Once I saw the very clear description of symptoms in your first post, I looked for dedup being enabled in your screenshots/outputs, and knew I'd see it. (I'll also guess you may be running fast network connections too, maybe 2.5-10 gigabit or more - see later on.) This collection of baffling behaviours is exactly what I ran into in the past, and this is what's actually happening.

The good news is, if your other hardware is good enough (sufficient RAM etc), your issues will probably be 100% resolved if you switch to special metadata SSD vdevs in TrueNAS-12 and replicate your pool to move metadata there (zfs send -R | zfs recv). That will fix it.

THE ISSUE:

In brief, deduplication places an immense demand on the system. Everyone knows it demands a lot of your RAM capacity. What's far less well known is the incredible levels of demand that dedup *also* places on 4k random IO and (not so relevant here) CPU for hashing.

With dedup enabled, *every* block read or written will *also* need multiple dedup table (DDT) entries read or written. That's inherent in dedup: all blocks are potentially deduped, not just the file data. That's what dedup is - you're reducing pool space requirements at the cost of very high 4k random IO and CPU usage (for dedup hashing). All those blocks need their DDT entries looked up, on any read or write.

To give an idea of scale, my pool has 40 TB of actual data deduped down to 13.9 TB (about 3x), and it needs almost 200 *million* DDT entries to do it. Those 200 million DDT entries are each just a few hundred bytes long, so whether reading or writing, it's all pure 4k random IO. Yours is much less heavily deduped (~1.7x). You can see how many entries are in your dedup table using zpool status -Dv.

Without an all-SSD pool or (v12 only) SSD special vdevs, that 4k random IO is what's ultimately destroying your pool's responsiveness and triggering the hangs. But it's doing it in a nasty, indirect way, and taking the networking buffers and network IO capability down with it.

You can check this conclusively using gstat, but I know what you'll see.

Reboot (to clear any dedup/metadata/cached data from RAM/ARC/L2ARC). After reboot, run gstat on the console so it's not working across a network (something like gstat -acsp -I 2s), and with the network idle, do a nice big file transfer of the kind that usually falls over badly after a while. Writing a single 30GB to 100 GB file from a client to the server is a good way to make this behaviour show up very clearly.​
As the transfer progresses, and when it slows, stalls, or gets close to misbehaving, watch what disk IO is going on and which block sizes predominate. For HDDs, also listen for when they start and stop chattering, compared to what gstat says is going on.​
Intuitively, for a single big file being written, you'd expect to see lots of disk writing (for example 128K writes, if it's a single file and the pool has plenty of free space). Instead you'll probably see it largely stuck in long phases of processing hundreds or maybe thousands of 4k or mixed-size reads.​
(The mixed sizes appear if your system is also stalling on loading spacemaps. But DDT is always 4k, and that's the vast majority of the problem. Another thing you should see, but probably won't, is a regular 5-second heartbeat as ZFS builds up IO for 5 seconds and then writes it to disk - transaction groups, or TXGs. You won't see that nice clean pattern in your pool for long, if at all, because of the issues described below. After upgrading to special vdevs, that heartbeat came back on my system in gstat.)
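Put together, the whole check is only a few commands - a sketch, with the share and file sizes being whatever suits your setup:

# After a clean reboot (ARC/L2ARC cold), on the local console, not over SSH:
gstat -acsp -I 2s
#   -a only devices doing IO    -c include consumers    -s per-request size stats
#   -p physical providers only  -I 2s refresh every 2 seconds

# Then, from a client, write one big 30-100 GB file to the share and watch:
#   healthy pool:  bursts of large (~128K) writes every ~5 seconds (TXG commits)
#   dedup-choked:  long stretches of hundreds/thousands of 4k reads on the HDDs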

That highly demanding level of 4k IO, inherent in dedup, is what sets off a cascade of escalating system problems.

ZFS treats DDT access as part of block access generally, so for throttling purposes it doesn't seem to notice that your disk IO is now invoking, in the background, much larger numbers of 4k random IOs to fetch the DDT data needed to execute the actual writes - plus spacemap data to know where to put them (spacemaps are 4k random metadata too...).

When I say dedup invokes highly demanding levels of 4k IO, the sheer scale isn't obvious at first. My deduped pool on TrueNAS-12-BETA, which *doesn't* have this issue any more, sucks up 4k IOs at half a *million* reads/writes a second - and not just briefly, but long and often enough for it to be a regular level of disk IO. Not just the few thousand 4k IOPS which HDDs can deliver.

View attachment 41232

That rather shocking figure isn't exaggerated.
This was gstat on my pool, once I moved DDT to special vdevs on 12-BETA1.
See the 2 x 1/4 million 4K IOPS on both of the mirrored metadata vdevs?
(mine were writes not reads, because this was during replication)
That's the backlog your pool is choking on, and unable to process because an HDD pool just can't do that.

So the read/write disk throttles don't control that 4k metadata access as much as needed, and eventually a backlog builds up from this 4k IO pressure which the pool just can't respond to quickly enough. Because it's a deduped pool, *nothing* can happen pool-wise until the relevant DDT entries for all data being processed have been loaded into RAM... except they can't be read nearly fast enough, and the backlog builds up *really* badly.

By this point you've got minutes to tens of minutes of 4k IO backlog already built up, even if it were processed at full speed on a fast HDD pool with plenty of RAM, and even if nothing else arrives.

The next step in the cascade is that RAM fills up with the backlogged file-handling queue. That's why it runs happily at good speeds for a while anyway: ZFS is using RAM to buffer the backlogged file transfers. It might tell the source to slow down a little, but by and large at this point everything's still mostly happy and nothing's in meltdown... yet ("the system will burst extremely fast (900MB/s+) for a few seconds as expected. Then the speeds will settle down to about 250MB/s for about a minute").

But that can't last forever. Eventually if the demanded file read/write is enough GB, or ongoing enough, the backlog starts to invoke a more serious RAM throttle as RAM fills up to the maximum it's allowed to use. So the next step in the cascade is triggered within the wider OS (not ZFS), telling whatever is sending the data to slow down even more, until the backlog clears a little.

Usually that solves it. But in this case, you've got ZFS massively choking on 4k already, and *gigabytes* of incoming data from your file *already* accepted into a queue/buffer in RAM, waiting to be *further* transacted. All of that data needs its own 4k IOs on a deduped pool - a backlog that can *easily* trigger 0.4 to 4 *million* 4k random IOs for the related DDT records while processing (depending on the size of the backlog in RAM, your dedup level, what's already in ARC and many other things). And your pool is simply floundering, unable to begin making inroads into that massive 4k IO demand.

So the usual remedial action doesn't work as expected. The source is indeed told to slow down. The source of your files is network traffic, and the way a networked server tells a networked source to send slower, or to pause sending for a bit to catch up, is to lower the TCP window, telling the source not to send as much. At a pinch, if the problem continues, it can lower it all the way to zero ("zero window"). If you use netstat or tcpdump on the console (look up "tcpdump tcp window" or "tcpdump zero window" on Google), or Wireshark or tcpdump on the client, you'll see that's being handled absolutely correctly. Unfortunately it's just not much help at this point: whatever is sent still can't get out of the network buffers into the file-system stack (i.e. VFS or ZFS), because those are still choking, and they're going to continue choking for ages.
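To see that throttling for yourself, a capture filter on the TCP window field is enough - a sketch, with the interface name (ix0) and SMB port (445) as placeholder assumptions for your setup:

# On the TrueNAS console: packets the server sends advertising a zero receive window
tcpdump -ni ix0 'tcp port 445 and tcp[14:2] = 0'

# Or just watch the advertised window collapse and stay low during a stall:
tcpdump -lni ix0 'tcp port 445' | grep -o 'win [0-9]*'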

So the TCP window never really gets to bounce back. It's hammered down to near zero constantly for minutes, because the system simply can't accept more data. Any opening in the buffers (network or file system) is instantly swallowed by the backed-up 4k work and makes no real difference whatsoever; it just becomes another item in a stalled backlog that stops anything new being accepted at more than a few bytes at a time. That's a continuing situation.

(Note: network speed is a contributing/triggering factor as well. It's not that 10G is an issue, far from it. But 1-gigabit networking acts as a 120 MB/sec throttle on the incoming file data from the start. ZFS might or might not just about cope with the 4k IO load at that speed, but it usually doesn't stand a chance at anything more, which can fill the pipeline 2.5-100x faster. The slower speed of 1G Ethernet also means the backlog builds up much more slowly, so there's much more time to get through the workload. I never saw this situation happen when, by chance, I downgraded my 10G LAN link to 1G. But I guess you might see it even on 1G with some combinations of CPU speed, pool IO speed, and RAM size.)​

By now the 4k DDT traffic has backlogged ZFS IO, the file system stack, and now the network stack, which has told the source "don't send for a while" by signalling a zero or very low TCP window.....

The source checks for a while - "Can I send yet?" - but the network stack never really stops being hammered, so 90% of the time the reply is "NO, WAIT!" Eventually, by chance, it gets told "NO, WAIT!" several times in a row (a client might only check about 4-6 times over the span of a minute if refused). At that point the client gives up and assumes an error or issue has occurred at the remote end. That's your SMB/NFS/iSCSI session dying. Your sessions can't reconnect for the same reason they disconnected in the first place - the network packets for negotiating a reconnection are also being stalled or getting "TCP zero window" responses. ("The system will become unresponsive ... It will stay hung for upwards of 5 minutes sometimes causing applications to fail along with any file transfers").

But that's not the end of the fun. The same network stack that handles those also handles the webUI and SSH, and even though those are low-traffic, they get told "TCP zero window" too. At some point the same thing happens to them: they conclude the server isn't responding properly and disconnect *their* sessions as well, after a minute or so of having zero ability to send data across the network to the NAS ("ssh will disconnect, the WebGUI becomes unresponsive").

And finally, after all that, the sting in the tail. With no more incoming data, ZFS can finally and slowly catch up. That takes a few minutes. But what's happened? Oh dear! Your file transactions got cancelled. Which means there's immediately a new set of ZFS tasks to run, because ZFS/Samba/NFS/ctl (for iSCSI) now has to *unwrite* all of that stuff again, to unwind the part-completed transaction and keep the pool clean(!).

While RAM is emptying, Samba/NFS/ctl and your network stack can fill it up far faster than ZFS can process it with its 4k overhead, so whatever happens, it's in effect kept at TCP zero window throughout, until the backlog is almost totally clear.

After all that, finally, your server can begin to reaccept data properly. That's when your clients can finally reconnect.

LOCAL (NON-NETWORKED) FILE TRANSFERS:

Be aware this can just as easily happen in local file transfers within the server too, not just networked ones. But those don't disconnect so completely as a rule, because they are much more tightly managed by the same OS, so there's much less to fall over, and much better knowledge of whether an actual fault has arisen instead of just judging by network responses. In many cases they just suffer huge long pauses and slowdowns instead.

THE SOLUTION:

Either un-dedup and stop using dedup, or try switching to a 1G (or slower) network link if you're on anything faster, or move to 12 and migrate your data to force a rewrite onto a set of disks that includes special SSD vdevs - and make sure they are big enough, good ones, and (of course!) redundant. At this point I'd say wait for 12-RELEASE, or at least 12-RC1, which is due out in a couple of weeks and fixes some important things. But even 12-BETA1 was rock solid for data safety AFAIK, and BETA2 is nice.

That's the real fix here.

You can also try TrueNAS 12 tunables to load (and not evict) spacemap metadata, preload (and not evict) the DDT at boot, keep the L2ARC warm if you have one, and maybe reserve a certain amount of RAM/ARC for non-evictable ZFS metadata ("vfs.zfs.arc.meta_min" as a loader tunable, I think?), if you have plenty of RAM.

Note that tunables alone didn't help much on 11.x for me, because the all-important tunables are missing - a way to tell it to preload the entire DDT so it never has to do that scale of 4k DDT reads "on the spot" during a live file transfer, or, having loaded it the slow way over time, to at least keep it "warm" in L2ARC and not lose fast access to it.
Thank you for this writeup! I had initially enabled dedup on my VM backup datastore, thinking it would be helpful, but it turns out it was using much more RAM than I thought it would, and I was on my way to having a similar problem to the OP. I've since rebuilt that datastore without dedup, and my statistics in arc_summary around metadata cache hits seem more normal now.

I probably would have been fighting a similar issue as OP for quite some time if I hadn't run across your post here.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Thank you for this writeup! I had initially enabled dedup on my VM backup datastore, thinking it would be helpful, but it turns out it was using much more RAM than I thought it would, and I was on my way to having a similar problem to the OP. I've since rebuilt that datastore without dedup, and my statistics in arc_summary around metadata cache hits seem more normal now.

I probably would have been fighting a similar issue as OP for quite some time if I hadn't run across your post here.
Good that we help one another, isn't it!!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@Stilez Perhaps you can take your post and make a resource or sticky thread out of it. Maybe using a title that indicates using De-Dup is problem-prone:

  • De-Dup, if you can't trouble shoot it, don't use it
  • De-Dup, How not to make it suck
  • Must not use De-Dup, (or if you must, what you need to know)
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
@Stilez Perhaps you can take your post and make a resource or sticky thread out of it. Maybe using a title that indicates using De-Dup is problem-prone:

  • De-Dup, if you can't trouble shoot it, don't use it
  • De-Dup, How not to make it suck
  • Must not use De-Dup, (or if you must, what you need to know)
You mean like


or


? :)

There isn't a problem with dedup, and it doesn't need catchy scare titles. There is a mild issue with ZFS throttling of hash functions (scrub, dedup), and there was an issue with optimising DDT access prior to the introduction of special vdevs. Meaning: there's so much emphasis on RAM as a limiting resource that people don't remember CPU and storage IO are both resources that can be starved too. They're known issues now.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Understood. My main comment was that you wrote a long, and very useful post, instead of linking to another with the answer.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Well, the answer is in the TrueNAS docs themselves.....


and especially the troubleshooting section "identifying inadequate hardware"
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
@Stilez Perhaps you can take your post and make a resource or sticky thread out of it. Maybe using a title that indicates using De-Dup is problem-prone:

  • De-Dup, if you can't trouble shoot it, don't use it
  • De-Dup, How not to make it suck
  • Must not use De-Dup, (or if you must, what you need to know)


I believe Matt Ahrens (username on youtube: ahrens maybe?) gave a lecture on De-Dupe and laid out some plans for drastically improving it.

That said, I'd never use it indiscriminately again, and would only do so with designated Datasets I expected to specifically benefit from dedupe.

Maybe, hosting VMs, etc., if the boot volume were an iSCSI target and independent from the data volume.
Even then, it'd probably make booting images simultaneously, slower.
When I used it for backup images (sparse images) for recovered data, etc., it was BRUTAL. We're talking SUB 1KB/sec brutal.
It's a performance holocaust in my book. Justly earned the categorical imperative, 'Never Again.'
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I believe Matt Ahrens (username on youtube: ahrens maybe?) gave a lecture on De-Dupe and laid out some plans for drastically improving it.

That said, I'd never use it indiscriminately again, and would only do so with designated Datasets I expected to specifically benefit from dedupe.

Maybe, hosting VMs, etc., if the boot volume were an iSCSI target and independent from the data volume.
Even then, it'd probably make booting images simultaneously, slower.
When I used it for backup images (sparse images) for recovered data, etc., it was BRUTAL. We're talking SUB 1KB/sec brutal.
It's a performance holocaust in my book. Justly earned the categorical imperative, 'Never Again.'
Well, dedup like many technologies, should never be used indiscriminately, that's for sure.

You won't get the performance you want if you indiscriminately go "that sounds pretty, let's have it!" (Not saying you did, but the point is, with enterprise-grade tools one can't arbitrarily expect to get away with not reading the manual and noting its warnings about certain deployment configurations - and dedup is warned about heavily in the docs, more than most things.)

Again: know your tools, and decide. Dedup is there. If one wants it and finds it advantageous, then the flip side will be either heavier-duty hardware that can support it, or trashed performance.

Pick either. But you don't get to pick neither, and shouldn't complain at the inevitability of a compromise between competing wishes, which requires deciding what matters more and how to balance them.

In my case dedup and performance mattered more. I use mirrors not RAIDZ, which costs a lot more, I use multiple mirror vdevs for parallelism which costs more, I use a better CPU than I might for extra cores, which costs more, I use very high performance mirrored SSDs for metadata which costs more, I accept the extra disks require better PSU which costs more....

Like everything in IT, it comes down to knowing what you want and what you'll have to compromise to get it. I wanted resilience and performance with dedup, because it's still worth it for me, even after the extreme extra cost of hardware that's needed to make it not suck.

Equally, "this isn't tenable and I don't need it that badly anyway, so let's never use it again" is also valid. It depends on your mindset, priorities, the risks you want to avoid or benefits you want to gain, and your uses.

The docs try to guide on that, but ultimately nobody can know what's best for a given use, except the user/s doing it. Try things till you find what works for you, which sounds exactly like what you did, and what I did too.
 
Last edited:

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Well, dedup like many technologies, should never be used indiscriminately, that's for sure.

You won't get the performance you want if you indiscriminately go "that sounds pretty, let's have it!" (Not saying you did, but the point is, with enterprise-grade tools one can't arbitrarily expect to get away with not reading the manual and noting its warnings about certain deployment configurations - and dedup is warned about heavily in the docs, more than most things.)

Again: know your tools, and decide. Dedup is there. If one wants it and finds it advantageous, then the flip side will be either heavier-duty hardware that can support it, or trashed performance.

Pick either. But you don't get to pick neither, and shouldn't complain at the inevitability of a compromise between competing wishes, which requires deciding what matters more and how to balance them.

In my case dedup and performance mattered more. I use mirrors not RAIDZ, which costs a lot more, I use multiple mirror vdevs for parallelism which costs more, I use a better CPU than I might for extra cores, which costs more, I use very high performance mirrored SSDs for metadata which costs more, I accept the extra disks require better PSU which costs more....

Like everything in IT, it comes down to knowing what you want and what you'll have to compromise to get it. I wanted resilience and performance with dedup, because it's still worth it for me, even after the extreme extra cost of hardware that's needed to make it not suck.

Equally, "this isn't tenable and I don't need it that badly anyway, so let's never use it again" is also valid. It depends on your mindset, priorities, the risks you want to avoid or benefits you want to gain, and your uses.

The docs try to guide on that, but ultimately nobody can know what's best for a given use, except the user/s doing it. Try things till you find what works for you, which sounds exactly like what you did, and what I did too.

Dude, I'm pretty sure it was YOU who actually (originally) like ... a couple of years ago or so ..? DIAGNOSED and SOLVED a problem I'd endured and been on a QUEST to solve for over 2-3 years!!! After I (indiscriminately) enabled dedupe. Or maybe someone pointed me towards a thread you'd written (and for which I was BEYOND grateful!). On an array I built back in ....... 2018? Maybe 2017 ? And back then, I don't think there were the kinds of warnings present now (understandably, as it's always a matter of competing priorities & putting out the biggest fire). I mean NO ONE had inferred, asked, mentioned or written anything I'd ever read* about dedupe being (potentially) BRUTAL on performance.

But yes, for all the benefits dedupe *can* offer, of course it's at an expense. And frankly, with the special vDevs ..? That you can actually provide one for dedupe ..? Is pretty awesome now (and by itself is indicative -- or should be -- of dedupe's performance requirements). I'd imagine there are some scenarios in which dedupe makes sense. I'd love to see the kind of hardware required to result in dedupe providing equal performance (without hitting another bottleneck) while saving substantial space.

There will always be 3 desires for which you can only pick (as many as) 2. Good fast and cheap, pick 2, right ..?
For me? After my time was raped by dedupe !? It'll be a LONG TIME before I EVER experiment with dedupe again! lol. I mean, I might test it every now and then to see how brutal it is ... but I'll never implement it willy nilly.

Still, there does appear to be a future for dedupe ... IF the kinds of things Matt Ahrens mentioned come to pass ... and if the cost of a special vDev big enough for your dedupe tables comes in below the value of the storage capacity dedupe saves you. I'll just be one of the slower people to commit to it.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Dude, I'm pretty sure it was YOU who actually (originally) like ... a couple of years ago or so ..? DIAGNOSED and SOLVED a problem I'd endured and been on a QUEST to solve for over 2-3 years!!! After I (indiscriminately) enabled dedupe. Or maybe someone pointed me towards a thread you'd written (and for which I was BEYOND grateful!). On an array I built back in ....... 2018? Maybe 2017 ? And back then, I don't think there were the kinds of warnings present now (understandably, as it's always a matter of competing priorities & putting out the biggest fire). I mean NO ONE had inferred, asked, mentioned or written anything I'd ever read* about dedupe being (potentially) BRUTAL on performance.

But yes, for all the benefits dedupe *can* offer, of course it's at an expense. And frankly, with the special vDevs ..? That you can actually provide one for dedupe ..? Is pretty awesome now (and by itself is indicative -- or should be -- of dedupe's performance requirements). I'd imagine there are some scenarios in which dedupe makes sense. I'd love to see the kind of hardware required to result in dedupe providing equal performance (without hitting another bottleneck) while saving substantial space.

There will always be 3 desires for which you can only pick (as many as) 2. Good fast and cheap, pick 2, right ..?
For me? After my time was raped by dedupe !? It'll be a LONG TIME before I EVER experiment with dedupe again! lol. I mean, I might test it every now and then to see how brutal it is ... but I'll never implement it willy nilly.

Still, there does appear to be a future for dedupe ... IF the kinds of things Matt Ahrens mentioned come to pass ... and if the cost of a special vDev big enough for your dedupe tables comes in below the value of the storage capacity dedupe saves you. I'll just be one of the slower people to commit to it.
Literally, the problem wasn't really known/recognised beforehand - or at least I've found no explicit references to it. Right now the big bottleneck seems to be CPU (hashing), after 4k IO.
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Literally, the problem wasn't really known/recognised beforehand - or at least I've found no explicit references to it. Right now the big bottleneck seems to be CPU (hashing), after 4k IO.


Yeah, there's another thread in which I asked exactly that ... because there are spots in which I'm transferring ANY media other than video and the performance just becomes BRUUUTAL! Like, sometimes down in the low-KB range.

I do NOT understand POSIX or machine code, so if that's what my q translates to, I don't want to waste anyone's time.

Just seems weird for an 8 HD RAIDz2 to be so slow. (And I'm talking Read performance).
Slower than the max IOPs of a single drive even, right..? What is [THIS] caused by ..?
By that I mean ... Which hardware is responsible for this bottleneck ?

Video turns out to be the vast majority of the space I consume. But what's the real bottleneck !?

I can make a 4-8 SATA SSD array (that should indicate if it's an IOPS thing for the smaller amount of non-video data I have).

The systems I'm testing this on are:
T320 6c 2.5GHz / 3.5GHz
48GB 1333MHz ECC
8x HGST UltraStar 3.5-in 7200rpm

(I believe you've commented on other threads of mine, but at the risk of being redundant):
I'm planning to try making an 8-10 NVMe dev all flash array using a:

• Dell PowerEdge R730xd
• Dual CPU E5-2600 v4
• ≥ 8x NVMe – RAIDz2 array (though if I'm slow enough maybe dRAID will be out)
• SLOG - (maybe another 8GB RMS-200)
• L2arc - 400GB P5800x Optane or NV-DIMMs..? (need to research)
• Dual SFP28 card
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Yeah, there's another thread in which I asked exactly that ... because there are spots in which I'm transferring ANY media other than video and the performance just becomes BRUUUTAL! Like, sometimes down in the low-KB range.

I do NOT understand POSIX or machine code, so if that's what my q translates to, I don't want to waste anyone's time.

Just seems weird for an 8 HD RAIDz2 to be so slow. (And I'm talking Read performance).
Slower than the max IOPs of a single drive even, right..? What is [THIS] caused by ..?
By that I mean ... Which hardware is responsible for this bottleneck ?

Video turns out to be the vast majority of the space I consume. But what's the real bottleneck !?

I can make a 4-8 SATA SSD array (that should indicate if it's an IOPS thing for the smaller amount of non-video data I have).

The systems I'm testing this on are:
T320 6c 2.5GHz / 3.5GHz
48GB 1333MHz ECC
8x HGST UltraStar 3.5-in 7200rpm

(I believe you've commented on other threads of mine, but at the risk of being redundant):
I'm planning to try making an 8-10 NVMe dev all flash array using a:

• Dell PowerEdge R730xd
• Dual CPU E5-2600 v4
• ≥ 8x NVMe – RAIDz2 array (though if I'm slow enough maybe dRAID will be out)
• SLOG - (maybe another 8GB RMS-200)
• L2arc - 400GB P5800x Optane or NV-DIMMs..? (need to research)
• Dual SFP28 card
This is kinda "throw everything at it".

If your pool is on SSD and you have decent RAM, you probably won't benefit from L2ARC, because the data will take a similar time to pull off L2ARC as off the SSD pool. So it probably won't help any bottleneck.

Let's try to diagnose rather than guess, if we can...

It'll take some command-line work, but I'll try to make sure you don't have to go working the commands out yourself. Hopefully that's OK. Without hard data it's tricky to advise on what's going on, and hard data requires running commands.

Key question - will you enable dedup or not? And what are your pool stats? Can you dump the output of zpool status -Dv POOLNAME, zpool list -v POOLNAME, and zfs list -r POOLNAME (from memory those are the relevant ones!), and let's see what your data looks like, now.

Also, can you run gstat -acsp -I 2s on the console when it's running slow, and photo, screencap or describe what it shows the disks doing when it's crawling? What does top say about memory and CPU? Last, I can't remember the exact command (Google?), but use Wireshark on a client or tcpdump on the server, and watch the TCP window sizes being reported when this happens. In particular, do they regularly drop low and *stay low* for much of the time? Low here would be a few tens of bytes, or at most two or three hundred.
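To keep that to a single copy/paste, something like this hypothetical snippet collects the ZFS-side outputs into one file (set POOL to your pool name); gstat, top, and the tcpdump/Wireshark capture still need running separately while a slow transfer is actually in progress:

#!/bin/sh
POOL=tank    # placeholder - your pool name here
{
  echo "=== zpool status -Dv ==="; zpool status -Dv "$POOL"
  echo "=== zpool list -v ===";    zpool list -v "$POOL"
  echo "=== zfs list -r ===";      zfs list -r "$POOL"
  echo "=== arc_summary ===";      arc_summary
} > /tmp/pool_diag.txt 2>&1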

Between that lot, we should get a good idea what to say.
 
Last edited: