Ways to leverage Optane in 12.0? (given 4k speed+latency, could Metadata+SLOG share devices, and is L2ARC less useful)

Stilez · Jul 9, 2020

So for this question, I need to give a bit of background.

BACKGROUND: MY SYSTEM UNDER 11.3:

My pool data is very large, very dedupable files (20TB of VMs), to the point that the extra hardware for dedup was worth it to save HDD costs (3.5 x dedup, I'm saving about 200TB of SAS3 raw disk cost with dedup).

To handle dedup, the structure under 11.3 has been:

4 core Xeon
3 way mirrored enterprise SAS3 for the pool (for speed/resilience)
256GB DDR4 @ 2400, to allow caching of the entire DDT and metadata as well as files, with ease. (I set a 140GB ARC reservation on metadata to prevent any eviction)
480GB Optane 905p (PCIE 3.0 x 4) L2ARC, added later, to try and improve things.
P3700 (PCIE) for SLOG/ZIL, although so far its been all FreeBSD+Samba so I havent had much sync IO to making use of it.

It only somewhat worked. The file system still really struggled, the reason being spacemap/metaslab inefficiency, and the need to preload DDT via 4k IO before any sizeable operation - reading relevant DDT from HDD took long enough for the operation to time out.

I largely fixed the metadata issue by rebuilding the pool and changing its block size/metaslab count (there's a Matt Ahrens paper on that), and also set a tunable to preloading it. But similar options don't existed for DDT. So despite my best efforts, when I write a big file >20GB, gstat showed a periodic flood of 4k DDT reads lasting 30 - 90 seconds a time, easily enough to stall the actual file operation.

Trying to load the relevant parts of a 50GB DDT using 4k IO honestly doesn't work :) Of course that left ARC warm, so rerunning the file operation worked very fast, but that's not an answer. But v12 was coming, so I bided my time.....

PROPOSED SYSTEM CHANGES WITH v12:

Under TrueNAS Core v12 I can offload metadata, spacemaps, and DDT, all to a special vdev chosen to handle 4k IO incredibly well. Also I notice for future that DDT preloading/persistent L2ARC and other things are in the OpenZFS pipeline.

So my v12 rebuild will have mirrored optane 905p 480GB as a special class vdev. With luck, the only remaining problem will be CPU demand of DDT hashing, and that's fixable with a better CPU if a problem. But then I thought......

Optane is so good at low queue depth 4k IO and generally, that it leaves me wondering about whether I can leverage this idea even further......

MY QUESTIONS:

So, I'm thinking of partitioning both Optanes' nominal 480GB, down to nominal 430GB + 50GB, so instead of 1 x 480GB mirror, ZFS sees a 430GB mirror + 50GB mirror. Redundancy is still good, and a 430GB mirror is still plenty for my foreseeable metadata+DDT needs.

Usually ZFS is recommended to use entire disks, but thats for efficiency purposes. I suspect that won't be an issue, or the gains may outweigh the losses.

As I see it, by moving SLOG to my special vdev devices, ZFS ends up with a mirrored Optane SLOG device that's probably many times faster (latency-wise) than my P3700, even if I allow for it sharing IO with a special metadata vdev, and moving from write only to mixed RW. And now it'll be redundant as well. And have mirrored speed on reads. And I'm not using sync IO much anyway so most of the time it's irrelevant.

Q: Is ditching the P3700 and moving SLOG to a partition on my mirrored Optane special vdev, a total win, or am I missing something?
I'm also considering whether I need my existing L2ARC any more. After all, it can't possibly be faster than the special vdev anyway, and it only exists to try and speed up metadata/DDT. There's plenty of RAM to cache all metadata and over 100 GB of file data as well. I only got the L2ARC to try and help with metadata/DDT caching/eviction. But with metadata/DDT on mirrored optane anyway, that seems redundant now.

Q: Is there any real benefit to keeping L2ARC given these changes, if I don't feel a need to cache file data beyond what RAM can hold, and if I set a sufficient reservation on ARC metadata to prevent eviction from RAM once loaded?

These seem straightforward but I'd like to check :)

HoneyBadger · Jul 10, 2020

"Give ZFS entire disks" is a Solarism, so it doesn't apply here, but there's another point of concern:

Q: Is ditching the P3700 and moving SLOG to a partition on my mirrored Optane special vdev, a total win, or am I missing something?

"You're missing something." Optane doesn't have the ability create multiple NVMe namespaces, so the total write bandwidth will be shared between the SLOG and special partition via the QoS method of "Thunderdome" - it's just going to be a battle, and I/O will likely be inconsistent as a result.

As an additional note, for small-block writes the Optane that much faster than the P3700. Optane only starts to run away from a 400G P3700 beyond 32K based on the SLOG testing thread:

(If it doesn't load, it's from this post - https://www.ixsystems.com/community...inding-the-best-slog.63521/page-4#post-481734 )

Q: Is there any real benefit to keeping L2ARC given these changes, if I don't feel a need to cache file data beyond what RAM can hold, and if I set a sufficient reservation on ARC metadata to prevent eviction from RAM once loaded?

Take a look at your current L2ARC hit rate to determine if it's serving anything of value before you toss it. Moving your meta and DDT to Optane will definitely lighten up the back-end vdevs for better service of actual data I/O, and there's also nothing saying you can't attach a couple of less-expensive SATA or SAS SSDs to use there instead of the Optane device(s).

Bonus edit - Check your arcstats as well and make sure that metadata isn't being unnecessarily evicted. I see a reference here ( https://www.ixsystems.com/community/threads/zfs-arc-metadata-size-minimum.74686/ ) to the "hidden tunable" of vfs.zfs.arc_meta_min that you can set. OpenZFS also has other tunables like zfs_arc_dnode_limit and an associated _percent but I don't know what there are under FreeBSD's current ZFS, or if they're exposed for tuning at all.

Stilez · Jul 11, 2020

Thanks!

For information, I saw that post, and a few others. What I'm reckoning is mixed IO - reading metadata/spacemaps/DDT, but also updating them. Possibly with usual ZFS write amplification (tree block checksums, spacemaps, copies, etc). But even Intel's datacentre SSDs have historically not handled sustained mixed loads well. This review (Tom's Hardware) looked at Optane vs highly reputed SSDs such as Samsung 960 Pro (enthusiast) and Intel 750 (datacentre).

That's what motivated me to Optane...... not the simple 4k random read, or 4k random write latency/speed, but the lack of plunging performance under mixed loads.

Stilez · Jul 12, 2020

Okay. It works. Even on default crude config (bare 12.0-BETA install and no special config) I get consistent and sustained high speeds for every which way I tested. One dropout, no idea why. WIll write up more, busy configuring right now.

HoneyBadger · Jul 17, 2020

Stilez said:
Thanks!

For information, I saw that post, and a few others. What I'm reckoning is mixed IO - reading metadata/spacemaps/DDT, but also updating them. Possibly with usual ZFS write amplification (tree block checksums, spacemaps, copies, etc). But even Intel's datacentre SSDs have historically not handled sustained mixed loads well. This review (Tom's Hardware) looked at Optane vs highly reputed SSDs such as Samsung 960 Pro (enthusiast) and Intel 750 (datacentre).

That's what motivated me to Optane...... not the simple 4k random read, or 4k random write latency/speed, but the lack of plunging performance under mixed loads.

That's one sexy linear relationship. It is geared towards "sequental mixed" rather than "random mixed" but I imagine that will only change the scale/scope of the results given Optane's ability to handle small-block and shallow-queue.

Some further backup can be found (and taken with a grain of salt, since it's from Intel themselves) in this whitepaper showing a comparison of the regular DC P3700 vs the Optane DC P4800X in a vSAN mixed workload (70/30 R/W split):

https://media.zones.com/images/pdf/...orage-increases-performance-solutionbrief.pdf

Relevant graphs have been snipped and pasted below:

And while we already know how good Optane is at writes, a little extra reinforcement for how good it is at continuing to provide good reads while under writes:

Optane could well be the exception to "don't split a device between read and write workloads" - even with no namespaces and an "every I/O for itself" QoS it seems to do extremely well.

Stilez · Jul 17, 2020

HoneyBadger said:
That's one sexy linear relationship. It is geared towards "sequental mixed" rather than "random mixed" but I imagine that will only change the scale/scope of the results given Optane's ability to handle small-block and shallow-queue.

Some further backup can be found (and taken with a grain of salt, since it's from Intel themselves) in this whitepaper showing a comparison of the regular DC P3700 vs the Optane DC P4800X in a vSAN mixed workload (70/30 R/W split):

https://media.zones.com/images/pdf/...orage-increases-performance-solutionbrief.pdf

Relevant graphs have been snipped and pasted below:

View attachment 40213

And while we already know how good Optane is at writes, a little extra reinforcement for how good it is at continuing to provide good reads while under writes:

View attachment 40214

Optane could well be the exception to "don't split a device between read and write workloads" - even with no namespaces and an "every I/O for itself" QoS it seems to do extremely well.

I hadn't seen those graphs/sources! Thank you!

HoneyBadger · Jul 17, 2020

Stilez said:
I hadn't seen those graphs/sources! Thank you!

Thank you for inspiring me to go digging. Have a few more:

https://www.anandtech.com/show/1120...-dive-into-3d-xpoint-enterprise-performance/7

Tempted to do a little bit of shenanigans with my little 32G M.2 Optane devices and see if the same "performance under mixed I/O" holds up, just at a lower scale.

Stilez · Jul 17, 2020

HoneyBadger said:
Tempted to do a little bit of shenanigans with my little 32G M.2 Optane devices and see if the same "performance under mixed I/O" holds up, just at a lower scale.

I thought about that. The trouble is that the tiny optanes I knew of, seem to be older 1st gen models with poor performance. Needs to be 900/905 era onward. I'd love to know what you get!!

HoneyBadger · Jul 17, 2020

Stilez said:
I thought about that. The trouble is that the tiny optanes I knew of, seem to be older 1st gen models with poor performance. Needs to be 900/905 era onward. I'd love to know what you get!!

Will post back if I get some results. In-guest numbers are pretty spiffy considering the low cost of these devices though.

Stilez · Jul 17, 2020

HoneyBadger said:
Will post back if I get some results. In-guest numbers are pretty spiffy considering the low cost of these devices though.

Update with IOPS and latency view as well? I think AS SSD has those, maybe CrystalDiskMark does as well?

Stilez · Aug 1, 2020

Well this is fun. Some hard data. Basic background:

3 way mirrored enterprise HDDs (3x10tb, 3x8tx, 3x8tb, 3x8tb as 4 mirrored vdevs)
2 mirrored Optane 905p 480GB, partitioned 325GB Special + 125 GB ZIL/SLOG. SAo they are doing both ZIL and vdev workloads combined. (okay 125G is way more than it should need!!)
Dedup SHA512 enabled - so there's going to be a lot of tricky IO as well as CPU demand
256GB RAM
TrueNAS Core 12-beta1
A sprinkling of tunables
One very heavy local ongoing workload - a 35 TB pool recplication to fill this pool.
Some performance output from zpool iostat -pw, showing the I/O latency and responsiveness.

Spot the Optane? At least, I *think* that's the Optane, I can't imagine that sync write latency reflects anything except ZIL/SLOG time. Look how beautifully tight the clustering is on the sync write queue! Not just extremely low, but uniformly extremely low. In fact, as we can see, it never gets to be anything other than extremely low!

For those wonderinfg about hosting both ZIL and VDEV on the same devices, this shoudl put the question to bed. That sync write queue clustered performance is onto the same 2 Optanes that are also hosting the metadata and dedup tables, that's being written at full speed as well....

Average pool replication rate - 400 MB/sec.

I decided this could be useful stuff, so I added a resource summing up some of this thread too.

Stilez · Aug 2, 2020

Also this dump, taken 28 hours (24 TB) into a 35 TB send/recv replication between my 11.3 pool and a newly created 12-BETA pool on the same machine. The 12-BETA pool is using mirrored optane special vdevs for metadata, and is dedup as well as the original pool.

Notice the hugely different latencies. I've configured it to treat any disk I/O that takes longer than 200 milliseconds as being "slow" (sysctl vfs.zfs.zio.slow_io_ms=200). It's a good way to find drives that are being queued up so much that they can't respond quickly, and how often it's happening.

The pool being *read* was created under 11.3. Its hard drives have typically reported around 0.8 - 2.4 million slow access events.
The HDDs in the newly created 12.0 pool are exactly the same model of HDD. They are also 2 way mirrors, although 4 not 3 HDD vdevs. But they are writing which is typically quite a bit slower than reading. They also have full use of the OpenZFS 2.0 tunables, features, and special vdevs for metadata and dedup. (both pools are fully deduped). We can allow that they have fewer accesses, because they can be written in large chunks not piecemeal laid down over time. After allowing for all of those, the OpenZFS 2.0 pool has a typical "slow access" event rate of about 13-18 thousand per device.
The special vdevs in the 12.0 pool are reporting just 900 slow events, or ~ 6% of the HDD slow event rate in that pool. That's despite handling all file and pool metadata, and all dedup data. (Although a lot of that is cached in ARC so it doesn't need reading back as much)

Short version, the 12-beta HDDs were "slow" (>200 ms) for disk I/O at about 1% of the rate that the 11.3 pool is doing. Because it's reporting total event counts, it also confirms they are running a lot more consistently low rates regarding latency, not just running at low rates.

Also a mass recursive delete of 957 snapshots, across 17 datasets (16269 snaps total) just took 48 minutes - dedup snap destroy rate down from tens of seconds per snap, to 1/6 of a second each.

That gain is almost entirely down to one reason: the metadata and dedup tables have been offloaded to a fast SSD vdev in 12-BETA, which can't be done in 11.3 or earlier.

Happiness!

Important Announcement for The TrueNAS Community.

Ways to leverage Optane in 12.0? (given 4k speed+latency, could Metadata+SLOG share devices, and is L2ARC less useful)

Stilez

Guru

HoneyBadger

actually does care

Stilez

Guru

Stilez

Guru

HoneyBadger

actually does care

Stilez

Guru

HoneyBadger

actually does care

Stilez

Guru

HoneyBadger

actually does care

Stilez

Guru

Stilez

Guru

Stilez

Guru

Similar threads

Important Announcement for The TrueNAS Community.