TrueNAS CORE - Metadata drives and/or ZIL L2ARC drives

dpearcefl · Jun 26, 2020

I am excited about the new metadrive feature but confused as to whether ZIL and L2ARC drives should be used at the same time? They sound like they perform different functions, but it there a point of diminishing returns? Are there any guidelines when a metadrive should be used over ZIL and L2ARC drives?

Thanks.

HoneyBadger · Jun 30, 2020

dpearcefl said:
They sound like they perform different functions, but it there a point of diminishing returns? Are there any guidelines when a metadrive should be used over ZIL and L2ARC drives?

All three have a different purpose and are intended to solve different problems.

Do you have too much "hot data" to fit in your RAM, and can't fit or afford more? Deploy L2ARC as a second-level read cache. It's not RAM fast but it beats spinning disk.

Do you have synchronous writes (eg: NFS clients, hypervisors, OLTP DBs) that have to return fast while remaining safe? Add an SLOG device, which will accelerate the response of the "write to stable storage" data flow.

Do you do a lot of metadata-heavy operations (directory listing, scans, small updates to many dozens/hundreds of thousands of files?) and find that they take far too long? Here's where a dedicated metadata vdev may help - rather than your back-end data vdevs spending time handling metadata reads and writes (especially if they're spinning disks) push this data to separate flash devices which are much faster at the random I/O that's inherent to metadata work.

If your "metadata-heavy" operations primarily result in metadata reads then you can achieve most of the same results with adding L2ARC and setting the secondarycache=meta property. It doesn't help metadata writes - for that, you need the special vdevs. But an important note though is that unlike L2ARC, where all contents are volatile and loss of the device just results in slowdown, a metadata vdev needs redundancy - metadata copies only exist on this vdev, and if you lose it, the whole pool is toast. (Edit: This also means you can't remove it after it's been added. Edit2: Apparently you can, but unless checksum on removal is implemented now, you could end up in a bad spot if you have a read error during a copy of pool metadata.) Mirrors will be heavily recommended and a triple-mirror wouldn't be unreasonable. Depending on your write workload you'll also want to use decently well-rated drives in terms of endurance. While it isn't directly "drinking from the firehose" like an SLOG is for sync writes, you will want to scale it based on how update-heavy your workload is. I'd say "mixed use" SSDs with a 1-3 DWPD rating depending on size are where you'd want to land. Not cheap 0.1 DWPD QLC, but not 25+ DWPD Optane either. (Edit3: Of course, if you can afford Optane, it's the best solution. It also doesn't suffer from increase read latency in a mixed-workload scenario.)

StorageReview did a YouTube podcast with @Kris Moore where he talks a bit about the metadata-only vdevs (sorry, "Fusion Pools" ;) ) and I've linked to that timestamp (hopefully) below.

Podcast #50: Kris Moore, iXsystems

In this week’s podcast Brian sits down with Kris Moore from iXsystems. Kris gives an update on the progress of merging FreeNAS and the TrueNAS code bases tog...

youtu.be

Edit: You'll notice I didn't mention deduplication here. While you can add separate vdevs for metadata and in fact dedup tables explicitly in TN12 it doesn't relieve the additional memory pressure or extra considerations that arise from enabling deduplication. If you truly needed it before, you'll be very happy to have these vdevs available as it will increase performance (possibly significantly) but if you didn't use it before, don't think that you can just drop a couple SSDs into a Fusion Pool as special type=dedup and enable it globally. It's still a recipe for pain if you do it wrong.

dpearcefl · Jun 30, 2020

Thank you for this most-helpful description. I'm sure many other people will benefit.

Can metadata drives be added after the fact? Or only at creation of the vdev?

Is there a rule-of-thumb for sizing the metadrive?

Patrick M. Hausen · Jun 30, 2020

They are a separate vdev and so can be added after the fact. They only cover new writes from that point onwards. They cannot be removed.
An SLOG or an L2ARC can be removed.

dpearcefl · Jun 30, 2020

Is there a rule-of-thumb for sizing the metadrive?

HoneyBadger · Jun 30, 2020

dpearcefl said:
Is there a rule-of-thumb for sizing the metadrive?

"1% of usable storage" but it's only a thumbrule and actual usage can vary wildly.

Like deduplication, the amount of metadata generated depends on the record count, not the amount of data stored. 1T of large files with a 64K average recordsize might only generate 1G of metadata (0.1%) but 1T of ZVOLs used to back VMFS datastores with average 4K and 8K records might generate 10G (1%)

Patrick M. Hausen · Jun 30, 2020

At a rough estimate of 1% you can of course ridiculously outgrow that with current technology and go for, say, 5% and never reach that capacity. 2 TB NVME M.2 drives ... check. And if this is an environment where that performance boost counts you probably won't even waste money. Your complete system will similarly be 10 to 20 times the cost of those drives.

Stilez · Jul 6, 2020

HoneyBadger said:
Edit: You'll notice I didn't mention deduplication here. While you can add separate vdevs for metadata and in fact dedup tables explicitly in TN12 it doesn't relieve the additional memory pressure or extra considerations that arise from enabling deduplication. If you truly needed it before, you'll be very happy to have these vdevs available as it will increase performance (possibly significantly) but if you didn't use it before, don't think that you can just drop a couple SSDs into a Fusion Pool as special type=dedup and enable it globally. It's still a recipe for pain if you do it wrong.

Adding to this excellent post, specifically about dedup. Dedup places different pressures on the system:

The best known one is a lot of RAM. ZFS will need access to the dedup table records (DDT) potentially for *every* disk block read/write. So you need enough RAM to hold those records in very high speed access (your ARC cache in RAM), and be sure they won't be evicted from ARC. That means you need RAM, or at worst a very good L2ARC, possibly configured as being dedicated to metadata.
...
You also need a good CPU. Deduplication checks and generates data hashes, and that puts an extra burden on the CPU. It'll slow you down a bit.
...
You need storage hardware that handles 4k I/O *really* well, until OpenZFS 2.0 progresses a bit more. The cached DDT doesn't stay cached forever, and persistent L2ARC isn't here yet. When you reboot, no matter your RAM and L2ARC, *every* block of data you R/W on the pool, will need the related deduplication records fetched. You can see this for example when you try and copy a large (10GB+) file across 10 gigabit ethernet using SMB or iSCSI, and simultaneously run gstat on the *NAS console. What you may naively *expect* is a flurry of large (1 MB + ) disk writes. as ZFS very efficiently writes out your file. But what you *get* is almost zero disk writes, and a *huge* flurry of hundreds of thousands of 4k disk *reads*. That's the relevant blocks of the DDT loading from cold, which it needs before it can do a thing with your actual file operation. Plus write amplification which adds to the burden. Unfortunately spinning disks suck at 4K I/O and this can take *minutes*. So eventually your file operation can stall or time out.

*This* part of the problem is where a dedicated metadata VDEV *can* really help. Preloading DDT, persistent L2ARC, and other OpenZFS DDT improvements are all being worked on, so we should see a lot happen within a year maybe, or 2 at most? But until they are completed and rolled out, you have no way to force the DDT to reload and it'll be hell on your deduplicated pool, until all of the DDT has been needed (and therefore loaded) if it's held on HDD.

Dedicated VDEVs do NOT help with 1 or 2.

Ericloewe · Jul 22, 2020

Just a note: You can remove the entire metadata vdev, under the same conditions as "normal" device removal. Mostly, that means mirrors only.

HoneyBadger · Jul 22, 2020

Ericloewe said:
Just a note: You can remove the entire metadata vdev, under the same conditions as "normal" device removal. Mostly, that means mirrors only.

Really. I hadn't tested this but thought that special vdev removal wasn't yet supported. I don't think I'd encourage it though unless they've started computing checksums during the removal (last I saw this still wasn't done) - so you'd definitely want a remap/scrub afterwards.

Ericloewe · Jul 22, 2020

Same as any removal, scrub before and scrub afterwards for maximum safety.

Until we get BPR, but that'll be the day.

Yorick · Jul 22, 2020

Ericloewe said:
Until we get BPR

I stand ready to ski those Hellish Glades, on that day.

J Ree · Sep 20, 2020

How do i remove the special metadata vdev? I tried using the GUI, it threw an exception. (USING TN CORE RC1)

Stilez · Sep 21, 2020

J Ree said:
How do i remove the special metadata vdev? I tried using the GUI, it threw an exception. (USING TN CORE RC1)

Can you post the output of zpool list -v

J Ree · Sep 21, 2020

Stilez said:
Can you post the output of zpool list -v

Here you go!

NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
MAIN 22.0T 9.72T 12.3T - - 0% 44% 1.00x ONLINE /mnt
raidz2 21.8T 9.70T 12.1T - - 0% 44.5% - ONLINE
gptid/59a35a41-e66b-11ea-9bc0-00155d1b2208 - - - - - - - - ONLINE
gptid/59da79fc-e66b-11ea-9bc0-00155d1b2208 - - - - - - - - ONLINE
gptid/5a7c97c7-e66b-11ea-9bc0-00155d1b2208 - - - - - - - - ONLINE
gptid/5ad54ef6-e66b-11ea-9bc0-00155d1b2208 - - - - - - - - ONLINE
special - - - - - - - - -
mirror 222G 12.5G 209G - - 4% 5.65% - ONLINE
gptid/67c01a06-eda2-11ea-9dd4-00155d1b2208 - - - - - - - - ONLINE
gptid/67d2ef65-eda2-11ea-9dd4-00155d1b2208 - - - - - - - - ONLINE
logs - - - - - - - - -
gptid/27153ef4-ee07-11ea-9dd4-00155d1b2208 232G 340K 232G - - 0% 0.00% - ONLINE
cache - - - - - - - - -
gptid/590763e3-e66b-11ea-9bc0-00155d1b2208 466G 459G 6.88G - - 0% 98.5% - ONLINE
RSYNC 3.62T 4.92G 3.62T - - 0% 0% 1.00x ONLINE /mnt
gptid/772f3ff5-e692-11ea-9919-00155d1b2208 3.62T 4.92G 3.62T - - 0% 0.13% - ONLINE
freenas-boot 9.50G 2.52G 6.98G - - 6% 26% 1.00x ONLINE -
da0p2 9.50G 2.52G 6.98G - - 6% 26.5% -

Stilez · Sep 22, 2020

J Ree said:
How do i remove the special metadata vdev? I tried using the GUI, it threw an exception. (USING TN CORE RC1)

This is the output of the info for man zpool-remove, the command used to remove disks and vdevs from a pool. My emphasis added:

zpool remove [-npw] pool device...

Removes the specified device from the pool. This command

supports removing hot spare, cache, log, and both mirrored and

non-redundant primary top-level vdevs, including dedup and

special vdevs. When the primary pool storage includes a top-

level raidz vdev only hot spare, cache, and log devices can be

removed.

The Web UI threw an exception because the underlying command failed, although admittedly it should have handled it more gracefully. But the failure is genuine, not a bug.

You can't just remove the special vdev, as best I know, because of that restriction. You'd have to rebuilt (replicate) the pool.

J Ree · Sep 22, 2020

Stilez said:
Can you post the output of zpool list -v

Here is the pic as well

Stilez said:
This is the output of the info for man zpool-remove, the command used to remove disks and vdevs from a pool. My emphasis added:

zpool remove [-npw] pool device...
Removes the specified device from the pool. This command
supports removing hot spare, cache, log, and both mirrored and
non-redundant primary top-level vdevs, including dedup and
special vdevs. When the primary pool storage includes a top-
level raidz vdev only hot spare, cache, and log devices can be
removed.

The Web UI threw an exception because the underlying command failed, although admittedly it should have handled it more gracefully. But the failure is genuine, not a bug.

You can't just remove the special vdev, as best I know, because of that restriction. You'd have to rebuilt (replicate) the pool.

Ok thanks. Then why did someone on this post say you can remove it? I am getting contradicting info lol.

Stilez · Sep 22, 2020

Because not everyone, even the best intentioned, reads all the small print on man zpool-remove. Honest mistake, special vdevs are new in this version and its easy not to know the detailed limitations that can apply to removal

J Ree · Sep 22, 2020

Stilez said:
Because not everyone, even the best intentioned, reads all the small print on man zpool-remove. Honest mistake, special vdevs are new in this version and its easy not to know the detailed limitations that can apply to removal

Damn. I guess I'm f***Ed then. I don't wanna rebuild Q_Q

Ericloewe · Sep 22, 2020

I did say...

Ericloewe said:
You can remove the entire metadata vdev, under the same conditions as "normal" device removal.

The main condition is "the pool may only have mirrors".

Important Announcement for the TrueNAS Community.

TrueNAS CORE - Metadata drives and/or ZIL L2ARC drives

Contributor

actually does care

Contributor

Hall of Famer

Contributor

actually does care

Hall of Famer

Guru

Server Wrangler

actually does care

Server Wrangler

Wizard

Cadet

Guru

Cadet

Guru

Cadet

Guru

Cadet

Server Wrangler

Similar threads