TrueNAS CORE - Metadata drives and/or ZIL L2ARC drives

kspare (Guru, joined Feb 19, 2015, 508 messages)
Is there a rule of thumb for sizing the metadata drive?

I did a little testing last night... it's very dependent on block size and *not* on compression.

The data size in this case is around 726 GB.

The Data column is the allocated data size, Meta is the space allocated on the special metadata vdev, and Block is the dataset block size.

We use our storage for terminal servers, and 32K block size looks like the sweet spot for performance, though 64K isn't far behind.

Data   Meta    Block
464G   605M    128K
726G   1.12G   64K  (no compression)
472G   1.11G   64K
491G   2.14G   32K
728G   2.15G   32K  (no compression)
574G   8.2G    8K
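
If you want to reproduce the Data and Meta columns, the per-vdev allocation (including the special vdev) can be read straight from zpool. A rough sketch, assuming the pool is named tank:

  # Per-vdev SIZE/ALLOC/FREE; the special vdev shows up as its own entry
  zpool list -v tank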
 

Stilez (Guru, joined Apr 8, 2016, 529 messages)

Can you explain a bit what we're looking at, and the interpretation? Thanks!
 

kspare (Guru, joined Feb 19, 2015, 508 messages)

I moved 5 VMs totaling 726 GB to an NFS store on my FreeNAS box, and adjusted the dataset block size (and compression) between runs to see how they affect the metadata size.
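
For reference, the per-run settings are just dataset properties. A minimal sketch, with tank/nfsstore as a placeholder dataset name:

  # Create the test dataset with a given block size and compression setting
  zfs create -o recordsize=32K -o compression=lz4 tank/nfsstore
  # Change block size between runs; recordsize only applies to data written after the change
  zfs set recordsize=64K tank/nfsstore
  # Turn compression off for the "no compression" runs
  zfs set compression=off tank/nfsstore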
 

Spoon (Dabbler, joined May 24, 2015, 25 messages)
All three have a different purpose and are intended to solve different problems.

Do you have too much "hot data" to fit in your RAM, and can't fit or afford more? Deploy L2ARC as a second-level read cache. It's not RAM-fast, but it beats spinning disk.

Do you have synchronous writes (e.g. NFS clients, hypervisors, OLTP databases) that have to return fast while remaining safe? Add an SLOG device, which will accelerate the response of the "write to stable storage" data flow.

Do you do a lot of metadata-heavy operations (directory listings, scans, small updates to many tens or hundreds of thousands of files) and find that they take far too long? Here's where a dedicated metadata vdev may help: rather than your back-end data vdevs spending time handling metadata reads and writes (especially if they're spinning disks), push this data to separate flash devices, which are much faster at the random I/O that's inherent to metadata work.

If your "metadata-heavy" operations primarily result in metadata reads, you can achieve most of the same results by adding L2ARC and setting the secondarycache=metadata property. That doesn't help metadata writes; for those, you need the special vdevs. An important note, though: unlike L2ARC, where all contents are volatile and loss of the device just results in a slowdown, a metadata vdev needs redundancy. Metadata copies only exist on this vdev, and if you lose it, the whole pool is toast. (Edit: This also means you can't remove it after it's been added. Edit 2: Apparently you can, but unless checksum-on-removal is implemented now, you could end up in a bad spot if you hit a read error during a copy of pool metadata.) Mirrors are heavily recommended, and a triple mirror wouldn't be unreasonable.

Depending on your write workload you'll also want drives with decent endurance ratings. While a metadata vdev isn't directly "drinking from the firehose" the way an SLOG is for sync writes, you will want to scale it based on how update-heavy your workload is. I'd say "mixed use" SSDs with a 1-3 DWPD rating, depending on size, are where you'd want to land: not cheap 0.1 DWPD QLC, but not 25+ DWPD Optane either. (Edit 3: Of course, if you can afford Optane, it's the best solution. It also doesn't suffer from increased read latency in a mixed-workload scenario.)
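
To make the three options concrete, here's a minimal sketch of how each gets attached to an existing pool. The pool name (tank), dataset name (tank/vmstore) and device names are placeholders, and on TrueNAS you'd normally do this through the UI rather than at the shell:

  # L2ARC: second-level read cache; non-redundant is fine, losing it only costs speed
  zpool add tank cache nvd0
  # Optionally cache only metadata for a given dataset
  zfs set secondarycache=metadata tank/vmstore
  # SLOG: separate ZIL device for sync writes; mirroring it is a good idea
  zpool add tank log mirror nvd1 nvd2
  # Special (metadata) vdev: must be redundant, since losing it loses the pool
  zpool add tank special mirror nvd3 nvd4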

StorageReview did a YouTube podcast with @Kris Moore where he talks a bit about the metadata-only vdevs (sorry, "Fusion Pools" ;) ) and I've linked to that timestamp (hopefully) below.


Edit: You'll notice I didn't mention deduplication here. While you can add separate vdevs for metadata, and in fact for dedup tables explicitly, in TN12, that doesn't relieve the additional memory pressure or the extra considerations that come with enabling deduplication. If you truly needed dedup before, you'll be very happy to have these vdevs available, as they will increase performance (possibly significantly); but if you didn't use it before, don't think you can just drop a couple of SSDs into a Fusion Pool as special type=dedup and enable it globally. It's still a recipe for pain if you do it wrong.
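
If you genuinely do need dedup, the dedup tables can likewise be pushed to their own allocation-class vdev. A sketch with placeholder pool, dataset and device names:

  # Dedup allocation class: holds the dedup tables; give it the same
  # redundancy care as the special vdev
  zpool add tank dedup mirror nvd5 nvd6
  # Dedup is a dataset property; enable it only where it actually pays off
  zfs set dedup=on tank/dedupstore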

What happens to the ZIL when you add a metadata vdev? Does the ZIL get written across all the vdevs in the pool, including the metadata vdev? In my case I have 3 mirror vdevs and one special vdev made up of a pair of Optanes. Does the ZIL exist on the mirrors only, across all the mirrors and the special vdev, or just on the special vdev? Can you tune it so that the ZIL goes to the metadata vdev only? There are no SLOGs in the system.
 

Stilez (Guru, joined Apr 8, 2016, 529 messages)

They do fundamentally different jobs, so you can have none, one, two, or all three of metadata/ZIL/L2ARC, in any combination.

A ZIL vdev captures upcoming writes, so that if you lose power, the data changes that never made it to the main pool (data or metadata writes) aren't lost. If you don't lose power, the ZIL data makes it to the main data disks (including special vdevs, if any) a few seconds later, at which point nobody cares that it's on the ZIL disks as well; those copies can pretty much be discarded once the data is safe on the pool disks. They just safeguard data during that in-between transient few seconds.

If you don't have a ZIL vdev, the same capturing and safeguarding takes place, but it's stored in the pool, tracked separately from the actual final pool data, so in that case it's spread across all pool vdevs. If the pool has a special vdev, I don't know whether an in-pool ZIL (no separate disks) will also be confined to the special vdev.

The special vdevs allow you to tell ZFS, "all data goes to these disks, BUT if it's metadata, confine it to THESE disks if possible, because they're a lot faster." Other than that, they are just part of the pool, like any other disks.
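
One way to see where those in-flight ZIL writes actually land is to watch per-vdev activity while a sync-heavy workload is running. A sketch, assuming the pool is named tank:

  # Per-vdev ops and bandwidth, refreshed every second; during sync writes,
  # the vdevs receiving ZIL blocks show the write activity
  zpool iostat -v tank 1
  # Confirm whether a separate log (SLOG) vdev exists in the layout
  zpool status tank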
 

Spoon (Dabbler, joined May 24, 2015, 25 messages)

This was my understanding. I was just wondering whether the ZIL tracking takes place across the entire pool, including the special vdev. In this pool's case it would be advantageous to confine it to the special vdev, as those drives are better equipped for the task. Is there a way I could "see" where the ZIL tracking is being written to? I understand I could partition off and make a SLOG as well; I'm just curious how it works.

For reference, the pool currently consists of three single-drive 8TB QVO SSD vdevs and a pair of 280GB Optane 900p drives as the metadata vdev, running on TrueNAS SCALE. (All to be converted to full mirrors in due course.)
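
For what it's worth, converting those single-drive vdevs to mirrors later is just one attach per vdev. A sketch with placeholder device names (the first device is one already in the vdev, the second is the new disk):

  # Attach a second disk to an existing single-disk vdev to turn it into a mirror
  zpool attach tank sdb sde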
 