How to size an SSD VDEV for all current metadata, now that v12 beta is out?

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I'll be using allocation classes (special vdevs) on v12. Now that the v12 beta is here, I'd like to get testing, but obviously I need to size the underlying SSDs appropriately before buying them. How do I calculate what size of metadata vdev I'll need?

  1. How do I obtain the current total size of metadata (all of it: spacemaps, DDT, file system pointers and records, etc.) using zdb or other commands?
  2. Is this the same as the amount of on-disk pool space that will be needed (MB/GB of disk space), or are there two figures, one for "metadata size" and another for "on-disk metadata size", because of block sizes vs. data record sizes? If they are different, how do I find the amount of on-disk space currently used/needed?

Note that, for spacemap/metaslab efficiency, I'm using a non-standard metaslab count (~1000 per vdev) and spacemap block size (~16K), based on a Matt Ahrens BSDCan paper (I think?), if that matters.

Also, I will be migrating the data to a new pool with blank disks on 12.0 (send/recv), to ensure that all metadata is split out to the SSDs.

Thanks
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
Unfortunately I don't know of any on-disk metadata counters other than `zdb`, which has parameters to calculate dedup table sizes. If you store large files or zvols, then apart from the dedup tables I would expect most of the metadata size to be in indirect block pointers, which take ~256 bytes (2 copies of a 128-byte block pointer) per data block; depending on the block/record size, that can be between 0.02% and 1% of pool capacity. If you store many very small files, then things like directories, file attributes, ACLs and so on may also play some role. ZFS can also be configured to store some small data blocks on the special vdev, but that is hard to predict, since it completely depends on what is stored.
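To put rough numbers on that, a minimal sketch (the pool name "tank" is just a placeholder, and the percentages are simple arithmetic from the ~256 bytes/block figure above):

```
# "tank" is a placeholder pool name. zdb -DD prints per-DDT entry counts
# plus estimated on-disk and in-core sizes per entry.
zdb -DD tank | grep "^DDT-"

# Indirect block pointer overhead: ~256 bytes (two 128-byte blkptr copies)
# per data block, i.e. roughly 256 / recordsize as a fraction of the data.
awk 'BEGIN { printf "1M: %.3f%%  128K: %.2f%%  16K: %.2f%%\n",
             256/1048576*100, 256/131072*100, 256/16384*100 }'
# -> 1M: 0.024%  128K: 0.20%  16K: 1.56%
```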

Since special vdevs can only be mirrors, not RAIDZ, the penalty between on-disk and in-RAM formats should not be huge, since the minimum allocation is 4KB, which may only become visible on some small structures. But remember that on-disk metadata is stored in 2, or sometimes even 3, copies, and on top of that you need 2x vdev mirroring. Together that gives a flash requirement of at least 4x the metadata size.
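To make the 4x multiplier concrete (the 50 GB metadata figure below is purely illustrative):

```
# 2 on-disk metadata copies x 2-way special mirror = ~4x raw flash per byte of metadata.
awk 'BEGIN { meta_gb = 50; printf "%d GB of metadata -> at least %d GB of raw SSD\n", meta_gb, meta_gb*4 }'
```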

I would propose that you create a test pool with a special vdev, copy some of your specific data onto it, and see how much space `zpool list -v` reports allocated on the special vdev. My bet would be somewhere between 2 and 10 GB of special vdev capacity per TB of data.
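Something like the following (device and dataset names are made up; adjust to your hardware):

```
# Throwaway pool with a mirrored special vdev; da0..da3 are placeholder devices.
zpool create testpool mirror da0 da1 special mirror da2 da3

# Copy a representative sample of the real data onto it.
zfs create testpool/sample
rsync -a /mnt/tank/some_dataset/ /testpool/sample/

# See how much landed on the special vdev.
zpool list -v testpool
```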
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
@mav@

May I check my understanding?

So if dedup is enabled, and I use zpool status -DD to get the block count + unique block count, then I'm looking at, say,

(number of unique blocks) x (bytes per DDT entry on disk) for the DDT, if dedup is enabled for the entire pool
+
(number of unique blocks) x (256 bytes per block) for block pointers (the 2 copies)
+
Some additional space for pointers to pointers (the indirect block tree), ACLs, 3rd copies of pointers, extra pool objects such as properties/snapshots etc. (assumed fairly minor by comparison)
So does this mean the in-use pool capacity for metadata is roughly proportional to the number of unique blocks, at perhaps 500-1000 bytes per unique block in total, if the pool is fully deduped?
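For example, a back-of-the-envelope sketch along those lines (the block count and per-entry sizes are placeholders, not measured from my pool):

```
# Placeholder inputs, not real measurements.
UNIQUE_BLOCKS=100000000      # unique blocks, e.g. from the dedup stats
DDT_BYTES=320                # assumed on-disk bytes per DDT entry
PTR_BYTES=256                # 2 copies of a 128-byte block pointer

awk -v n=$UNIQUE_BLOCKS -v d=$DDT_BYTES -v p=$PTR_BYTES \
    'BEGIN { printf "~%.1f GiB of metadata before extra copies/mirroring\n",
             n*(d+p)/1024/1024/1024 }'
# 100M unique blocks x 576 B ~= 53.6 GiB
```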
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
My configuration...
1x 2-disk mirror 120GB SATA SSD (Supermicro DOM) for boot
6x 2-disk mirrored 10TB spindle (WDC Gold) vdevs
1x 2-disk mirror 1TB nvme (Samsung EVO 970) dedupe vdev
1x single 1TB nvme cache (Samsung PRO 950)
1x 2-disk mirror SATA SSD 240GB (Intel DC-series) special vdev
Dedupe is on at the root of the pool; default compression is also enabled.

This gives me a pretty good breakdown when using zpool list -v to show allocations / usage among the VDEVs:

[screenshot: zpool list -v output showing per-vdev allocation/usage]


My data footprint is mostly SMB shares with some VMs on zvol LUNs. At the moment the system is only receiving replicated data/snapshots from an 11.3 box, with no workloads running on it directly, but I have run the VM workloads on it before (I moved them back due to a snapshot deletion issue that stalled the VMs while snapshots were being deleted).

My takeaway from this, based on what I've seen of the disk utilization:

A) The special vdev does NOT, out of the box, get hit that hard in terms of IO or consumption: 13.2G allocated/consumed for roughly 4.41T of data (that may be a deduped number). That said, out of the box "small IO" is not part of the equation; only metadata gets written to the special vdev (see the sketch further down for opting small blocks in), and if dedupe is used with no dedupe vdev configured, the dedup tables are handled here too.

B) The dedupe vdev gets POUNDED, all told: tons of IO, and the dedup tables are a bit larger than I'd expected.

Just providing this to give some real-ish world numbers on what the dedupe/special allocations look like versus actual storage; different workloads would slide the numbers around quite a bit.
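For anyone who does want small data blocks on the special vdev, it's an opt-in per-dataset property; a quick sketch (the dataset name is just an example, and 16K is an arbitrary cutoff):

```
# Send data blocks <= 16K to the special vdev for this dataset.
# "tank/smb" is a placeholder dataset name.
zfs set special_small_blocks=16K tank/smb
zfs get special_small_blocks tank/smb
```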

Also, my dedupe stats:
[screenshot: dedup table statistics]
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Dedup vdevs don't just get pounded. It's all 4K random IO, plus latency sensitivity, and most disks just love that stuff :-/
 