
dedupe setup with special allocation classes query

tangles

Neophyte
Joined
Jan 12, 2018
Messages
9
Hi All,

My eyes lit up when I saw that dedupe blocks can now be written to special vdevs, so I read the Linux man page on special allocation classes, and there's not much info about dedupe in this area that I could see. (I've yet to test in a VM.)

Q1. If I create a pool with special vdevs and enable deduplication on the pool, I'm assuming that the dedupe tables (DDTs) would be written to the special vdevs along with the other default block types.
What happens if the space on the special vdevs fills up? Would the DDT then spill over onto other (slower) vdevs like the other classes do? I'm assuming yes.
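For reference, the setup I have in mind would be something like this (device names are placeholders, syntax per current OpenZFS, so correct me if I've got it wrong):

```shell
# Hypothetical layout: spinning-disk data vdev plus an SSD mirror for
# the special allocation class. Device names are placeholders.
zpool create tank raidz2 da0 da1 da2 da3 \
    special mirror nvd0 nvd1

# Dedup is a dataset property; with the OpenZFS default of
# zfs_ddt_data_is_special=1, DDT blocks then land on the special vdev.
zfs set dedup=on tank
```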

Q2. How big are dedupe blocks?
I see there's an optional flag to exclude DDTs from special vdevs, but not the other way around, i.e. to exclude all other types of blocks except dedupe blocks. Whether this is desirable or not is something else to ponder, I guess, but is it possible to use the small-block flag to exclude all blocks except DDT blocks? I'm thinking it isn't possible.
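The knobs I've found so far look like the following (Linux module-parameter path shown; FreeBSD exposes these via sysctl, and I may well have misread the man page):

```shell
# special_small_blocks=0 (the default) keeps ordinary user data off the
# special vdev entirely; only pool metadata (and, by default, the DDT)
# is stored there.
zfs set special_small_blocks=0 tank

# The flag for excluding DDTs from the special vdev is a module
# parameter (Linux path shown):
echo 0 > /sys/module/zfs/parameters/zfs_ddt_data_is_special

# There is also a dedicated vdev class just for dedup data:
zpool add tank dedup mirror nvd2 nvd3
```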

Background to this is a graphic design business… Designers are notorious for duplicating a previous, similar job for a client to get the core files needed, such as logos, themes, etc., and so I'm wondering about enabling dedupe with special allocation vdevs now that crazy amounts of RAM aren't necessary.

Thoughts?
 

Ericloewe

Not-very-passive-but-aggressive
Moderator
Joined
Feb 15, 2014
Messages
16,248
Q1. If I create a pool with special vdevs and enable deduplication on the pool, I'm assuming that the dedupe tables (DDTs) would be written to the special vdevs along with the other default block types.
Yup.
What happens if the space on the special vdevs fills up? Would the DDT then spill over onto other (slower) vdevs like the other classes do? I'm assuming yes.
Yeah. There's work underway to limit the DDT to the available space on the metadata vdev, but it hasn't been reviewed yet.

Q2. How big are dedupe blocks?
That seems to be a complicated question. Add compression (which is definitely a plus) and you can forget about having a specific number. The approach right now seems to be headed in the direction of "we tested this a bunch and came up with this value; we'll use it as a reference and warn people that it's only an estimate". I don't know what that value is; it'd be best to follow the pull requests around dedup to figure it out.
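As a purely back-of-envelope illustration: the figure usually thrown around is somewhere around 320 bytes per in-core DDT entry. That's a folk estimate, not a spec, but it shows how much the record size dominates the math:

```python
# Back-of-envelope DDT memory estimate. ENTRY_BYTES is the oft-quoted
# ~320 bytes per in-core DDT entry - an assumption, not a spec.
def ddt_ram_estimate(pool_bytes, avg_block_bytes, entry_bytes=320):
    """Rough RAM needed to hold the whole DDT for fully deduped data."""
    entries = pool_bytes // avg_block_bytes  # one DDT entry per unique block
    return entries * entry_bytes

TIB = 1024 ** 4
# 1 TiB of unique data at 128 KiB records -> ~2.5 GiB of DDT
print(ddt_ram_estimate(1 * TIB, 128 * 1024) / 1024 ** 3)
# The same data at 16 KiB records balloons the table 8x, to ~20 GiB
print(ddt_ram_estimate(1 * TIB, 16 * 1024) / 1024 ** 3)
```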

Background to this is a Graphic Design business… Designers are notorious for duplicating a previously similar job for a client to get the core files needed such as logos, themes etc and so am wondering about enabling dedupe with special allocation vdevs now that crazy amounts of ram aren't necessary.
Any chance you can use snapshots and clones? Say, hand them a script that "copies" an existing job and does the heavy lifting behind their backs. It might get complicated if they go "I'll copy this project which was a copy of another project", but it should work if the data for the old projects is static and kept around forever...
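A minimal sketch of what such a script could do (dataset names entirely hypothetical):

```shell
#!/bin/sh
# Sketch of the "copy a job" helper: clone the old project's snapshot
# instead of duplicating the files. Dataset names are hypothetical.
OLD="tank/clients/acme/job-2019-brochure"
NEW="tank/clients/acme/job-2020-brochure"

zfs snapshot "${OLD}@template" 2>/dev/null || true  # reuse if it exists
zfs clone "${OLD}@template" "${NEW}"
# Optionally cut the clone loose so the old project can go away later:
# zfs promote "${NEW}"
```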
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
2,453
so I'm wondering about enabling dedupe with special allocation vdevs now that crazy amounts of RAM aren't necessary.
FYI, the use of special/dedup vdevs doesn't necessarily negate the recommendation for "crazy amounts of RAM" - it just makes the consequences of running out of RAM less catastrophic. For best performance you still want the entire deduplication table to live in RAM, since a lookup against RAM is far faster than SSD, just as SSD is far faster than HDD. Writes also get a major benefit from being able to update the metadata/DDT on SSD.

Just don't expect it to be a magical solution; you may still want to increase arc_meta_limit in order to allocate more RAM to the DDT.
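On FreeBSD-based builds that'd look something like the following (tunable names have moved around between releases, so verify on your system):

```shell
# Check and raise the metadata share of the ARC (FreeBSD sysctl names;
# verify on your release before relying on them).
sysctl vfs.zfs.arc_meta_limit
sysctl vfs.zfs.arc_meta_limit=17179869184   # e.g. 16 GiB

# And to see how big the DDT actually is, in core and on disk:
zpool status -D tank
```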
 

JoeAtWork

Member
Joined
Aug 20, 2018
Messages
53
Just don't expect it to be a magical solution; you may still want to increase arc_meta_limit in order to allocate more RAM to the DDT.
What about Intel Optane? Optane latency is several orders of magnitude lower than SSD. If Intel and Micron keep making products, maybe the price will continue to come down so they're more reasonable. I noted that some of the low-end Optane parts have higher write latency (30us vs 10us) compared to the pricier high-end products, i.e. M10 vs 905P...

Did you also see that Intel is putting Optane and NAND flash on the same stick? LOL, I hadn't, and it was back in 2019. Not sure if that even works yet with FreeBSD; it looks like a hardware thing where the bifurcation has to go down to x2 PCIe lanes. Sounds like Intel made something that all the PLX switches can't understand and only the latest Intel chipset handles. QLC from Intel is crap.
Optane and QLC NAND flash
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
2,453
What about Intel Optane? Optane latency is several orders of magnitude lower than SSD.
Not really. Fast modern SAS SSDs (like the Ultrastar DC SS530) hit mid-20us latency; Optane gets down to 10us. Twice as fast, sure; but RAM measures its latency in nanoseconds. Tough to beat those numbers.

Not sure if that even works yet with FreeBSD; it looks like a hardware thing where the bifurcation has to go down to x2 PCIe lanes.
It's board-dependent, and most server boards will only bifurcate down to single device x4 rather than x2. Could be interesting if it does work at some point and you could put four H10 cards on a 4x M.2 carrier - would give you 4x 32GB 3D XPoint and 4x 1TB QLC in a single PCIe x16 slot.
 

Ericloewe

Not-very-passive-but-aggressive
Moderator
Joined
Feb 15, 2014
Messages
16,248
Let's talk latency, while I wait for my satellite image to process.

PCIe is low-latency when compared to ATA, but your bits have a lot of hoops to jump through:
  • The CPU calls whatever function in the kernel that will get bits from an abstract block device
  • The CPU calls whatever function in the NVMe stack that tells the SSD to get your bits
  • The commands go through the PCIe root to the SSD
  • The SSD controller needs to look up where your bits are
  • The SSD needs to set things up internally to send you your bits
  • The SSD sets up DMA to wherever it was told to do DMA
  • (non/volatile memory latency goes here)
  • The SSD transfers stuff
  • The CPU has the data
Versus accessing main memory:
  • The CPU tries to access something at address X
  • (DRAM latency goes here)
  • The IMC gets the data from DRAM
  • The CPU has the data
This is why NVDIMMs exist. They cut out half of those steps by sort of attaching NAND/whatever directly to the CPU's address space.
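To put rough numbers on those two paths (every figure below is illustrative and made up for the sake of the shape of the argument, not a benchmark):

```python
# Illustrative per-hop costs in microseconds - invented-but-plausible
# values; the point is how many hops the NVMe path has, not the exact
# numbers.
nvme_path = {
    "kernel block layer": 2.0,
    "NVMe driver + doorbell": 1.0,
    "PCIe round trips": 1.0,
    "SSD controller lookup/setup": 3.0,
    "media read": 10.0,          # NAND; 3D XPoint would shrink this hop
    "DMA transfer": 1.0,
}
dram_path = {
    "load instruction + IMC": 0.01,
    "DRAM access": 0.09,         # ~100 ns total
}
print(sum(nvme_path.values()))   # tens of microseconds end to end
print(sum(dram_path.values()))   # ~a tenth of a microsecond
```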
 

tangles

Neophyte
Joined
Jan 12, 2018
Messages
9
Thank you,

That clears up a few things for me, especially that the DDT will still try to occupy RAM before the special vdev. Most design houses don't tend to have more than a dozen designers, so I'm not sure just how crippling having the DDT on flash would actually be. I've always steered clear of dedupe, knowing that its huge RAM requirements exceed the ROI for these small-to-medium businesses.

There's no chance I'm going to get graphic designers to run anything resembling a snapshot/clone. They just don't think like that, and these are environments with no onsite admin unless an email alert or a phone call goes out. A dataset per client is the current setup, which works very well for most media houses I'm supporting.

I've got a VM running now with four virtual disks on a 10Gb network, using local flash for the special vdevs in an attempt to mimic slow vs fast physical disks, and will run through the scenario and see how it goes.
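For anyone wanting to reproduce a similar test without a VM, a file-backed sketch (paths and sizes entirely hypothetical):

```shell
# File-backed stand-ins: big sparse files for the "slow" data vdevs,
# files on local flash for the "fast" special mirror. Paths hypothetical.
truncate -s 10G /tank-test/slow0 /tank-test/slow1 \
                /tank-test/slow2 /tank-test/slow3
truncate -s 2G  /mnt/flash/special0 /mnt/flash/special1

zpool create testpool \
    raidz /tank-test/slow0 /tank-test/slow1 \
          /tank-test/slow2 /tank-test/slow3 \
    special mirror /mnt/flash/special0 /mnt/flash/special1
zfs set dedup=on testpool
```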

Thanks all.
 

JoeAtWork

Member
Joined
Aug 20, 2018
Messages
53
This is why NVDIMMs exist. They cut out half of those steps by sort of attaching NAND/whatever directly to the CPU's address space.
So 3D XPoint was just a stopgap, and where everyone now wants to be is NVDIMM? Intel/Micron lost the chance to sell a lot of kit and replace the SATA bus for caching. Now that means I have to start looking for an upgraded server for TrueNAS Core. LOL
 

Ericloewe

Not-very-passive-but-aggressive
Moderator
Joined
Feb 15, 2014
Messages
16,248
So 3D XPoint was just a stopgap, and where everyone now wants to be is NVDIMM?
No, they solve completely different problems. NVDIMMs are an alternative to the interface (NVMe, SAS, SATA, etc.). Not a straight swap, since you get a portion of the physical address space that is non-volatile, instead of a block device that you access via a certain interface.
3D XPoint is an alternative to NAND flash, in the same way that you have multiple flash technologies.

You can have NVDIMMs that use NAND flash, you can have NVDIMMs that use battery-backed RAM, etc.
 