Dedup Special Drive?

kspare

Guru
Joined
Feb 19, 2015
Messages
508
With the intro of the special dedup drive, does this make dedup more usable now? Less reliant on memory?

We run almost 100 terminal servers on a storage server, so we are just starting to test TrueNAS and determine whether a special meta drive makes sense, or maybe start using dedup?

We typically run 11 mirrored vdevs, 256GB RAM, a mirrored ZIL and a 2TB L2ARC, with 40Gb network cards....

Looking for some more speed! lol
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
does this make dedup more usable now? Less reliant on memory?

Hi,

There is no evidence of that yet... So far, the best way to do dedup is to identify your duplicated data and manage it as such above ZFS.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Hi,

There is no evidence of that yet... So far, the best way to do dedup is to identify your duplicated data and manage it as such above ZFS.
If I were able to do what you suggest... I wouldn't need dedup? I have 100 terminal servers... all with the same version of Office and Windows... I've identified my data already... I want to take advantage of ZFS dedup for performance and space savings.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I want to take advantage of ZFS dedup for performance and space savings.

Dedup will eat your RAM at about 5GB per TB of storage. So for 12 mirrors of 2TB each, that is about 120GB of RAM, twice as much as you have already. Even if the new way of doing dedup proves to be half as heavy, you would still need 100% of your RAM for dedup alone.
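
If you want to check that rule of thumb against your actual data before committing, zdb can simulate building the DDT on an existing pool. It is read-only but slow on a large pool, and "tank" here is just a placeholder for your pool name:

  zdb -S tank

The histogram it prints ends with an estimated dedup ratio, which also tells you whether the RAM cost would buy you any meaningful space savings at all.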

Also, consider that with 100 servers hitting 24 drives, that is more than 4 servers per drive for read access. For writes, you are down to 100 servers across 12 mirrors, so more than 8 servers per vdev. There is no way you can achieve stellar performance with such a load.

What are the other bottlenecks, like network bandwidth and latency, anyway?

Your setup is overloaded and dedup will make it worse.

If I were able to do what you suggest... I wouldn't need dedup? I have 100 terminal servers... all with the same version of Office and Windows...

Then why do you have 100 of them? Beef up these servers and reduce the number you have. Considering Windows only requires about 4GB of RAM, reducing from 100 servers to 50 would save you 200GB of RAM. Also, if they are used only for Office, know that it is cloud-based now... So have people work in the cloud version instead of a pseudo-local version.

But clearly, from what you described here, dedup is looking the same here as it does everywhere else: a clear no-go and an efficient way to shoot yourself in the foot.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Why are you assuming it's the server in my signature?

Are you telling me Dedup automatically uses all the memory even if the storage has nothing on it? So if I have 12TB of usable storage but only put 1TB of data on it, it will use 200GB of RAM? Because that is what you are basically telling me. Or are you so anxious to reply that you forget people are aware you generally don't run over 50% storage with ZFS to keep performance high?

Buddy, you need to chill... if someone is running 100 VMs, they likely aren't running an amateur home build... something commercial is going on.

You internet warriors with so much to say crack me up. Just say you don't know or need more details to give a reasonable answer... people like you are what make people not want to post on here, with your know-it-all answers.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Why are you assuming it's the server in my signature?

Because in your original post, you talked about "generally using" something like that...

Post the complete description of the actual server then...

Are you telling me Dedup automatically uses all the memory even if the storage has nothing on it?

Yes and no... At activation, when there is nothing yet, dedup will not use anything. But as soon as you start doing dedup, the process will start building the deduplication table. That table will grow at about 5GB per TB of storage. That entire dedup table needs to be loaded in RAM because each and every request needs to be searched against that table. Also, that table will basically never shrink.
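
For reference, once dedup is enabled you can see exactly how big the table has grown and how much of it sits in RAM (again, "tank" standing in for your pool name):

  zpool status -D tank

The "dedup:" line reports the number of DDT entries along with their size on disk and in core, so you can compare the real footprint against the 5GB-per-TB estimate.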

don't run over 50% storage with ZFS to keep performance high.

Performance drops pretty quickly as pool usage increases... Go see this post by @jgreco for more info.

The other aspect is that it is often pretty hard and slow to add storage space to an existing pool. When you reach the 50% mark, it is a good moment to start thinking about it, because if you wait until 85%, you may not be able to put it in place in time.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Great, thanks for the reasonable responses. This is all common knowledge to someone looking into dedup on this site. As you said initially, you don't know the answer to my questions, so I'll do some R&D to figure it out.

Because I know you are wondering...

My boxes are all dual CPU, 256GB ECC RAM, a 12Gb LSI HBA, 4TB SATA drives (mostly IronWolf Pro), a 2TB P3700 for L2ARC, 2 x 800GB P3700 over-provisioned to 80GB for the ZIL, and 2 x 800GB P3700 to assign to either a meta drive or dedup in a mirror, along with Chelsio 40Gb Ethernet.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401

kspare

Guru
Joined
Feb 19, 2015
Messages
508
dual socket, 48 threads. 24 drives, we keep 2 as cold online spares.
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
dual socket, 48 threads.

So you are surely not short on CPU power. You may try to increase compression instead. Dedup charges you RAM to buy space; compression charges you CPU time instead. It is way easier and cheaper to pay in CPU time than in RAM.
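
As a rough sketch of that trade, assuming a dataset named tank/vms (swap in your own), you could raise the algorithm and then watch the resulting ratio:

  zfs set compression=gzip-6 tank/vms
  zfs get compression,compressratio tank/vms

gzip-1 through gzip-9 trade more CPU for a better ratio, and if your TrueNAS build ships OpenZFS 2.0 you also get zstd, which usually beats LZ4 on ratio for far less CPU than gzip. Only newly written blocks are affected; existing data keeps its old compression until it is rewritten.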

11 vdevs is similar to the server in your signature, so what I told you about the bottleneck is still true. It is even worse because not all 24 drives are in use. So you have 22 drives for read requests and 11 mirrors for write requests. You are now at 9 servers per vdev for your write requests.

For more speed, what you need is more drives.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
So you are surely not short on CPU power. You may try to increase compression instead. Dedup charges you RAM to buy space; compression charges you CPU time instead. It is way easier and cheaper to pay in CPU time than in RAM.

11 vdevs is similar to the server in your signature, so what I told you about the bottleneck is still true. It is even worse because not all 24 drives are in use. So you have 22 drives for read requests and 11 mirrors for write requests. You are now at 9 servers per vdev for your write requests.

For more speed, what you need is more drives.

It works incredibly well. I can't tell you how many hours of reading I've put in to develop my tuning. We host almost 1000 concurrent users on this and it works great! We are constantly looking for ways to improve our storage.

Next steps are R&D with the special vdevs, NVDIMMs (this will require a new system board, CPU and RAM) and next-gen NVMe drives... always looking to improve.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I have 100 terminal servers... all with the same version of Office and Windows.

I assume using something like VMware Horizon was ruled out due to budgetary/licensing reasons? This is almost a perfect use case for it.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Hi OP,
I think you raise several very important questions that I have also been trying to find the answers to. This is something I have been actively researching myself, so I hope my findings and research can help you. I apologise for the formatting of this, I am on my phone in bed and can't sleep lol.

From my current understanding of how this is implemented, the way Dedupe works now with a dedicated special allocation VDEV isn't much different from how it worked before.

As we all know, the community has been preaching not to use Dedupe with ZFS for years because the implementation sucks. NetApp, Pure, etc. all do it significantly better, in sane ways that don't use all of your memory.

The special allocation class doesn't fix the underlying issues with Dedupe; it is a band-aid. They are leveraging the fact that hardware has become faster and are using that to their advantage to cheat. Matt Ahrens, one of the original ZFS team members, proposed some solutions to actually fix Dedupe in 2017: https://youtu.be/PYxFDBgxFS8 None of that has yet materialized.

So, how did it actually work before the introduction of the special allocation class? Again, this is my current understanding. The Dedupe table needs to be persistent, because if you lose it during a reboot because it was only in RAM, your data has holes in it. Anything that was deduped would be garbage. Because of this, the table was stored on your main pool's VDEVs like any other data. Similar in that way to the ARC, if your Dedupe table in memory doesn't have what it's looking for, it looks on disk. Because it's not just a cache like the ARC, though, this has a more severe performance penalty than an ARC miss. You end up in a position where your main data VDEVs are spending significant amounts of their IO just doing lookups in the Dedupe table.

The special allocation class simply moves this function off your main VDEVs onto dedicated devices to help alleviate some of the performance penalty. You can think of it like an L2ARC, but the data isn't just a cache: if the special VDEV is lost, your data has holes in it and your whole pool is dead.
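
For anyone following along, attaching one looks roughly like this (the device names are placeholders, and it really should be a mirror since losing it kills the pool):

  zpool add tank dedup mirror nvd2 nvd3      # holds only the dedup table
  zpool add tank special mirror nvd4 nvd5    # holds pool metadata (and small blocks if configured)

My understanding is that if there is no dedicated dedup vdev, the DDT lands on the special vdev instead, and on the main data vdevs if neither exists.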

The jury is still out on the effectiveness of this method. No one has really reported on it. Do you still need to follow the old ratio wisdom? I think it depends on how fast your VDEV is and how busy your pool is. You would also need to do some tuning on your ARC to make sure you can fit enough metadata in it to keep your L2ARC populated and cache hits up.
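
One knob for that, per dataset, is the cache policy properties. Purely as an illustration (not a recommendation for any particular pool), you can restrict what a dataset is allowed to put in the L2ARC and check the current policy:

  zfs set secondarycache=metadata tank/vms   # L2ARC caches only metadata for this dataset
  zfs get primarycache,secondarycache tank/vms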


What makes this all more frustrating is that the operating system doesn't even expose Dedupe ratios in the GUI. In fact, it literally ignores the deduped data and counts it as if it weren't deduped, so at a glance it looks like it's not doing anything at all. There's no way of even visualizing its effectiveness without digging into the command line. It will show you the I/O of your Dedupe VDEV though, so you will be able to see how busy it is at any given time and track trends.
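
From the shell, though, the ratio is one command away ("tank" again standing in for the pool name):

  zpool get dedupratio tank
  zpool list -o name,size,allocated,free,dedupratio tank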

I hate to say it, but if you're here to ask questions of the community, I don't think you're going to get more satisfactory answers. This feature really hasn't been tested by many people here. These forums are a really good place for people to rehash the old adages, but many here are unwilling to go beyond the "orthodoxy", as you have probably noticed by being chastised for even asking the question.

I can at least say that I have tested this new feature in a very limited way, and it does work. I created a test VM, cloned it 4 times, and watched as I/O hit the special VDEV as expected. I can't tell you what performance would be like when it has to go to the special VDEV for reads, because my DDT was only about 1200 bytes so it was in RAM. But it definitely is writing to that VDEV instead of the main pool, which saves some additional I/O that would otherwise be wasted on writes. That being said, write performance seemed about the same with Dedupe on vs. with Dedupe off, which is a good sign. But you are limited by the special vdev's speed for your writes.

If your main pool is 11 VDEVs wide, you would probably need more than a single mirror of NVMe to actually be fast enough to keep up; my guess would be that you would want 2 special vdevs for Dedupe. Because you are using spinners here, I'm not really sure though. The 5GB per TB guide assumed there wasn't a second usable tier, but I don't think the special vdevs would have to be particularly large, just particularly fast. There isn't a formula or a reference, so this is all just guessing.
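
For anyone wanting to reproduce that kind of check, per-vdev I/O (including the dedup/special vdevs) is visible with:

  zpool iostat -v tank 5

Each vdev, including the dedup mirror, gets its own line, so you can see whether it is keeping up with the data vdevs during writes.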

I think the only way we're going to get a real answer here is if someone does testing in a production environment and actually reports their findings here.

I would like to say, though, that HoneyBadger does have an interesting point. Doing your workload in VMware Horizon with a pool (not a ZFS pool, lol) of either instant or linked clones would effectively solve the same problem at a different layer in the stack. Obviously, that would not help you if you are also doing traditional datacenter workloads beyond your current VDI application, however.
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Honest question: why are snapshots never the go-to solution for this kind of situation? It would be my starting point if I were ever in a position to solve for n clients. One base install snapshotted as a template, then cloned to each client. Only the delta generated after initiation would require storage. And Windows user data compresses quite well in my experience.
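
In ZFS terms that workflow is just a snapshot plus clones; a minimal sketch, with made-up dataset names:

  zfs snapshot tank/tmpl-win@gold
  zfs clone tank/tmpl-win@gold tank/ts01
  zfs clone tank/tmpl-win@gold tank/ts02

Each clone starts out consuming almost no space and only stores the blocks that diverge from the @gold snapshot over time.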
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Because infrastructure is often built in ways that were out of our control to begin with, and we need to solve these sorts of problems from different angles. Not everything is always cookie-cutter. If it were, we wouldn't have jobs.

At work we run all of our VDI stuff in Horizon with linked clones, so I don't have the same problem as the OP here. Much like you said, all I am storing is the original snapshot and the deltas.

I do have a ton of Windows Server and Linux VMs running in production there as well. They are not all identical and have different purposes. But since they all share the same base OSes, and many of the same programs, there is a lot of duplicate data. It really does add up. Because my Pure SAN there does Dedupe and compression, my datasets are between 4x and 10x smaller than they otherwise would be. FreeNAS (and now TrueNAS) can do this too, but every time someone asks the question, everyone comes out in droves scolding people for even asking about Dedupe.

We just got, for the first time in a very long time, a new feature which directly impacts our ability to actually use Dedupe as a tool, and it seems like no one cares. It's actually infuriating, really. I get that there are other solutions to this problem at times, but doing Dedupe at the block level is the only real one-size-fits-all.
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Because infrastructure is often built in ways that were out of our control to begin with

I still don't follow.


I do have a ton of Windows Server and Linux VMs running in production there as well. They are not all identical and have different purposes.
Sure, but you can still template your service portfolio and build new deployments on them. Dedup on ZFS only applies to newly written data anyway.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Linked clones may have worked in the very beginning, but sometimes you just don't have a crystal ball and things change. They're not even an option anymore, which is why dedup is attractive to me. I'm just waiting for one more NVMe drive to show up and then I can play around with a meta drive and dedup. I'll put 10-20 VMs on there and see what happens... very easy to move them on and off for me, as I have 3 FreeNAS storage servers.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Linked clones may have worked in the very beginning, but sometimes you just don't have a crystal ball and things change. They're not even an option anymore, which is why dedup is attractive to me. I'm just waiting for one more NVMe drive to show up and then I can play around with a meta drive and dedup. I'll put 10-20 VMs on there and see what happens... very easy to move them on and off for me, as I have 3 FreeNAS storage servers.
Please share what you find. I am very interested to see the results.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Please share what you find. I am very interested to see the results.
I'm going to try a test with our Veeam backup repository... I'll try a full backup copy job onto it and see how it does. It's about 60 separate jobs of all my VMs. It sits at about 25.53 TiB with LZ4 compression on.
 