Dedup Clarification

Joined
Aug 11, 2021
Messages
4
I wanted to post my experience with dedup.

People tend to throw around numbers like, "for every TB of data you will need 5GB of RAM for dedup." Or you hear or read things like, "the trade-off with deduplication is reduced server RAM/CPU/SSD performance." I think that's a bad generalization. The cost of dedup depends on the number of blocks: a smaller recordsize means smaller blocks and therefore more of them, and lots of blocks means the cost of dedup goes up. It can be estimated quite simply as [Number of Blocks] * 320 bytes = [Size of Dedup Table]. The trick is working out how many blocks you will have. Take a typical 4MB picture: with a 128KB recordsize that's about 32 blocks. Not so bad.
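To make that concrete, here is a quick back-of-the-envelope calculation (plain Python, using the ~320 bytes per dedup-table entry figure above; the real DDT overhead varies with the pool and workload, so treat the numbers as rough estimates):

```python
# Back-of-the-envelope DDT size: each unique block needs roughly a
# 320-byte dedup table entry.
DDT_ENTRY_BYTES = 320

def ddt_size_bytes(data_bytes: int, recordsize_bytes: int) -> int:
    """Worst-case dedup table size for data_bytes of unique data."""
    blocks = -(-data_bytes // recordsize_bytes)  # ceiling division
    return blocks * DDT_ENTRY_BYTES

KIB, TIB = 1024, 1024**4
for rs in (4 * KIB, 64 * KIB, 128 * KIB):
    gib = ddt_size_bytes(1 * TIB, rs) / 1024**3
    print(f"recordsize {rs // KIB:>3}K -> ~{gib:.1f} GiB of DDT per TiB of data")

# recordsize   4K -> ~80.0 GiB of DDT per TiB of data
# recordsize  64K -> ~5.0 GiB of DDT per TiB of data  (the "5GB per TB" rule)
# recordsize 128K -> ~2.5 GiB of DDT per TiB of data
```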

Originally my pool was set up with dedup and a recordsize of 4KB. It seemed like a good idea at the time: if the recordsize is small enough, you would assume there would be a lot of identical blocks. It didn't turn out that way. Very few blocks were duplicated, and the small recordsize meant there were a huge number of blocks. The hard drives were constantly working and the server had a hard time doing anything.

So I recreated the pool and set up a couple of datasets, one as a "dedup" dataset with a recordsize of 128K. My intent was to save my backups and pictures to this dataset, since those tend to contain duplicate files. As you can see below, the dataset has a nice 1.32 dedup ratio and uses about 1GB per 1TB of data. I also no longer have problems with the hard drives or CPU. Note that compression is doing almost nothing for me, especially on my large Media dataset.
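Roughly, the setup step looks like the sketch below (wrapped in Python's subprocess just for illustration; the pool name tank and dataset name backups are placeholders, not my actual names). Dedup and recordsize only apply to data written after they are set, the dedup ratio is a pool-level property, and the compression ratio is per dataset.

```python
# Sketch only: create a dedup-enabled 128K-recordsize dataset and check
# the resulting ratios. "tank/backups" is a placeholder name.
import subprocess

def run(*args: str) -> str:
    """Run a command and return its stdout."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout.strip()

# Dedup is enabled only on the dataset meant for backups/pictures;
# other datasets (e.g. a big Media dataset) are left alone.
run("zfs", "create", "-o", "dedup=on", "-o", "recordsize=128K", "tank/backups")

# Dedup ratio is reported per pool; compression ratio per dataset.
print("dedup ratio:   ", run("zpool", "get", "-H", "-o", "value", "dedupratio", "tank"))
print("compress ratio:", run("zfs", "get", "-H", "-o", "value", "compressratio", "tank/backups"))
```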

[Screenshots: dataset space usage with dedup and compression ratios]

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, the XGB/1TB rule doesn't properly take block sizes into account, so it tends to produce really weird results depending on the specifics of the workload. If you have a workload with lots of larger duplicate files, you can end up with results like what you're seeing.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The 5GB RAM for 1TB disk "rule" assumes a 64K recordsize.

The real question is whether your workload benefits more from saving 310GB of disk space, or from having 1.2GB more RAM available to ARC - and only you can answer that. Some users may also be able to have their cake and eat it too - saving space on disk, spending a little ARC on the DDT, but having a bunch of duplicate records land in the remaining ARC for a net positive effect. But then there's the impact of updating the on-disk DDTs to account for. But then there's the potential to add special/meta vdevs to counteract that. But then you have to build that vdev in a sufficiently redundant manner.
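For reference, adding such a special vdev looks roughly like the sketch below (Python's subprocess is just for illustration, and the pool/device names are placeholders, not a recommendation for any particular layout):

```python
# Sketch only: attach a mirrored "special" allocation-class vdev, which
# can hold metadata (including the DDT). Names are placeholders.
import subprocess

subprocess.run(
    ["zpool", "add", "tank", "special", "mirror", "/dev/ada4", "/dev/ada5"],
    check=True,
)
# A special vdev is pool-critical - if it fails, the pool is lost - hence
# the mirror (or better) redundancy. There is also a dedicated "dedup"
# vdev class (zpool add <pool> dedup mirror ...) for holding just the DDT.
```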

I think what I'm getting at here is "make sure you've considered all the tradeoffs first" and it looks like you've done that here by selectively enabling it rather than just attempting to apply it across the board.
 
Joined
Aug 11, 2021
Messages
4
The 5GB RAM for 1TB disk "rule" assumes a 64K recordsize.

Ah, that makes sense. Thanks for the info.

The real question is whether your workload benefits more from saving 310GB of disk space, or from having 1.2GB more RAM available to ARC - and only you can answer that. Some users may also be able to have their cake and eat it too - saving space on disk, spending a little ARC on the DDT, but having a bunch of duplicate records land in the remaining ARC for a net positive effect. But then there's the impact of updating the on-disk DDTs to account for. But then there's the potential to add special/meta vdevs to counteract that. But then you have to build that vdev in a sufficiently redundant manner.

You're right. In my case I think it's worth it. Since I tend to dump photos and other files as "quick backups", there ends up being a lot of duplication. For example, you take a bunch of pictures on your phone or camera and then copy the DCIM folder to the server. A couple of months later, after having taken a bunch more pictures, you dump the DCIM folder again as a backup to the server. You keep doing this, and gradually each folder contains a duplicate of the previous folder. Dedup makes this all convenient and efficient by storing the data once and just referencing it again in each subsequent backup. Sure, I could set up some kind of incremental backup or use Syncthing or something like that to keep all the files synchronized, but for now this is easier and I still win. It's also easier for my parents, who are not used to computers. I just tell them to copy the folder. Simple.
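As a toy illustration of the effect (just the idea, not how ZFS implements it internally): when the same files get copied again, they split into the same fixed-size records, those records hash to the same checksums, and only one copy of each unique block actually needs to be stored.

```python
# Toy model of block-level dedup: identical files chunked at the same
# recordsize produce identical block checksums, so a repeated "DCIM dump"
# adds almost no unique blocks. Illustration only, not ZFS code.
import hashlib, os

RECORDSIZE = 128 * 1024  # 128K, matching the dataset above

def block_hashes(data: bytes):
    """Split data into recordsize chunks and hash each one."""
    for i in range(0, len(data), RECORDSIZE):
        yield hashlib.sha256(data[i:i + RECORDSIZE]).digest()

photo_a = os.urandom(4 * 1024 * 1024)  # stand-ins for two 4MB photos
photo_b = os.urandom(4 * 1024 * 1024)

backup_1 = [photo_a]              # first DCIM dump
backup_2 = [photo_a, photo_b]     # later dump: the old photo plus a new one

total, unique = 0, set()
for backup in (backup_1, backup_2):
    for photo in backup:
        for digest in block_hashes(photo):
            total += 1
            unique.add(digest)

print(f"blocks written: {total}, unique blocks stored: {len(unique)}")
# -> blocks written: 96, unique blocks stored: 64 (a 1.5x dedup ratio)
```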

I think what I'm getting at here is "make sure you've considered all the tradeoffs first" and it looks like you've done that here by selectively enabling it rather than just attempting to apply it across the board.

Yes, you're right, and I think posts like this one - and others I've seen where people describe what they did and what effect it had on their system - are useful to other people who might be considering a similar setup.
 