Deduplication in this scenario?


dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
So I am currently experimenting with using FreeNAS as a backup device. Right now I just have a USB drive that I dump the backups to, and I am trying to add another layer in between for dailies (and then relegate the USB drive to weeklies).

I have read a lot of the warnings about deduplication and RAM requirements, and I am aware of the associated risks. My pool is only 3.6 TB usable right now, and I have about 2.5 TB of backups that get partially overwritten daily. I turned on deduplication on my test box, and it produced very promising results:

Code:
dedup: DDT entries 6459091, size 1458 on disk, 235 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    1.36M    173G    127G    131G    1.36M    173G    127G    131G
     2    1.29M    165G    116G    119G    3.72M    477G    332G    342G
     4    3.43M    439G    344G    352G    13.9M   1.74T   1.36T   1.39T
     8    77.1K   9.63G   6.90G   7.14G     684K   85.5G   61.1G   63.2G
    16    4.33K    554M    321M    341M    83.9K   10.5G   6.10G   6.48G
    32      778   97.2M   67.5M   69.9M    31.8K   3.97G   2.82G   2.91G
    64      104     13M   4.78M   5.37M    10.2K   1.27G    482M    544M
   128       46   5.75M   2.14M   2.43M    7.40K    947M    326M    373M
   256        3    384K      3K   24.0K      938    117M    938K   7.32M
   512        1    128K      1K   7.99K      852    106M    852K   6.65M
    2K        2    256K      2K   16.0K    6.47K    829M   6.47M   51.7M
    4K        1    128K      1K   7.99K    4.36K    558M   4.36M   34.8M
   16K        1    128K      1K   7.99K    31.3K   3.91G   31.3M    250M
Total    6.16M    788G    594G    609G    19.8M   2.48T   1.87T   1.92T



If I am calculating this correctly, I am seeing a deduplication table of about 1.5 GB in RAM and about 9 GB on disk. Is that correct?
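
Here is the arithmetic I used, in case I'm misreading that zdb line (I'm assuming the reported per-entry sizes apply to every one of the 6,459,091 entries):

Code:
# 6,459,091 DDT entries at 235 bytes each in core and 1458 bytes each on disk
echo $(( 6459091 * 235 / 1048576 ))    # in-core size in MiB, ~1447 (about 1.5 GB)
echo $(( 6459091 * 1458 / 1048576 ))   # on-disk size in MiB, ~8981 (about 9 GB)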

My system has 24 GB of RAM for that 3.6 TB of storage. What do you all think about the possibility of running this with dedup enabled all the time?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't have access to my dedup notes, so I can't validate the numbers/math. But dedup is one of those things that is dangerous and can have catastrophic consequences for performance. So unless you expect to reap massive gains in disk space (we're talking savings in excess of 100%, or a 2:1 ratio), I never recommend it. The only times I've seen that kind of ratio were in very specific configurations, and in those configurations the ratio was 10:1 or better.

I think 24GB of RAM is just much, much too small to ever consider dedup. If you want to save lots of storage space, go with gzip-9 compression on the dataset.
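
To be concrete about what I mean, something along these lines on the backup dataset would do it (the pool/dataset name here is just an example, substitute your own):

Code:
# enable gzip-9 on the dataset holding the backups (only affects newly written data)
zfs set compression=gzip-9 tank/backups
# after writing a few backups, check how much is actually being saved
zfs get compression,compressratio tank/backups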
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
I could be wrong, but I am pretty sure that output indicates a gain of at least 2:1, if not better. This is also with a minimal set of backups. My point is that my data is VERY redundant, and that the more of it I write, the more gains I would expect to see from deduplication. Obviously I realize it's dangerous; that's why I am trying to test extensively.

As far as the RAM goes, how much do I need? I have seen everything from 2 to 10 GB per TB listed as the "optimal" amount, and nobody seems to be able to explain how much I really need or how to calculate it. This is why I am testing it out and checking the size of my table. People throw out the "not enough RAM" statement a lot in the threads I have seen, but I have never read an actual explanation of how to figure out how much I need. If I can get 2:1 or 3:1 gains by adding RAM, I will strongly consider the cost of the RAM, so please give me an explanation.

As far as gzip-9, isn't that a bit performance intensive, especially considering my CPU?

This isn't about saving storage space really; it's about being able to save more backups without expanding space. I can safely run in my environment with only a few backups, but having more would be really convenient for me.
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
Also as another note:

At 3.6 TB of storage and 24 GB of RAM, I am at roughly 6.7 GB per TB. How is that considered "low"?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
never have I read an actual explanation on how to figure out how much I need
This article (titled "How to Determine Memory Requirements for ZFS Deduplication") is the 2nd organic result when I Google "zfs deduplication". The first result is a blog post by Jeff Bonwick, which explains that there is no limit on how much you can deduplicate; it's just that the tables will spill into L2ARC and eventually to disk, which will be like hitting a wall in terms of performance. The main issue at that point is that you may need to add a lot of RAM before you can even mount the pool again.
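
If I remember the article correctly, the method boils down to counting the blocks in the pool and allowing roughly 320 bytes of RAM per in-core DDT entry, along the lines of the following (the pool name is just an example, and the 320-byte figure is a rule of thumb, not an exact number):

Code:
# count the blocks the pool currently holds; look for the "bp count" line in the output
zdb -b tank
# or, on a pool that doesn't have dedup enabled yet, simulate it and print a projected DDT
zdb -S tank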
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
This article (titled "How to Determine Memory Requirements for ZFS Deduplication") is the 2nd organic result when I Google "zfs deduplication". The first result is a blog post by Jeff Bonwick, which explains that there is no limit on how much you can deduplicate; it's just that the tables will spill into L2ARC and eventually to disk, which will be like hitting a wall in terms of performance. The main issue at that point is that you may need to add a lot of RAM before you can even mount the pool again.

My bad. I googled "FreeNAS Deduplication" in order to get results specific to the software I was using. Thanks for the link despite being snotty about it.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Thanks for the link despite being snotty about it.
Uh, you're welcome?

Here's an observation that may also seem "snotty", but is meant to help.

If you didn't think of Googling "zfs deduplication" instead of "FreeNAS deduplication", you're probably already in over your head. But perhaps you're a really fast learner and will do just fine if you study ZFS hard enough.
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
You know, I made an honest mistake. I have been searching these forums so much that I forget to google ZFS rather than FreeNAS. Now that I realize my mistake I can move on.

What I don't understand is why the more experienced members of this forum seem to be so rude. I honestly came here asking for help, and what I got was help with a side of crap attitude and insults. If helping people puts you in such a bad mood that you need to insult them, then maybe you should consider not providing that help.

Anyways, thanks again, and I will make sure not to ask any more stupid questions here.
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
This article (titled "How to Determine Memory Requirements for ZFS Deduplication") is the 2nd organic result when I Google "zfs deduplication". The first result is a blog post by Jeff Bonwick, which explains that there is no limit on how much you can deduplicate; it's just that the tables will spill into L2ARC and eventually to disk, which will be like hitting a wall in terms of performance. The main issue at that point is that you may need to add a lot of RAM before you can even mount the pool again.

So, based on the calculation mentioned in this article, I would need (6.16M blocks) * (320 bytes) of memory, which works out to roughly 1971 MB. That seems far below the "recommended" 5 GB per TB of data. Can anyone explain what I am missing here?

P.S. If you had read my original question, you would have seen that I had already done a similar calculation based on what I have read. Nobody has addressed whether the numbers I am seeing (which seem really low compared to the doomsday warnings) make sense, or whether I am way off base here.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
what I got was help with a side of crap attitude and insults
When I Google something and the answer comes right up, I can yell "just Google it", or I can post those results. I think the latter is more helpful, and I don't think it's rude.

My assumption that you're not clear on the relationship between FreeNAS, FreeBSD and ZFS seemed reasonable to me based on the content of your posts. If my assumption were correct, I believe suggesting that you may be in over your head would also be helpful, and also not rude. For someone with that level of understanding, attempting something as advanced as deduplication seems ill-advised, and someone needs to be the one to say so.

Apparently my assumption was incorrect, and for that, I apologize.
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
Thanks. I can see how you could make that assumption.

On the original topic:

I have been testing a pool with deduplication enabled, and it does seem like I am hitting a memory ceiling and running into some serious performance problems. Where I was previously able to write at 700 Mbps almost constantly, I am now seeing serious spikes and drops in speed, and the same spikes appear to be mirrored in the memory usage of the system. My guess is that I am running out of memory, the DDT is being written out to disk (very slowly), and then writes pick back up again, and so on.

Now the question arises: is it more beneficial to add the ~$800-$1000 worth of memory needed to use dedup, or would I be better off getting more space? Keeping in mind that this is just backups (which are extremely redundant), I would guess that in the long run I might do better getting the memory and then adding space as I run out. The several TB of data that I can store will be enough for now while I work on getting the memory installed, and given the nature of my data the RAM may actually go further than adding more storage.

Does any of this sound crazy or way off base? Also, how much should I add? The ~2 GB for the table suggested by that link seems like a very low baseline that doesn't really translate into real-world performance.
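
For anyone who wants to check my reasoning, this is roughly what I have been watching during the tests (pool name is a placeholder, and these are the FreeBSD sysctl names as I understand them, so corrections are welcome):

Code:
# DDT summary for the pool (entry count plus in-core and on-disk size per entry)
zpool status -D tank
# current ARC size versus its target maximum
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max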
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
is it more beneficial to add the ~$800-$1000 worth of memory to use dedupe, or would I be better off getting more space
I can't tell you with any real authority, but my belief is that you can get where you want to be at significantly lower cost by adding storage rather than RAM, and at the same time avoid the kind of pitfalls that deduplication introduces. This is why you'll see people being steered away from deduplication at every opportunity in these forums, unless they have a very special use case.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
By the way, a backup solution that takes care of deduplication before the data reaches FreeNAS would make this all moot. Examples include CrashPlan and Arq Backup; no doubt there are others.
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
That is true, but the long story short (and it's a long story) is that most of those are not really acceptable in my situation. Also, in my experience (limited though it may be), most backup software that uses deduplication is extremely slow.

Also, I am using VMware ESXi, not backing up clients, so it's a little different than what you may be thinking (if I can be so bold as to assume what you're thinking).
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Also, I am using VMware ESXi, not backing up clients, so it's a little different than what you may be thinking (if I can be so bold as to assume what you're thinking).
Assumption validated ;)

There are block level backup solutions available, maybe one of those would meet your requirements.
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
Assumption validated ;)

There are block level backup solutions available, maybe one of those would meet your requirements.

Unfortunately, most of those require full (paid) versions of ESX, while I am using the free ESXi (I am going as cheap as possible due to company constraints). I liked the idea of getting deduplication for free, but that just seems like a pipe dream.

I know I am going to have to convince the boss to spend some money at some point, but I like to make sure it's the cheapest (decent) option that will get what we want. Just depends on what I find works best.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The output of "zpool list" will tell you the dedup ratio.
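
For example (pool name here is just a placeholder), the DEDUP column is the pool-wide ratio:

Code:
zpool list tank
# or just the ratio on its own
zpool get dedupratio tank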

Yes, compression will use CPU cycles, particularly on the write side of things. But generally, all things considered, you'll see better performance with compression than with dedup.

The amount of RAM can be astronomical. If you have lots of small blocks, you can require hundreds of GB of RAM for just 1TB of data; in a worst-case scenario you could need as much as 800GB or so of RAM for just 1TB of data! No, I didn't typo that. The factor that determines how much RAM you need is how many blocks you have, which is a function of the quantity of data and the size of the blocks.

Finding out you need 256GB+ of RAM just to mount your zpool is downright frightening, and you probably won't know about it until you actually hit the limit. And when are you likely to find out? When your primary crashes and you start doing stuff on the backup because you want to do a restore or put it into service as the primary. Is that the kind of scenario you want when your primary system goes completely offline? I don't. I want backups I can trust, with my data readily available, without having to figure out how much RAM I need just to mount the zpool.

Anyway, I would never go with a dedup'ed backup unless you understand exactly how the deduping works and nothing "could" go wrong. Generally, that is basically impossible, especially if you aren't doing something that actually generates massive amounts of duplicate data without human intervention.
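
To put rough numbers on the block-size point, here's the back-of-the-envelope math using the common ~320 bytes per in-core DDT entry figure (a rule of thumb, not an exact formula):

Code:
# 1 TiB of data stored as 512-byte blocks is ~2.1 billion blocks; at ~320 bytes each:
echo $(( 1024 * 1024 * 1024 * 1024 / 512 * 320 / 1024 / 1024 / 1024 ))   # prints 640 (GiB of DDT)
# the same 1 TiB stored as 128K records is ~8.4 million blocks:
echo $(( 1024 * 1024 * 1024 * 1024 / 131072 * 320 / 1024 / 1024 ))       # prints 2560 (MiB of DDT)

Which of those two worlds you end up in depends entirely on the record size of the data being written, and that's why the per-TB rules of thumb you see quoted vary so wildly.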

I'm really not sure where you got the idea that @Robert Trevellyan was being snotty. I thought he was being very informative and was very accurate, even citing links for you to read.

To be honest, if you do much searching about dedup around here, you'll find it is also a very, very touchy subject. Virtually nobody who isn't a ZFS expert can truly make an informed decision about whether dedup is a good idea for them. It's not as simple as it seems, and it has the capacity to be just fine... and then you jump off a cliff without warning. People choose ZFS because it is the storage solution that protects and stores your data in the safest manner known to man. Dedup conflicts with that philosophy, so naturally anyone showing up arguing that they want to dedup is asking a group of people who are going to conservatively tell them not to do what they want.

Now, considering everything, can you really argue that you are going with ZFS for its data security while simultaneously arguing that you're cutting corners on cost by choosing the rather risky dedup that ZFS offers? I'd argue that you couldn't convince most people that you aren't a bit conflicted with yourself, because you're probably saving, at most, about $500 or so in additional hardware. Is that kind of risk really worth that little money? If you were going to save $300k in hardware, I'd be making a different argument. But you aren't. You're likely saving what equates to a house payment or less. If the business can't bend enough to spend that (relatively small) amount of liquid assets to protect the data, why are you even trying to use ZFS at all? If that is the case, then clearly the business is so strapped for cash that ZFS is about the least of its worries in the bigger picture.

So I think things really should be put into perspective, especially when arguing for dedup. People generally save very little disk space (which is cheaper than ever), while the risk is very real and potentially fatal for the zpool if things go bad.

Here's a quote from the release notes from when FreeNAS was updated and dedup was supported...

ZFS v28 includes deduplication, which can be enabled at the dataset level. The more data you write to a deduplicated volume the more memory it requires, and there is no upper bound on this. When the system starts storing the dedup tables on disk because they no longer fit in RAM, performance craters. There is no way to undedup data once it is deduplicated; simply switching dedup off has NO EFFECT on the existing data. Furthermore, importing an unclean pool can require between 3-5GB of RAM per TB of deduped data, and if the system doesn't have the needed RAM it will panic, with the only solution being adding more RAM or recreating the pool. Think carefully before enabling dedup! Then after thinking about it use compression instead.

It's written in a way that I find hilarious, but it is also very true.

To be frank, and at the risk of being accused of being snotty myself: you've admitted you don't know how much you'd save with gzip-9, yet you didn't really take that suggestion into consideration, even though it came from people with far more ZFS experience than you.
 

dairyengguy

Dabbler
Joined
Jul 17, 2015
Messages
28
Thank you for your well-thought-out and detailed post. I don't have the time or the energy to respond to all of it, but I appreciate every bit. I have seen all the very cautious warnings about dedup, but I hadn't seen (on this forum at least) the why of it. I don't like taking advice without seeing the reasoning behind it, and after all this discussion I am definitely starting to see it in more depth.

I have not written off the idea of gzip-9; I just haven't had a chance to test it. I was going to try it out today, but I am having other issues with my test pool that I need to work out before I can compare compression and deduplication further.

Your points about the safety considerations are well taken, and I believe that may convince me to finally drop the idea.

Again, thanks to both of you for being patient with me and helping me understand this a little further. Like I said before, I hate blindly taking advice, and you have allowed me to not do that.
 