Anyone using Dedup?

firepowr

Cadet
Joined
Jan 5, 2013
Messages
5
I've read about the concerns around enabling the dedup option. It seems many are against using it, but from reading the wiki, most of the concern is based on having limited RAM.

I'm very interested in using dedupe on my FreeNAS server. It currently has 16GB of RAM with 2TB of storage. That amount of RAM should be more than enough to hold the dedupe tables and should avoid any of the concerns that were mentioned.

Does anyone here in the forums use the dedup function?

Another quick question: if you do use dedup, have you tried periodic snapshots and replication on that volume?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd love to hear too. I'd bet there are extremely few people using it, considering how cheap hard drive space is versus the potential consequences if you don't have enough RAM.

I'd think for home users we wouldn't do it because we want maximum storage space without spending $2k+ on RAM alone.

For business users, I'd imagine they would have a backup location that may not support dedup, so any "savings" in disk space is lost on the backup. If the backup location is an identical FreeNAS (or another system that supports ZFS with dedup), they can probably afford to buy the total amount of disk space they want anyway. So why bother with dedup, given the potentially devastating consequences if things go bad?

Quite honestly, it's a feature I expect will be used in very, very, very few locations on the planet.

In my head I have the opinion that the only people who will use dedup will be:

1. Bada** mofos that don't need no "steenkin' forum" for support, or
2. Dumba**es who really don't know what they're getting themselves into.

LOL.
 

firepowr

Cadet
Joined
Jan 5, 2013
Messages
5
For the most part those two points are true. But I was thinking of implementing an off-site DR backup, and it would be easier for me to pitch something as inexpensive as FreeNAS rather than a Data Domain appliance.

With the spare servers that we have and the dedup feature FreeNAS has, I can probably implement it without much cost, if any.

I think many aren't using dedupe because of the lack of any real-world testing, or people just aren't talking about it. Anyhow, I'm going to run a test setup on some spare hardware and see if I run into any of those issues.

My hypothesis is that with 8GB of RAM per 1TB of storage I should be fine. I'll report back with my findings in a week or two.
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
The biggest issue I see is that it limits the ability to grow storage space. It's trivial (price wise, something like $15k) to build a FreeNAS machine that is 70+ TB. It is not trivial to get sufficient RAM to use dedupe on this (if memory serves, something like 3.5TB).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Assuming sufficient RAM? Sure, I guess.

Go ahead and check it out. Do a "zdb -S yourpool". Take the total allocated blocks. Multiply by 320. Then extrapolate based on how full your pool is. For example, the pool on this overly full 4 x 3TB RAIDZ2 box here shows 35.5M allocated total, 89% capacity, so that works out to 12.8GB needed for DDT... that's probably within the realm of possibility because the host platform does have 32GB and also SSD L2ARC.
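
To put the same arithmetic in one place, here's a rough sketch (not an exact sizing tool; the 320 bytes per allocated block is just the rule of thumb used above, and the function name is made up for illustration):

    # Estimate DDT RAM from `zdb -S <pool>` output, then extrapolate to a full pool.
    BYTES_PER_DDT_ENTRY = 320      # rough rule of thumb per unique allocated block

    def ddt_ram_gb(allocated_blocks, pool_fill_fraction):
        now = allocated_blocks * BYTES_PER_DDT_ENTRY
        return (now / pool_fill_fraction) / 1e9    # scale up to a 100% full pool, in GB

    # The example from this post: 35.5M allocated blocks on a pool that is 89% full.
    print(ddt_ram_gb(35.5e6, 0.89))                # ~12.8GB, matching the figure above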

I expect that dedup has very specific use cases where it works out very well. However, here in the free FreeNAS community, you're more likely to find people using NAS for things like media storage, and my suspicion is that the people actively using dedup for anything significant are less likely to be active on the forum.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The biggest issue I see is that it limits the ability to grow storage space. It's trivial (price wise, something like $15k) to build a FreeNAS machine that is 70+ TB. It is not trivial to get sufficient RAM to use dedupe on this (if memory serves, something like 3.5TB).

Is it really that much space for the DDT? Maybe I've been seeing the wrong sizing guidelines.

Still, with 512GB SSDs priced at around $450, you can stick a 3.5TB DDT in only 7 SSDs of L2ARC (plus an a**load of RAM for the ARC needed to support the L2ARC; really, that gets a bit dicey), which is not all that expensive at around $3000; if your FreeNAS machine is $15K and you can suddenly store "twice" as much in it, you've just saved yourself $12K.
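
Penciling that out as a quick sketch (the prices and the pessimistic 3.5TB DDT size are just the figures quoted in this thread):

    # Back-of-the-envelope cost of holding a 3.5TB DDT in L2ARC
    # at the ~$450 per 512GB SSD price point mentioned above.
    import math

    ddt_bytes     = 3.5e12     # pessimistic DDT size from the discussion
    ssd_bytes     = 512e9      # capacity of one L2ARC SSD
    ssd_price_usd = 450

    ssds_needed = math.ceil(ddt_bytes / ssd_bytes)    # -> 7 drives
    l2arc_cost  = ssds_needed * ssd_price_usd         # -> $3150, i.e. roughly $3000
    print(ssds_needed, l2arc_cost)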

For 70TB, though, I would have expected DDT space required to be much less than 3.5TB. That just seems unreasonably large.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
Is it really that much space for the DDT? Maybe I've been seeing the wrong sizing guidelines.
Possibly; see the bottom of this blog entry. Note Roch used 200 bytes for his DDTs vs. the 320 here. 8K record size at 320 bytes looks to come out to about 3.5TB. As you said, you can throw it on some L2ARC and have less RAM.

For 70TB, though, I would have expected DDT space required to be much less than 3.5TB. That just seems unreasonably large.
If you are using large/default record sizes with larger files, sure. Of course, you can likely increase your dedup ratio by using smaller record sizes, e.g. 8k, at the cost of greater DDT space (rough math at the bottom of this post).

My hypothesis is that with 8GB of RAM per 1TB of storage I should be fine.
Likely, with the default 128k record size and presumably large backup files. Do as jgreco suggested and actually look at what your dedupe ratio would be. Sun/Oracle's recommendation is to leave it turned off if the ratio is less than 2.
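
For reference, the record-size arithmetic behind the 3.5TB figure, as a rough sketch (assumes ~320 bytes per DDT entry and a worst case where every block in the 70TB pool is unique; the function is only illustrative):

    # DDT size vs. record size for the 70TB example discussed above.
    BYTES_PER_DDT_ENTRY = 320

    def ddt_size_tb(pool_tb, record_size_bytes):
        entries = pool_tb * 1e12 / record_size_bytes   # unique blocks, worst case
        return entries * BYTES_PER_DDT_ENTRY / 1e12

    print(ddt_size_tb(70, 8 * 1024))      # ~2.7TB at 8K records -- same ballpark as the ~3.5TB above
    print(ddt_size_tb(70, 128 * 1024))    # ~0.17TB at the default 128K record size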
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
8K record size at 320 bytes looks to come out to about 3.5TB.

However, FreeNAS defaults to 128k record size.. so we're not looking at "DDT required" so much as we are at "it's possible to make bad choices that bloat the DDT".

Someone from ixSystems posted a fairly negative set of comments about dedup and excessive memory requirements causing problems when importing the pool, so if you're going to artificially bloat the DDT size requirements through the use of smaller record sizes, it seems like you'd want to know what you were doing. Especially if it might turn into a nasty situation where actual RAM (rather than L2ARC) would be needed in order to bootstrap the pool. Systems capable of 1TB or more of RAM are the exception rather than the rule, as previously noted.

The other problem with smaller record sizes is that performance is going to drop off substantially. If I calculate a checksum on a 128k record and then have to do an L2ARC SSD lookup on it, that's not spectacularly fast. However, using an 8k record, for the same amount of data, that is 16 separate L2ARC SSD lookups...

So if you have a system with a 10gig interface, that can write up to about 1GB/sec of data to a pool realistically (assuming a sufficiently fast pool). However, as far as I recall, dedup is a synchronous process in the output stream, meaning dedup processing can slow things down. At 128KB records, that is only ~8k lookups per second for the DDT (is that right? sounds a bit optimistic but I don't see any major error) but at 8KB records, that's ~128k lookups per second, which is a lot more stressy.
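
Checking those rates with a quick sketch (assumes a flat 1GB/sec of incoming writes, as above):

    # DDT lookups per second needed to keep up with ~1GB/sec of writes.
    write_rate_bytes_per_sec = 1e9

    for record_size in (128 * 1024, 8 * 1024):
        lookups = write_rate_bytes_per_sec / record_size
        print(record_size // 1024, "K records:", round(lookups), "DDT lookups/sec")
    # 128K records: ~7,629/sec (the "~8k" above); 8K records: ~122,070/sec (the "~128k" above)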

Turning down the record size certainly provides more potential opportunity for dedup hits, but also represents a significant increase in system stressors.

The good news is that it is now possible to buy systems with 384GB of RAM at an approximately reasonable price (approx. $3000 for 24 sticks of 16GB at $125/ea). And 768GB and 1TB systems are *possible* if you really need it.

So we're finally finding out why someone might need more than 640KB. :smile:
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
However, FreeNAS defaults to 128k record size.. so we're not looking at "DDT required" so much as we are at "it's possible to make bad choices that bloat the DDT".
Let's hope we don't need to dedupe terabytes of small-ish files, say around 8k each. ;)

Someone from ixSystems posted a fairly negative set of comments about dedup and excessive memory requirements causing problems when importing the pool, so if you're going to artificially bloat the DDT size requirements through the use of smaller record sizes, it seems like you'd want to know what you were doing.
Hmm, artificially. I would say deliberately, but yes given the RAM requirements this is obviously a poor choice for our theoretical 70TB of unique data pool.

You can make use of L2ARC during import. The problem is it's going to be empty and reading in the DDT turns into lots of random reads, joy.

The other problem with smaller record sizes is that performance is going to drop off substantially. If I calculate a checksum on a 128k record and then have to do an L2ARC SSD lookup on it, that's not spectacularly fast. However, using an 8k record, for the same amount of data, that is 16 separate L2ARC SSD lookups...
Calculating the checksum for an 8k record will be faster than a 128k record. I thought SSDs can achieve greater IOPS at smaller record sizes or is that just for the older ones? Certainly with the fastest PCIe based ones, you will slow down due to the transaction overhead with smaller record sizes.

Turning down the record size certainly provides more potential opportunity for dedup hits, but also represents a significant increase in system stressors.
I hope we're not running this on an Atom. Properly designed, we should have sufficient CPU, memory/bus bandwidth, etc., but with such large systems things can get complicated. No doubt there are any number of pathological cases to account for. If we are close to some of them, then yes, 8k records may very well expose them.

So we're finally finding out why someone might need more than 640KB. :smile:
Heh, here's an interesting post: requiring the DDT to be in the L2ARC, and not ever in the ARC, would significantly reduce the memory requirements. Perhaps it will even happen before BP rewrite is done.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Let's hope we don't need to dedupe terabytes of small-ish files, say around 8k each. ;)

That runs into one of the fundamental storage issues of the past decade or two: we're still largely limited by performance of hard drives, especially seek.

In 1994, Seagate launched the Barracuda ST32550N, a 2GB speed demon with 8ms average seek, and I want to say about 4MBytes/sec read.

In 2012, HGST offers the Ultrastar 7K4000, a 4TB speed demon with 8ms average seek, and 160MBytes/sec read.

Now, a few points to ponder.

The 7K4000 is 2000 times the capacity of the 32550.

With the 32550, using straight math, one can store about 250,000 8K files. Assuming your drive is capable of 200 seeks per second, and assuming that each read or write can be represented in a single seek, that means that reading (or writing or whatever) all of those takes 1250 seconds, or about a third of an hour.

With the 7K4000, using straight math, one can store about 500,000,000 8K files. Same assumptions: reading (or writing) all of those takes 29 *DAYS*.
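
The same straight math, for anyone who wants to plug in their own drive (one seek per 8K file and ~200 seeks per second, exactly the assumptions above; the helper is just for illustration):

    # Time to read (or write) every 8K file on a drive, one seek per file.
    def time_to_touch_all(drive_bytes, file_size=8 * 1024, seeks_per_sec=200):
        files = drive_bytes / file_size
        return files / seeks_per_sec               # seconds

    print(time_to_touch_all(2e9) / 60)             # ST32550N, 2GB:  ~20 minutes
    print(time_to_touch_all(4e12) / 86400)         # 7K4000, 4TB:    ~28 days, i.e. about a month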

There are of course other factors involved, but this is very fundamental: as the size of a hard drive based storage system goes up, its ability to store and retrieve things becomes impaired if the size of those things is not also increasing. There are some ways to reduce the impact of this, of course; with ZFS, for example, more memory and more L2ARC can have a significant impact on certain types of workloads.

However, if you are hoping to fill a 70TB storage system with 8K files, I hope you've planned for it to take awhile... quite awhile...

Hmm, artificially. I would say deliberately, but yes given the RAM requirements this is obviously a poor choice for our theoretical 70TB of unique data pool.

I can see justification for calling it either way.

You can make use of L2ARC during import. The problem is it's going to be empty and reading in the DDT turns into lots of random reads, joy.

It isn't smart enough to accelerate the process? That would be horrifying. I honestly haven't looked at much of the dedup code in sufficient detail to know how it handles this. Given sufficient free resources, though, it ought to be able to prefetch the entire DDT and populate ARC/L2ARC from it, at least as an option!

Calculating the checksum for an 8k record will be faster than a 128k record.

But calculating a checksum for a 128K record and then doing one lookup is going to take about one sixteenth the time of calculating the checksums for sixteen 8K records and doing those sixteen lookups; SATA/SAS IOPS are going to be awesomely slow and will dominate the time requirements. I wouldn't expect to be able to spot the difference between calculating the checksum on an 8K record and a 128K record once you involve that lookup. On the other hand, if you can fit the DDT all in RAM, yay, that would be interesting and worth thinking about, but we already decided that a 3.5TB DDT in core was mostly impractical on current systems, I think?

I thought SSDs can achieve greater IOPS at smaller record sizes or is that just for the older ones? Certainly with the fastest PCIe based ones, you will slow down due to the transaction overhead with smaller record sizes.

I would imagine it depends in part on the technology used within the SSD. There are some SSDs where the speed differentials between capacities suggest that the capacities are being increased through the addition of banks, in which case, that suggests the potential for better IOPS on higher capacity models (and for devices such as the 320 that appears to be reflected in the specs, see write IOPS table). However, most current generations of SSD appear to be utilizing 4K pages, or at least, I haven't noticed anything to suggest otherwise. So I would expect SATA transaction overhead could be a significant factor in performance on non-PCIe units...

I hope we're not running this on an Atom. Properly designed, we should have sufficient CPU, memory/bus bandwidth, etc., but with such large systems things can get complicated. No doubt there are any number of pathological cases to account for. If we are close to some of them, then yes, 8k records may very well expose them.

Heh, here's an interesting post: requiring the DDT to be in the L2ARC, and not ever in the ARC, would significantly reduce the memory requirements. Perhaps it will even happen before BP rewrite is done.

Lots of room for improvement in ZFS, as good as it is.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
There are of course other factors involved, but this is very fundamental: as the size of a hard drive based storage system goes up, its ability to store and retrieve things becomes impaired if the size of those things is not also increasing.
Yes and the error rate hasn't improved recently either. Give me reliable and cheap SSDs. We'll get to cheap certainly, but I'm a bit concerned about the reliable part.

However, if you are hoping to fill a 70TB storage system with 8K files, I hope you've planned for it to take awhile... quite awhile...
Ideally we will have built up to the 70TB over time; otherwise I don't want to think about it. ZFS would be somewhat interesting with this, as long as the writes are async, since they would be batched into their respective txgs.

It isn't smart enough to accelerate the process? That would be horrifying. I honestly haven't looked at much of the dedup code in sufficient detail to know how it handles this. Given sufficient free resources, though, it ought to be able to prefetch the entire DDT and populate ARC/L2ARC from it, at least as an option!
I don't now recall where I saw that reading the DDT turns into a random read workload. In most of the threads where people have problems importing, they have also deleted a snapshot or dataset with dedupe in use. Though if that's accurate, I don't see prefetch being particularly useful when doing random reads?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes and the error rate hasn't improved recently either. Give me reliable and cheap SSDs. We'll get to cheap certainly, but I'm a bit concerned about the reliable part.

Less than $1/GB for SSD. Already at cheap. SSD gives us fast. ZFS with RAIDZ2 gives us reliable. Careful selection of SSD's can even give you energy efficient. Whoever said "pick two!" ... I say, "have all four."

Really, I think we've finally hit a bit of a plateau in hard drive capacity growth. Sure, there will probably be 5TB'ers by the end of this year, but 3TB'ers were released back in 4Q2010. 250GB'ers were new in 2004; capacity quadrupled in 3 years to see 1TB in 3Q2007; capacity tripled to see 3TB in 4Q2010. Now we haven't even seen a doubling in 3 more years. What are people using larger hard drives for? As far as I can tell, unless you're Backblaze or another big-data-but-not-valuable-data player, the vast majority of this newer generation of large hard drives seems to be primarily used to store multimedia. That may suggest that we'll see a softening in the growth rate of spinny media. It also suggests that we'll see SSD price-per-GB continue to crash as people decide they really do want their PC to contain an SSD, and they want that 500GB-1TB they're used to as an SSD.

Ideally we will have built up to the 70TB over time; otherwise I don't want to think about it. ZFS would be somewhat interesting with this, as long as the writes are async, since they would be batched into their respective txgs.

I had thought about mentioning that, but really it's completely worthless unless you have a special application - possibly backups - where you never need to read the data. Otherwise, reads are usually more frequent than writes, and will kill you.

I don't now recall where I saw that reading the DDT turns into a random read workload. In most of the threads where people have problems importing, they have also deleted a snapshot or dataset with dedupe in use. Though if that's accurate, I don't see prefetch being particularly useful when doing random reads?

What you actually need is something... let's call it pre_load_ despite what I previously said. Being able to locate metadata quickly is important; ZFS has certainly solved that problem in some manner, but I'm not sure how efficiently (the sheer amount of RAM that ZFS likes may be hiding a less-than-ideal design, I don't really know). However, locality is also important. And here I'm speaking all hypothetically, because I haven't gotten into the code enough to know what's actually going on. But the point is simple to comprehend. The average DDT lookup is going to be a random read. That means that it could be implemented with a metadata block allocated wherever. If there was never any chance of doing something like a preload, then that's not a big decision for the purposes of this discussion. However, we have things like L2ARC and prefetch to consider. Ideally, a pool import could involve reading the DDT metadata from the pool and immediately populating the L2ARC with it (not through the normal ARC/L2ARC flush process, which is slow, but as a specialized part of pool import). This could have the effect of slowing down pool import if not implemented cleverly, but that could still be a win for sites with heavy DDT requirements. This only works well with contiguous metadata, as it won't be practical to seek around the disks millions of times to do an import.

If we instead rely on prefetch, and make sure that prefetch will handle DDT metadata aggressively, that also works well as long as there are contiguous ranges for prefetch to handle. A modern disk I/O is mostly seek, and the difference between a seek+reading 4KB and a seek+reading 1MB will not be particularly noticeable. So, if the DDT metadata is contiguous, you could actually just mostly rely on built-in ZFS mechanisms to handle ARC/L2ARC and populate the in-core/in-L2ARC DDT data rather quickly without doing the pre_load_ I describe above.

Both of these techniques are basically intended to rapidly populate the in-L2ARC DDT. Both are dependent on the DDT data being organized in some manner that can be dealt with in a contiguous manner.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I read up a bit on importing a pool with dedupe enabled. The DDT is loaded as needed, which during a typical import means little to none of it. A typical import should also complete just as quickly, which makes sense when you think about it: ZFS doesn't normally go reading large amounts of data in the pool at import.

Most of the problems I've read about have to do with people doing some sort of destroy, then the box crashing or them rebooting when it becomes unresponsive. However, ZFS remembers that it's doing the destroy and gleefully continues with it during any subsequent import, which in turn crashes the box or causes the system to become unresponsive again. This continues until they provide adequate hardware to do the import or, if the box is still functioning, wait the week or so for it to finish.
 
