Let's hope we don't need to dedupe terabytes of small-ish files, say around 8k each. ;)
That runs into one of the fundamental storage issues of the past decade or two: we're still largely limited by the performance of hard drives, especially seek times.
In 1994, Seagate launched the Barracuda ST32550N, a 2GB speed demon with 8ms average seek, and I want to say about 4MBytes/sec read.
In 2012, HGST offers the Ultrastar 7K4000, a 4TB speed demon with 8ms average seek, and 160MBytes/sec read.
Now, a few points to ponder.
The 7K4000 is 2000 times the capacity of the 32550.
With the 32550, using straight math, one can store about 250,000 8K files. Assuming your drive is capable of 200 seeks per second, and assuming that each read or write can be represented in a single seek, that means that reading (or writing or whatever) all of those takes 1250 seconds, or about a third of an hour.
With the 7K4000, using straight math, one can store about 500,000,000 8K files. Same assumptions, and reading all of those takes 2,500,000 seconds, or about 29 *DAYS*.
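To make the arithmetic concrete, here's a quick back-of-envelope sketch in Python, using the same assumed figures as above (8,000-byte files, ~200 seeks/sec, one seek per file access):

    def full_scan_seconds(capacity_bytes, file_size=8_000, seeks_per_sec=200):
        """Seconds to touch every file once if each access costs one seek."""
        n_files = capacity_bytes // file_size
        return n_files / seeks_per_sec

    st32550n = full_scan_seconds(2 * 10**9)     # 250,000 files
    hgst7k4000 = full_scan_seconds(4 * 10**12)  # 500,000,000 files
    print(f"ST32550N: {st32550n / 60:.0f} minutes")     # ~21 minutes
    print(f"7K4000:   {hgst7k4000 / 86_400:.0f} days")  # ~29 days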
There are of course other factors involved, but this is very fundamental: as the size of a hard-drive-based storage system goes up, its ability to store and retrieve things becomes impaired if the size of those things is not also increasing. There are ways to reduce the impact of this, of course: with ZFS, for example, more memory and more L2ARC can have a significant impact on certain types of workloads.
However, if you are hoping to fill a 70TB storage system with 8K files, I hope you've planned for it to take a while... quite a while...
Hmm, artificially. I would say deliberately, but yes, given the RAM requirements, this is obviously a poor choice for our theoretical 70TB pool of unique data.
I can see justification for calling it either way.
You can make use of the L2ARC during import. The problem is that it's going to be empty, and reading in the DDT turns into lots of random reads. Joy.
It isn't smart enough to accelerate the process? That would be horrifying. I honestly haven't looked at much of the dedup code in sufficient detail to know how it handles this. Given sufficient free resources, though, it ought to be able to prefetch the entire DDT and populate ARC/L2ARC from it, at least as an option!
Calculating the checksum for an 8K record will be faster than for a 128K record.
But calculating a checksum for a 128K record and then doing one lookup is going to be roughly sixteen times faster than calculating checksums for sixteen 8K records and doing those sixteen lookups; SATA/SAS IOPS are going to be awesomely slow and will dominate the time requirements. I wouldn't expect to be able to spot the difference between calculating the checksum on an 8K record and on a 128K record once you involve that lookup. On the other hand, if you can fit the DDT all in RAM, yay, that would be interesting and worth thinking about, but we already decided that a 3.5TB DDT in core was mostly impractical on current systems, I think?
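Back-of-envelope again, with purely illustrative numbers (say ~10ms per uncached DDT lookup and ~500MBytes/sec of checksum throughput, both assumptions), just to show how thoroughly the lookup swamps the hashing:

    LOOKUP_SECONDS = 0.010           # assumed cost of one uncached DDT lookup
    CHECKSUM_BYTES_PER_SEC = 500e6   # assumed hashing throughput

    def dedupe_seconds(total_bytes, record_size):
        """Seconds to checksum and look up total_bytes worth of records."""
        n_records = total_bytes / record_size
        return total_bytes / CHECKSUM_BYTES_PER_SEC + n_records * LOOKUP_SECONDS

    chunk = 128 * 1024
    print(dedupe_seconds(chunk, 128 * 1024))  # 1 lookup:   ~0.010 s
    print(dedupe_seconds(chunk, 8 * 1024))    # 16 lookups: ~0.160 s

The checksum term barely registers either way; the sixteen lookups are what hurt.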
I thought SSDs could achieve greater IOPS at smaller record sizes, or is that just the older ones? Certainly with the fastest PCIe-based ones, you will slow down due to the transaction overhead with smaller record sizes.
I would imagine it depends in part on the technology used within the SSD. There are some SSDs where the speed differentials between capacities suggest that the capacities are being increased through the addition of banks, in which case that suggests the potential for better IOPS on higher-capacity models (and for devices such as the 320, that appears to be reflected in the specs; see the write IOPS table). However, most current generations of SSD appear to be using 4K pages, or at least I haven't noticed anything to suggest otherwise. So I would expect SATA transaction overhead could be a significant factor in performance on non-PCIe units...
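As a hand-wavy illustration of that transaction overhead (both numbers below are assumptions: a SATA 6Gb/s link good for roughly 550MBytes/sec, and roughly 20 microseconds of fixed per-command cost):

    LINK_BYTES_PER_SEC = 550e6     # assumed usable SATA 6Gb/s bandwidth
    PER_CMD_OVERHEAD_SEC = 20e-6   # assumed fixed per-command cost

    def effective_mb_per_sec(record_size):
        """Throughput for back-to-back transfers of record_size bytes."""
        per_cmd = PER_CMD_OVERHEAD_SEC + record_size / LINK_BYTES_PER_SEC
        return record_size / per_cmd / 1e6

    for size in (4 * 1024, 8 * 1024, 128 * 1024):
        print(f"{size // 1024:>4}K records: ~{effective_mb_per_sec(size):.0f} MB/sec")

Small records get you more IOPS in absolute terms, but the fixed per-command cost eats most of the bandwidth; large records get much closer to the link speed.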
I hope we're not running this on an Atom. Properly designed, we should have sufficient CPU, memory/bus bandwidth, etc., but with such large systems things can get complicated. No doubt there are any number of pathological cases to account for. If we are close to some of them, then yes, 8K records may very well expose them.
Heh, here's an interesting post: requiring the DDT to be in the L2ARC and never in the ARC would significantly reduce the memory requirements. Perhaps it will even happen before BP rewrite is done.
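For a sense of scale, here's a rough sketch using the commonly quoted rule of thumb of roughly 320 bytes of ARC per DDT entry (the exact per-entry cost varies by version, so treat this as order-of-magnitude only):

    DDT_ENTRY_BYTES = 320   # assumed in-core cost per DDT entry (rule of thumb)

    def ddt_ram_bytes(unique_data_bytes, record_size=8 * 1024):
        """Rough RAM needed to hold the entire DDT in core."""
        n_entries = unique_data_bytes // record_size
        return n_entries * DDT_ENTRY_BYTES

    pool = 70 * 10**12   # the theoretical 70TB of unique data
    print(f"~{ddt_ram_bytes(pool) / 1e12:.1f} TB of DDT at 8K records")
    print(f"~{ddt_ram_bytes(pool, 128 * 1024) / 1e9:.0f} GB at 128K records")

Terabytes for the 8K case versus a couple hundred gigabytes at 128K, which is why pushing the DDT out to L2ARC looks so attractive for this scenario.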
Lots of room for improvement in ZFS, as good as it is.