File hash database on TrueNAS

raigan21

Cadet
Joined
Aug 16, 2019
Messages
8
Hello, community,

So I was checking some videos and documents on how dedupe works in TrueNAS. Correct me if I am wrong, but from what I could understand, TrueNAS creates a table where it puts each file's hash, and later compares a new file's hash against this table to identify whether there are duplicates.

Now, I don't want to use the dedupe feature because it is really resource-consuming and I have a lot of unique files in my system, but I'm still interested in using those per-file hashes that the feature describes, maybe to check for duplicates every month, or to compare against external storage for parity checks.

Is there any way to get those hashes directly from TrueNAS? Are they saved in some database I can query?
Or do you have any suggestions on how I should approach this?

The idea is to reduce the time it takes to get those hashes. Right now it takes days for a separate machine to calculate each hash one by one, because we are talking about hundreds of files in the range of 30 to 300 GB.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
correct me if I am wrong
You are, and rather badly, I'm afraid. ZFS deduplication isn't file-based, so ZFS doesn't store checksums of whole files anywhere. Instead, it stores checksums of blocks as part of its routine operation. Deduplication, in effect, loads that table of block checksums into RAM and uses it to determine how to write data to disk. There's a good bit of nuance there that others may correct me on, but that's the gist of it. I don't think anything that's part of ZFS (or of TrueNAS, really) is going to do what you're asking for, short of using the hashing tools that are already there (sha256, etc.) to generate the hashes once the files are on the system.
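For instance, a minimal sketch of that approach in Python (the dataset path and manifest location are placeholders you'd adapt; this is not anything built into TrueNAS):

```python
#!/usr/bin/env python3
"""Minimal sketch: walk a dataset and record a SHA-256 hash per file.

Hypothetical example, not a built-in TrueNAS feature. DATASET and
MANIFEST are placeholder paths to adapt to your own pool layout."""
import hashlib
import os

DATASET = "/mnt/tank/data"          # hypothetical dataset mountpoint
MANIFEST = "/mnt/tank/hashes.txt"   # hypothetical output manifest

def sha256_of(path, chunk=1024 * 1024):
    """Stream the file in 1 MiB chunks so 300 GB files don't exhaust RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

with open(MANIFEST, "w") as out:
    for root, _dirs, files in os.walk(DATASET):
        for name in files:
            path = os.path.join(root, name)
            # Same "<hash>  <path>" layout that sha256sum produces,
            # so the manifest can be checked with standard tools later.
            out.write(f"{sha256_of(path)}  {path}\n")
```

Running it directly on the NAS at least avoids pulling every byte over the network just to hash it.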
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
To add to what @danb35 said, the rsync program will perform some of what you want:
  • If the file has the same date & time on the source as on the destination, it is treated as unchanged
  • If the file has a different date & time, rsync can perform block checksums and copy only the differing blocks across the network
  • If you don't trust dates and times, you can tell it to always perform the block checksum check
And if you don't want rsync to copy the blocks / files that differ, it can simply be made to list them (see the sketch below).
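As a rough illustration of that listing mode, wrapped in Python just so it's reusable (the source and destination paths are hypothetical): `-n` is a dry run so nothing is copied, `-c` forces full checksum comparison instead of the date/size check, and `--itemize-changes` prints one line per file that differs.

```python
#!/usr/bin/env python3
"""Sketch: use rsync in dry-run mode to list files that differ.

Hypothetical paths. Nothing is copied because of -n; rsync only
reports what it *would* transfer."""
import subprocess

SRC = "/mnt/tank/data/"                  # trailing slash: compare contents
DST = "backup-host:/mnt/backup/data/"    # hypothetical remote destination

result = subprocess.run(
    ["rsync", "-rn", "-c", "--itemize-changes", SRC, DST],
    capture_output=True, text=True, check=True,
)
print(result.stdout)   # one itemized line per differing file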

There are a TON of options, like making backups of the files that differ (I've seen the option, but have not used it), or copying extended attributes or access control lists.


To be clear, rsync does not read the pre-computed ZFS record-level checksums; it does its own blocking and checksumming. And even if rsync could read the ZFS record-level checksums, they would sometimes be worthless between 2 servers, because you can select different checksum algorithms AND have different record sizes on each one.

There are 3 things that affect ZFS block checksums:
  • The checksum algorithm (ZFS supports 6 at present)
  • The record size (ZFS supports 512 bytes to 1 MB)
  • The underlying pool layout, which affects the width of the ZFS blocks being written
All 3 of these can differ between 2 ZFS servers.
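If you want to confirm the first two settings on each box, the standard `zfs get` command reports them. A small sketch (the dataset name is a placeholder; run the same thing on both servers and compare):

```python
#!/usr/bin/env python3
"""Sketch: report a dataset's checksum algorithm and record size.

DATASET is a placeholder; -H drops the header row and -o limits the
columns, which keeps the output easy to diff between two servers."""
import subprocess

DATASET = "tank/data"   # hypothetical dataset

out = subprocess.run(
    ["zfs", "get", "-H", "-o", "property,value",
     "checksum,recordsize", DATASET],
    capture_output=True, text=True, check=True,
).stdout
print(out)
```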
 
Joined
Oct 22, 2019
Messages
3,641
You can also use a tool like Czkawka on the client side, which keeps a cache of known hashes, so that each subsequent run doesn't rehash every single file.

It essentially works like this:
  1. It creates a list of files with their locations, file sizes, and modification timestamps
  2. If two or more files have exactly the same file size, it calculates a fast "partial hash" of the beginning of each file
    • If the "partial hashes" differ, it does not proceed any further (the files are obviously different, so there's no need to waste CPU calculating the hash of the entire file)
  3. If the "partial hashes" are the same, it goes ahead and calculates the hash of the entire file
  4. These "partial hashes" and full hashes are stored in a cache file
  5. The next time you run the application, it repeats steps 1-3, but during steps 2 and 3 it uses any cached hashes (as long as the file's modification time hasn't changed since the last run). This greatly saves CPU and time!
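To make that concrete, here is a rough Python sketch of the same size-group / partial-hash / full-hash / cache idea. This is my own illustration, not Czkawka's actual code, and the cache file path is a placeholder:

```python
#!/usr/bin/env python3
"""Rough illustration of a Czkawka-style hash cache (not Czkawka's code).

Files are grouped by size, a cheap hash of the first 1 MiB weeds out
obvious non-duplicates, and full hashes are computed (and cached) only
for the survivors. Cache entries are keyed on path, size, and mtime,
so unchanged files are never rehashed on later runs."""
import hashlib
import json
import os
from collections import defaultdict

CACHE_FILE = "hash_cache.json"   # hypothetical cache location
PARTIAL_BYTES = 1024 * 1024      # step 2: hash only the first 1 MiB

def load_cache():
    try:
        with open(CACHE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def file_hash(path, cache, limit=None):
    """SHA-256 of the first `limit` bytes (or the whole file), cached."""
    st = os.stat(path)
    key = f"{path}|{st.st_size}|{st.st_mtime}|{limit}"
    if key in cache:
        return cache[key]        # file unchanged since last run: reuse
    h = hashlib.sha256()
    remaining = limit if limit else st.st_size
    with open(path, "rb") as f:
        while remaining > 0:
            block = f.read(min(remaining, 1024 * 1024))
            if not block:
                break
            h.update(block)
            remaining -= len(block)
    cache[key] = h.hexdigest()
    return cache[key]

def find_duplicates(root):
    cache = load_cache()
    # Step 1: group files by size; a unique size can't be a duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    dups = defaultdict(list)
    for _size, paths in by_size.items():
        if len(paths) < 2:
            continue
        # Step 2: cheap partial hash weeds out obvious non-duplicates.
        by_partial = defaultdict(list)
        for p in paths:
            by_partial[file_hash(p, cache, limit=PARTIAL_BYTES)].append(p)
        # Step 3: full hash only where the partial hashes collide.
        for group in by_partial.values():
            for p in group if len(group) > 1 else []:
                dups[file_hash(p, cache)].append(p)
    # Step 4: persist the cache so the next run can skip unchanged files.
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return {h: ps for h, ps in dups.items() if len(ps) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates("/mnt/tank/data").items():
        print(digest, paths)
```

The key design point is that the cache key includes size and mtime, so a touched or modified file is transparently rehashed while everything else is free.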
 