What is the exact checksum size overhead?

Bidule0hm · Mar 6, 2015

I'm creating a RAID size calculator and I need to know the overhead size of the checksums.

I've searched but I can't even find the checksum type used by ZFS in FreeNAS.

Does anyone have the value?

zambanini · Mar 6, 2015

42. scnr

Bidule0hm · Mar 6, 2015

Can you expand a bit please?

zambanini · Mar 6, 2015

scnr..sorry could not resist.

The Hitchhiker's Guide to the Galaxy as the "Answer to The Ultimate Question of Life, the Universe, and Everything".

the answer: 42

Bidule0hm · Mar 6, 2015

Ok, I see...

danb35 · Mar 6, 2015

What I've seen usually works out to around 3%. There's a YouTube video of Dru explaining ZFS where she says that 1/64 the pool capacity is used for ZFS internal stuff (metadata, checksums, etc.), but that would only account for about 1.5%.

Bidule0hm · Mar 6, 2015

Thanks ;)

From what I've read (here especially) the 1/64 is for the CoW, I don't know who to believe now

mav@ · Mar 7, 2015

I don't think it is possible to give single right answer about overhead. There are many kinds of metadata in ZFS and so many kinds of overhead. Also many parameters there are variable. For example, if we look on indirection tables, that actually store data block checksums, then first level overhead is 2*128 bytes per block. If the block is 128K and the file is large, then it will be ~0.2%, if the block is 8K, then ~3%. But that is only for indirection tables, while there are also directories, dnodes, free space maps, etc. On the other side ZFS compresses metadata with lz4, that may reduce the final number.

Bidule0hm · Mar 7, 2015

Yeah, ZFS is very simple: http://people.freebsd.org/~gibbs/zfs_doxygenation/html/d1/d4d/structmetaslab.html (and that is just the metaslabs...)

128B? it's not SHA256 used on FreeNAS so 256B? (not sure at all because can't find a reliable source of info)

Ah, it's one checksum per block (two if we count the backup copy), it's not one per 128k. There is any means to know the number of used blocks binned by the size of the blocks?

Any means to know the blocks overhead?

Let's ignore compression, directories, ...

mav@ · Mar 7, 2015

Bidule0hm said:
128B? it's not SHA256 used on FreeNAS so 256B?

Pointers in indirection blocks use 128 bytes per data block. Of those 128 bytes: 48 bytes for up to three block pointers, 32 bytes for checksums (sha256 is 32 bytes long) and 48 bytes for other uses. There are two or three copies of each indirection blocks.
http://www.giis.co.in/Zfs_ondiskformat.pdf

Bidule0hm · Mar 7, 2015

God, I meant 256 bits, bitten by the B/b mistake, shame on me.

Ok, so there is 128 bytes per block of data (which can be 4k to 128k* if I understand correctly how it works) for checksums and block pointers.

But I know that it's a Merkle tree so there is a lot of blocks who contains only block pointers ((2 * N) - 1 for N blocks of data) so what is the size of these blocks? It seems to me that it's a lot of space used just for block pointers, am I wrong with my thoughts on the structure?

*what are the possible sizes of the blocks? it's 128k, 64k, 32k, ... 4k or it's more complex?

mav@ · Mar 7, 2015

Bidule0hm said:
But I know that it's a Merkle tree so there is a lot of blocks who contains only block pointers ((2 * N) - 1 for N blocks of data) so what is the size of these blocks? It seems to me that it's a lot of space used just for block pointers, am I wrong with my thoughts on the structure?

Indirection blocks are indeed joined in a tree with variable depth, but since it is indexed not by hash but by block number there is no holes, unless file itself has holes. And since node of each level points to 1024 child nodes, space occupied by parent nodes is not significant.

There are other trees in ZFS that may have holes: dedup tables, free space tables, etc. But I don't know much about them.

Bidule0hm said:
*what are the possible sizes of the blocks? it's 128k, 64k, 32k, ... 4k or it's more complex?

Block sizes can be set to any power of 2 from 512B to 128KB. Each file has own block size value copied from dataset value during first file write and never change after that. But files that are smaller then that are not occupying whole block, but use less space. Compression also takes place there -- block can be compressed down to vdev ashift (512 or 4K), or even into nothing if after compression data can fit into the pointer space.

Bidule0hm · Mar 8, 2015

Ok, so it's not a binary tree. I'm reassured :)

"Block sizes can be set to any power of 2 from 512B to 128KB" there is blocks smaller than 4kB even if ashift=12?

"Each file has own block size value copied from dataset value" dataset value? can you expand on that?

"if after compression data can fit into the pointer space." The pointer space isn't reserved for the blocks pointers? and even if the data can be here, it's still a block so the data space of the block is empty, no?

Thanks for your help to understanding how this works ;)

mav@ · Mar 8, 2015

Bidule0hm said:
"Block sizes can be set to any power of 2 from 512B to 128KB" there is blocks smaller than 4kB even if ashift=12?

ZFS allows to do that, but it would be very space-inefficient. Same way ZFS allows 8K blocks on RAIDZ3, where as result for each 8K data will be 12K of redundancy even if vdev is much wider. It is supposed that all configuration decisions should be reasonable. :)

Bidule0hm said:
"Each file has own block size value copied from dataset value" dataset value? can you expand on that?

Each dnode (file of zvol) has block size stored in its metadata. ZFS uses that shift to convert file offsets into indirection pointer offsets. That value can not be changed after dnode already has some data/pointers. So if block size (recordsize) is changed for existing dataset, then new value will be used only for new files, while existing one remain as-is. For zvols block size simply can not be changed after the first write.

Bidule0hm said:
"if after compression data can fit into the pointer space." The pointer space isn't reserved for the blocks pointers? and even if the data can be here, it's still a block so the data space of the block is empty, no?

Originally it was reserved, but some time ago there was added a new pool feature called embedded_data. If this feature is enabled and data can be compressed down to 112 bytes, then they are stored directly inside the pointer. See zpool-features man page.

Bidule0hm · Mar 8, 2015

Thanks for the details, it's clearer now ;)

So obviously we can't calculate a precise percentage of the checksums or other things overhead because the size of blocks isn't fixed. There is any means to know the number of the used blocks binned by the size of the blocks on a pool and/or dataset?

cyberjock · Mar 8, 2015

Yes. If you unmount the pool and do some "zfs debugging" with zdb, it is doable. Basically, what you want to do with calculating overhead is effectively impossible. I've seen pools that had very small overhead (like less than 5%) and I've seen pools that exceeded 25%. Depends on your use-case and how you actually use it. If you are doing iSCSI, expect the higher end. ;)

Bidule0hm · Mar 8, 2015

Well, I may not want unmount my main and only one pool with live data on it to do this :)

You talk about the total overhead including the blocks overhead, directories overhead, ...?

So in short if you have some big files and a few directories you have roughly 5% of total overhead and if you have plenty of very small files scattered in many directories you have roughly 25%, or I'm wrong?

cyberjock · Mar 8, 2015

I'm talking overhead for pretty much everything. So if I have a 500GB file and the pool is exactly 1024GB formatted empty capacity (not really possible in the real world, but go with it) and I have only 450GB of free space, then the overhead in my book is 74GB (or approximately 14.8%).

Bidule0hm · Mar 8, 2015

Ok, thanks ;)

SirMaster · Mar 9, 2015

ZFS always reserves 1/64th of the pool for all the metadata so you can put that into the calculator.

However there are other overheads depending on the raidz disk layout.

Free space calculation is done with the assumption of 128k block size. Each block is completely independent so sector aligned and no parity shared between blocks. This creates overhead unless the number of disks minus raidz level is a power of two. Above that is allocation overhead where each block (together with parity) is padded to occupy the multiple of raidz level plus 1 (sectors). Zero overhead from both happens at raidz1 with 2, 3, 5, 9 and 17 disks and raidz2 with 3, 6 or 18 disks

For example, high overhead with 10 disks in raidz2 for is because of allocation overhead and can be calculated as follows:

128k / 4k = 32 sectors,
32 sectors / 8 data disks = 4 sectors per disk,
4 sectors per disk * (8 data disks + 2 parity disks) = 40 sectors.
40 is not a multiple of 3 so 2 sector padding is added. (5% overhead)

Here is what the total amount of overhead in TiB looks like for 6-18 3TB disks in RAIDZ2.

6: 0.2602885812520981
7: 1.1858475767076015
8: 1.149622268974781
9: 0.7288754601031542
10: 1.3953630803152919
11: 2.061850775964558
12: 2.915792594663799
13: 1.5491229379549623
14: 2.056995471008122
15: 2.5648680040612817
16: 3.0727405650541186
17: 3.5806130981072783
18: 0.7912140190601349

Also realize that this is just for ashift=12. If you were to use ashift=9 you would see much less overhead, but performance would suffer and is not recommended.

If you want to see more examples of what the overhead ends up as you can do so with thinly provisioned disks in a VM.

Important Announcement for the TrueNAS Community.

What is the exact checksum size overhead?

Server Electronics Sorcerer

Patron

Server Electronics Sorcerer

Patron

Server Electronics Sorcerer

Hall of Famer

Server Electronics Sorcerer

iXsystems

Server Electronics Sorcerer

iXsystems

Server Electronics Sorcerer

iXsystems

Server Electronics Sorcerer

iXsystems

Server Electronics Sorcerer

Inactive Account

Server Electronics Sorcerer

Inactive Account

Server Electronics Sorcerer

Patron

Similar threads