How much free space needed for a read-only volume?

kbowman

Cadet
Joined
May 16, 2020
Messages
2
I have many hundreds of TB of geophysical data spread across a number of RAIDZ-2 volumes. Once a volume is filled, it is used essentially as read-only storage. Very rarely I need to add small amounts of data to a volume. I use ZFS mainly for its robust data integrity. I do not do de-duplication or snapshots. I scrub each volume quarterly.

The standard recommendation is to keep 10 to 20% free space on a volume. My understanding is that this is due to the write strategy of ZFS. In my case this leaves many tens of TB of space unused.

So how much free space do I really need on a volume that is effectively read-only?

Thanks, Ken
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
So how much free space do I really need on a volume that is effectively read-only?
The real question is how sure you are about this read-only thing... if you're 100% sure, then you could go to almost 100% full (I suppose 99%, to leave a small margin of safety) and never see an issue, other than the warnings in logs and alerts.

As soon as you try to write or delete anything, though, you're in deep trouble: copy-on-write (CoW) means ZFS needs free space even to delete.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Welcome to the forums Ken.

I think in this sense you might want to quantify just how much data you're talking about; 10% of a 10TB volume in someone's home will be 1TB, but 10% of a 100TB (or PB-sized) volume is far more substantial. You also mentioned "very rarely I need to add small amounts of data" - are we talking another TB, a few dozen GB, or a few hundred MB?

In the case of a truly "write once, read many, overwrite/delete NEVER" archival-style workflow, you can indeed push past the recommended 10-20% free space. That recommendation exists to preserve write performance, and you're exactly right that ZFS does best when it can write into contiguous free space. Beyond a given threshold (I believe 94% on FreeNAS) it switches to a "best-fit" or "hole-filling" allocation strategy, which makes writes take a lot longer.
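A quick way to keep an eye on where a pool sits relative to that threshold is zpool list; the "capacity" and "fragmentation" columns are the ones to watch ("tank" below is just a placeholder for your pool name):

zpool list -o name,size,allocated,free,capacity,fragmentation tank

Note that "fragmentation" there describes the fragmentation of the remaining free space, which is exactly what makes new writes slower as the pool fills up.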

Notably, read performance is unaffected by this. And if you're writing large amounts of data sequentially and never deleting or overwriting, your overall file fragmentation will be minimal, so your streaming reads will still be great.

I'd set up multiple threshold warnings, depending on how big the "small amounts of data" rarely added are. You absolutely never want to reach a true "100% pool full" level, as the transactional nature of ZFS means you always need a small amount of space for updating metadata.

Assuming something like a 250T pool, setting a threshold of 95% for your "initial warning" means you still have (roughly) 12.5T left - enough time to stop the "primary writes" and build up another chunk of storage. At 98%, you have 5T free - this should be more than enough for any of the "small updates" you mentioned, as well as housekeeping and deletes. At 99% (2.5T) you probably should stop entirely to avoid getting into a bind.
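If you want those thresholds checked outside the GUI alerting as well, a rough, untested sketch of a cron-able script would be something like the following - the pool name, thresholds, and mail recipient are all placeholders, and it assumes local mail delivery works on your box:

#!/bin/sh
# Rough sketch: warn as pool capacity crosses the thresholds discussed above.
POOL="tank"    # placeholder - use your real pool name
CAP=$(zpool list -H -o capacity "$POOL" | tr -d '%')
if [ "$CAP" -ge 99 ]; then
    echo "${POOL} at ${CAP}%: stop all writes" | mail -s "ZFS capacity critical" root
elif [ "$CAP" -ge 98 ]; then
    echo "${POOL} at ${CAP}%: small updates only" | mail -s "ZFS capacity warning" root
elif [ "$CAP" -ge 95 ]; then
    echo "${POOL} at ${CAP}%: plan the next chunk of storage" | mail -s "ZFS capacity notice" root
fi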

And since we're talking about "hundreds of TB of data" I'd be remiss if I didn't point out what you probably already know; "ZFS and RAID are not a backup strategy" - make sure you have some way to retain this outside of the disks, in case of a catastrophic failure. With that volume, something like LTO tape is a good option.

Further thought; if you're more concerned with space than initial ingest speed, you could also do a test run to see if you get significantly better results from using gzip-9 compression vs. LZ4 on your data.
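For that test, something along these lines would do it - the dataset names are just examples, and "tank" again stands in for your pool:

zfs create -o compression=gzip-9 tank/comptest_gz9
zfs create -o compression=lz4 tank/comptest_lz4
# copy the same representative sample of files into both datasets, then compare:
zfs get compressratio tank/comptest_gz9 tank/comptest_lz4
# clean up when done:
zfs destroy tank/comptest_gz9
zfs destroy tank/comptest_lz4

If the compressratio comes back close to 1.00x on both, the data just isn't very compressible and LZ4 is the cheaper choice CPU-wise.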
 

kbowman

Cadet
Joined
May 16, 2020
Messages
2
Welcome to the forums Ken.

I think in this sense you might want to quantify just how much data you're talking about; 10% of a 10TB volume in someone's home will be 1TB, but 10% of a 100TB (or PB-sized) volume is far more substantial. You also mentioned "very rarely I need to add small amounts of data" - are we talking another TB, a few dozen GB, or a few hundred MB?

Thanks very much for both of the comments.

I currently have 8 "full" volumes with an average size of 60 TB (480 TB total). About 60 TB total (~12%) of those volumes is unused. These volumes are used only for the data archive.

The data consist of large numbers of relatively small files (MB-scale) organized by observing site and time. I have taken care to add the data sequentially when writing to the archive, as that is the way in which they are usually accessed, so there should be very little file fragmentation.

I have added new observing sites to the archive twice: once a larger chunk (~20%) and more recently a smaller chunk (<1%). When doing this, I shifted data between volumes as necessary to create sufficient space and to keep the data continuous in time. The likelihood of adding more observing sites is very low, but if I do, I can shift data among the volumes to create the necessary space.

I continue to add data to the archive. In a few years it will probably reach 700 - 800 TB, with the free space on "full" volumes approaching 100 TB.

This is an academic research project, so cost is a major constraint. If I can use part of that 100 TB without significantly impacting performance, it will be quite beneficial. Reducing the free space to 5%, or even 2%, would still leave >1 TB of free space per volume.

And since we're talking about "hundreds of TB of data" I'd be remiss if I didn't point out what you probably already know; "ZFS and RAID are not a backup strategy" - make sure you have some way to retain this outside of the disks, in case of a catastrophic failure. With that volume, something like LTO tape is a good option.

I am well aware of that, thanks. This is not the only copy of the data, although it does take considerable time and effort to download it. Maintaining two copies of the data locally is not financially feasible. If I lose a chunk of the data, I will have to re-download it.

Further thought; if you're more concerned with space than initial ingest speed, you could also do a test run to see if you get significantly better results from using gzip-9 compression vs. LZ4 on your data.

The data use a complex internal storage format. Older files are compressed with bzip2 as whole files. Newer files use bzip2 internally to compress blocks of data.

Uncompressed, the data archive would be many PBs.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One trick to prevent 100%-full problems is to set up a dataset that you use purely as a reservation of space:

zfs create -o reservation=10M POOL/res_space

Then, if you accidentally fill the pool up, you can remove the reservation on that empty dataset, which frees up 10 MBytes in this example. The exact size to use is debatable, and more modern OpenZFS releases have other tricks around the free-space issue.
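If the pool ever does hit that point, releasing the reservation is just a matter of (same placeholder pool/dataset names as above):

zfs set reservation=none POOL/res_space

and once you've deleted whatever needed deleting, you can put the reservation back with zfs set reservation=10M POOL/res_space.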
 