Unpredictably high zvol used (ZFS bug?) Would file-based iSCSI extents have predictable total usage?

jimp
Dabbler | Joined: Feb 24, 2014 | Messages: 25
tl;dr
(but please read, I suspect this is a ZFS bug that FreeNAS 9.2.1 is working around.)
  1. Will file-based iSCSI extents have unpredictable disk usage? Will any more disk space be used beyond the size of the extent?
  2. Why does a zvol report "usedbydataset" at all? Isn't that only for datasets?
  3. Is this a ZFS bug in the zvol code, or is this blocksize-dependent usage-creep behavior intended by the ZFS designers?
Full Report

I have been trying to figure out where all the space used by my 250G zvol is going. The zvol is used with iSCSI, and while the initiator has formatted it for 250G, zfs reports 405G are actually in use with only 164G logically used.

Build Specs
Code:
Build	FreeNAS-9.2.0-RELEASE-x64 (ab098f4)
Platform	Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
Memory	16336MB

  • pool1 is a mirror of two 1TB disks (4k, ashift=12).
  • pool1/iscsi is a dataset with nothing in it. (Maybe file-based extents later.)
  • pool1/iscsi/hdd0 is a zvol with the (recently learned) default 8k blocksize used on creation. A XenServer LVM-based Storage Repository lives here.
Code:
# zfs get all pool1/iscsi/hdd0
NAME              PROPERTY              VALUE                  SOURCE
pool1/iscsi/hdd0  type                  volume                 -
pool1/iscsi/hdd0  creation              Thu Feb  6 17:32 2014  -
pool1/iscsi/hdd0  used                  405G                   -
pool1/iscsi/hdd0  available             766G                   -
pool1/iscsi/hdd0  referenced            133G                   -
pool1/iscsi/hdd0  compressratio         1.12x                  -
pool1/iscsi/hdd0  reservation           none                   default
pool1/iscsi/hdd0  volsize               250G                   local
pool1/iscsi/hdd0  volblocksize          8K                     -
pool1/iscsi/hdd0  checksum              on                     default
pool1/iscsi/hdd0  compression           lz4                    inherited from pool1/iscsi
pool1/iscsi/hdd0  readonly              off                    default
pool1/iscsi/hdd0  copies                1                      default
pool1/iscsi/hdd0  refreservation        258G                   local
pool1/iscsi/hdd0  primarycache          all                    default
pool1/iscsi/hdd0  secondarycache        all                    default
pool1/iscsi/hdd0  usedbysnapshots       14.7G                  -
pool1/iscsi/hdd0  usedbydataset         133G                   -
pool1/iscsi/hdd0  usedbychildren        0                      -
pool1/iscsi/hdd0  usedbyrefreservation  257G                   -
pool1/iscsi/hdd0  logbias               latency                default
pool1/iscsi/hdd0  dedup                 off                    default
pool1/iscsi/hdd0  mlslabel                                     -
pool1/iscsi/hdd0  sync                  standard               default
pool1/iscsi/hdd0  refcompressratio      1.10x                  -
pool1/iscsi/hdd0  written               549M                   -
pool1/iscsi/hdd0  logicalused           164G                   -
pool1/iscsi/hdd0  logicalreferenced     146G                   -

The space usage is as follows: 257G usedbyrefreservation (250G reserved + 7G for metadata), 133G usedbydataset, and 14.7G usedbysnapshots = 405G used. It does add up. But why do I have any "usedbydataset" when this is not a dataset?
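
For anyone who wants to check this kind of accounting on their own zvol, the components can be pulled straight from zfs with exact byte values (a quick sketch using my dataset name; "used" should equal the sum of the four usedby* properties):
Code:
# zfs get -Hp -o property,value used,usedbydataset,usedbysnapshots,usedbyrefreservation,usedbychildren pool1/iscsi/hdd0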

I have found other forum topics (here and on ServerFault) and a resolved bug report discussing how the problem occurs when using smaller block sizes on disks with 4k sectors, leading to poor block utilization and therefore more blocks needed to store the data. The only explanation I have found ([OpenIndiana-discuss] Inefficient zvol space usage on 4k drives) doesn't convince me this is expected behavior. Or if it is by design, it hardly feels sane/predictable. For example, that thread discusses how many raidz2 writes are actually being made, counting them toward zvol usage, but isn't the vdev supposed to present a single logical device that handles all of those details internally, leaving the zvol oblivious to the physical layout?
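
For reference, these are the two values that interact here and how I checked them on my system (a sketch; zdb on FreeNAS seems to need to be pointed at the system cache file, which I believe is /data/zfs/zpool.cache on 9.x):
Code:
# zfs get volblocksize pool1/iscsi/hdd0
# zdb -U /data/zfs/zpool.cache -C pool1 | grep ashift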

With everything I just outlined above, I am really not sure how to work with a zvol and get a predictable result. I cannot properly manage a SAN where I allocate 250G to XenServer via iSCSI but the reality is I have 405G allocated (and counting) out of 1TB. Even my zpool vs zfs allocations do not add up.
Code:
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
pool1   928G   147G   781G    15%  1.00x  ONLINE  /mnt
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
pool1              405G   508G   152K  /mnt/pool1
pool1/iscsi        405G   508G   144K  /mnt/pool1/iscsi
pool1/iscsi/hdd0   405G   766G   133G  -

(I would expect the zfs filesystem to report 257G + snapshots, uncompressed, but that leaves the 133G usedbydataset unaccounted for still.)

I would really appreciate some advice about how to properly optimize zvol usage so that if a zvol has 257G reserved, it will not use more than that (snapshots are separate, and should be). The advice circulating, which I think was used as the resolution for bug report #2383, is to minimize the problem with a larger block size. It should not be acceptable to have 257G reserved but end up with 275G "real" usage (just guessing) by using a 32k block size. This is like having sparse zvol behavior on a non-sparse volume--you cannot accurately predict when it's time to add new disks* without frequent monitoring.
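
For completeness, my understanding of that workaround would look something like the following (just a sketch; hdd0_32k is a made-up name, and since volblocksize is fixed at creation time, and as far as I can tell zfs send/recv preserves it, the data has to be copied at the block-device level and the iSCSI target repointed afterward):
Code:
# zfs create -V 250G -o volblocksize=32K pool1/iscsi/hdd0_32k
# dd if=/dev/zvol/pool1/iscsi/hdd0 of=/dev/zvol/pool1/iscsi/hdd0_32k bs=1m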

Unless someone can help me understand zvols better and convince me that this isn't really a ZFS bug that I'm reporting in the wrong place (at this point, all of my testing and analysis concludes it's a bug), I am planning to use file-based extents. Those will not have unpredictable total disk usage, since the "blocksize" is a zvol issue, right?

(* I have read that it's best practice to leave 20% or more free in a pool for best performance. To be clear, I'm not trying to fill up the pool to the last byte. I'm trying to know that if I allocate 80% today, it will stay allocated at 80% and not creep up to 100% over time, proper snapshot management notwithstanding.)
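
In the meantime, the only way I can see to catch the creep early is to watch the numbers periodically, something like this (a rough sketch; the properties are the same ones from the output above):
Code:
# zpool list -H -o capacity pool1
# zfs list -o name,volsize,used,logicalused,referenced pool1/iscsi/hdd0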
 

cyberjock
Inactive Account | Joined: Mar 25, 2012 | Messages: 19,526
There are a lot of factors that affect the actual disk usage:

1. zvols allocate all of their disk space immediately on creation. So the second you created it, it would be 250GB.
2. ZFS has optimized writing and is a CoW file system. Because of how this works, if you have a 1KB write to a RAIDZ2 vdev with ashift=12, you'll actually use 12KB on disk. This is because your smallest allocation is 4KB, plus you have to store 2 copies of parity data of the same size (see the rough math after this list). Now, if you had to write a 1GB file all in one go, then you'd probably use something like 1.0001GB of storage. But it gets better....
3. All blocks are checksummed, right? Well, that takes up space too! And block sizes can range from 4KB to 128KB. Bigger block sizes give you more data per checksum while smaller blocks give you less data per checksum.
4. For iSCSI, the recommendation is to keep the pool at 50% or less, because a CoW filesystem WILL fragment. And since there is no defrag tool for ZFS, performance is pretty much guaranteed to drop over time. 80% is horribly "overfilled" for a pool that is used for iSCSI.
5. If you choose a zvol-based iSCSI extent over a file-based one, there are more inefficiencies added. zvols have a fixed block (stripe) size, I believe. Choose a small size and you hurt large-write performance and use a lot more disk space. Choose a large size and you hurt small-write performance, but you potentially save lots of disk space. File-based extents are preferred for performance and simplicity for many of these reasons. It's also super easy to help optimize your iSCSI extent. You simply shutdown your iSCSI service, cp oldextent newextent, delete oldextent, mv oldextent newextent and you've regained the lost space.
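
To put rough numbers on item 2 (just a worst-case sketch using the ashift=12 example): a 1KB write can't be allocated smaller than one 4KB sector, and RAIDZ2 adds two parity sectors of the same size, so you burn 12KB of disk for 1KB of data. You can get a feel for the overall effect on your own zvol by comparing logical to physical usage; compression pulls the ratio down, while padding and partial blocks push it up:
Code:
# zfs get -o property,value logicalreferenced,referenced pool1/iscsi/hdd0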

There are other things at work too, but unless you have a deep understanding of ZFS, they aren't going to make sense to you.

I'd recommend, just based on the info provided, that you keep your pool at 50% full or less and move from a zvol to a file-based extent. You can expect iSCSI extents to always consume more than you expect if you use ZFS. ZFS is fairly inefficient with disk space unless the administrator does manual optimization of settings to keep things in check.
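
If you do switch, creating the backing file for a file-based extent from the shell is as simple as making a file of the right size and then pointing a new extent at it in the GUI (a sketch only; the path and name are just examples, and I believe the GUI can also create the file for you if you give it a size):
Code:
# truncate -s 250G /mnt/pool1/iscsi/extent0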

Check out this cool bug ticket from one of our moderators showing ZFS space consumption. There are plenty of tickets on this issue. There's one ticket (I couldn't find it, though) where someone wrote about 250GB of data and proved that you could use over 2TB of storage space because of this!

It's something you have to accept if you want ZFS. :(
 

tciweb
Cadet | Joined: May 13, 2015 | Messages: 3
It's also super easy to help optimize your iSCSI extent. You simply shutdown your iSCSI service, cp oldextent newextent, delete oldextent, mv oldextent newextent and you've regained the lost space.

Is this right? Shouldn't the last part be mv newextent oldextent?
 