Small file writes causing high disk writes

Status
Not open for further replies.

benqrn

Cadet
Joined
May 11, 2017
Messages
5
I am seeing an issue where occasional small file writes cause a much larger disk write, which is tearing through the endurance of my SSDs. This seems to be ZFS committing a mostly empty transaction to the disk, but regardless of recordsize, a small (e.g. 1KB) file write causes a 200-400KB transaction.

My question is: is this expected behavior that cannot be changed? Or am I missing a setting that would reduce the size of individual transactions?


I'm running FreeNAS-9.10.2-U1. This is a single-disk pool on a USB HDD, with a 4K record size and ashift=12. I see the same behavior on a PCI-e SSD and on a pool of 6x SAS spinning disks behind a PERC H200.

I ran these two commands in this order. In the iostat output below, the very small dd write produces a very large transaction.

# dd if=/dev/random of=testfile1 bs=512 count=100000
predictably writes 50MB of data to the disk

# dd if=/dev/random of=testfile2 bs=512 count=1
512 bytes end up writing 343KB to the disk


# zpool iostat testusbhdd 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
testusbhdd  50.2M  3.62T      0      0      0      0
testusbhdd   100M  3.62T      0    427      0  50.0M
testusbhdd   100M  3.62T      0      0      0      0
testusbhdd   100M  3.62T      0      0      0      0
testusbhdd   100M  3.62T      0      0      0      0
testusbhdd   100M  3.62T      0      0      0      0
testusbhdd   100M  3.62T      0     26      0   343K
testusbhdd   100M  3.62T      0      0      0      0
testusbhdd   100M  3.62T      0      0      0      0
testusbhdd   100M  3.62T      0      0      0      0
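
To put numbers on the two samples above, here is a small sketch that works out the ratio of pool writes to payload and the average size of each write operation. It assumes the K and M suffixes in zpool iostat are powers of 1024 and that each one-second sample captures the whole burst; the function name `summarize` is made up for illustration.

```python
def summarize(payload_bytes, pool_bytes, ops):
    """Return (pool bytes per payload byte, average KiB per write op)."""
    ratio = pool_bytes / payload_bytes
    avg_op_kib = pool_bytes / ops / 1024
    return ratio, avg_op_kib

# dd bs=512 count=100000 -> iostat sample: 427 write ops, 50.0M
big_ratio, big_avg = summarize(512 * 100000, 50.0 * 1024 * 1024, 427)

# dd bs=512 count=1 -> iostat sample: 26 write ops, 343K
small_ratio, small_avg = summarize(512 * 1, 343 * 1024, 26)

print(f"large write: {big_ratio:.2f}x payload, {big_avg:.1f} KiB per op")
print(f"small write: {small_ratio:.0f}x payload, {small_avg:.1f} KiB per op")
```

Under those assumptions the large write lands close to 1:1, while the 512-byte write is amplified several hundredfold, spread over 26 fairly small operations rather than one big one.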
 

benqrn

Cadet
Joined
May 11, 2017
Messages
5
The real issue I'm trying to solve is why approximately 1MB of writes going over an NFS mount ends up becoming 10MB worth of LBA writes to a single SSD I have on FreeNAS. The data on the NFS mount is KVM guest disk images stored on FreeNAS, and one guest in particular (a monitoring application updating every minute with many small writes) is causing most of the writes.

Someone sent me a message about extending the transaction length to 30 seconds, so I ended up tinkering with that and some other settings. Extending the transaction did lower the writes by about 30%. Forcing the max record size to 16KB (anything smaller just added checksum overhead and didn't reduce overall disk writes) and re-copying the guest disk as a file with a 16KB record size reduced writes by another 30%.

Temporarily disabling zfs sync (causing zfs to skip the pool's on-disk ZIL for all writes) reduced writes by another 10%, but I'm left with roughly 3MB being written to the SSD for every 1MB of data from the guest. Some write amplification is unavoidable in all SSDs, but 3:1 seems high given that the writes are now grouped into much larger transactions that are several MB in size (the disk is usually idle for almost the full 30 second txg timeout with these tweaks). Still, 3:1 is a big improvement over 10:1.

A coincidence I just noticed from my original post: 512B causes a ~343KB write (without the tweaks). In zdb I see that 'iblk' is 128K for ANY file, regardless of file size and record (dblk) size. Does this mean that the minimum write, regardless of record size, is actually 128KB? If that is the case, the same roughly 3:1 write amplification in the SSD is being observed here as well.

There may be nothing else I can tune, since much of the IO is <4KB block changes. I found an article reviewing write amplification of Samsung SSDs (my test SSD is a Samsung 950 Pro): http://www.anandtech.com/show/8239/update-on-samsung-850-pro-endurance-vnand-die-size

What is the 'iblk' column in zdb, and what is the real minimum write size to a disk in zfs? I have a pool created a while ago in a previous version of FreeNAS where files show a 16KB 'iblk', which indicates something was changed, and this is not a setting I can find a toggle for. FYI, 'iblk' is not governed by the recordsize property; 'dblk' is the record size (up to the max value set by recordsize).
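
The post does not say whether each percentage is measured against the original 10:1 figure or against what was left after the previous tweak. A quick sketch of both readings, using only the numbers quoted above, suggests the roughly 3:1 result fits the first one; the variable names are illustrative.

```python
baseline = 10.0                   # MB written to the SSD per 1 MB from the guest
reductions = [0.30, 0.30, 0.10]   # txg timeout, 16KB recordsize, sync disabled

# Reading 1: every percentage is a share of the original 10:1 baseline.
additive = baseline * (1 - sum(reductions))

# Reading 2: every percentage applies to whatever remained after the previous tweak.
compounded = baseline
for r in reductions:
    compounded *= 1 - r

print(f"relative to baseline: {additive:.2f}:1")
print(f"compounded:           {compounded:.2f}:1")
```

The first reading gives about 3:1 and the second about 4.4:1, so the figures in the post are most consistent with reductions measured against the untuned baseline.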
 

benqrn

Cadet
Joined
May 11, 2017
Messages
5
Bumping this thread... The only remaining item for me here is: what is 'iblk' in zdb, and why did it appear to change during FreeNAS 9.x from 16KB to 128KB? (iblk also appears to be unalterable.)
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I would guess indirect block. It is a metadata block similar to an inode list. If your small writes are going into a file that is more than tiny, then an iblk write of 128K (before compression) would also be part of the transaction. The smaller your dblk (recordsize) for a given file, the more iblks you are going to need, which is why it would be illogical to allow user control over this particular metadata item.

For a tiny file, the iblk will be similarly tiny or even non-existent.
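
As a rough illustration of why a smaller dblk needs more iblks, the sketch below counts first-level indirect blocks for a file. It assumes a 128-byte ZFS block pointer, the 128K iblk size reported by zdb in this thread, and that a single-block file needs no indirect block at all; higher indirection levels are ignored, and the function name is made up for the example.

```python
import math

BLKPTR_BYTES = 128        # assumed on-disk size of one block pointer
IBLK_BYTES = 128 * 1024   # iblk size reported by zdb in this thread

def l1_indirect_blocks(file_bytes, dblk_bytes):
    """Estimate how many first-level indirect blocks a file needs."""
    ptrs_per_iblk = IBLK_BYTES // BLKPTR_BYTES      # 1024 pointers per iblk
    data_blocks = math.ceil(file_bytes / dblk_bytes)
    if data_blocks <= 1:
        return 0  # simplification: a one-block file has no indirect block
    return math.ceil(data_blocks / ptrs_per_iblk)

one_gib = 2 ** 30
for dblk in (4 * 1024, 16 * 1024, 128 * 1024):
    count = l1_indirect_blocks(one_gib, dblk)
    print(f"dblk {dblk // 1024:>3}K -> {count} indirect blocks")
```

With these assumptions, each 128K iblk maps 4 MiB of file at a 4K dblk but 128 MiB at a 128K dblk, so a scattered set of small writes touches many more distinct iblks when the record size is small.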
 