On-disk cache and ZFS performance


Ender117

Patron
So I have been troubleshooting the awful raw disk write speed (~15 MB/s) over the past few days, and it turns out to be caused by a combination of the disabled on-disk cache (the 128 MB cache on each HDD), FreeBSD's hard limit of 128K max per IO, and the queue depth of 1 used by the dd command: https://forums.freenas.org/index.php?threads/troubleshooting-low-disk-write-speed.70217/
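
For anyone who wants to check the same thing on their own system: the write cache enable (WCE) bit lives in the SCSI Caching mode page, which camcontrol can show and edit. This is just a sketch; da2 is simply the device name on my system, and the exact fields shown vary by drive:
Code:
# show the Caching mode page (8) for da2 and look for the WCE field
camcontrol modepage da2 -m 8

# open the same page in an editor to change WCE (use with care)
camcontrol modepage da2 -m 8 -e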

Now with this sorted out, I decided to do some simple testing to see the impact of the on-disk cache on ZFS performance, which I would like to share.
I created two single-disk pools, cacheon (on-disk cache enabled) and cacheoff (on-disk cache disabled), and within each pool two datasets, syncon (sync=always) and syncoff (sync=disabled). Compression is disabled on all datasets.
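
For reference, this layout is roughly equivalent to the following commands (a sketch only; the FreeNAS UI normally handles pool creation, and da2/da3 are the disks that show up in the gstat output below):
Code:
zpool create cacheon da2
zpool create cacheoff da3
zfs set compression=off cacheon
zfs set compression=off cacheoff
zfs create -o sync=always cacheon/syncon
zfs create -o sync=disabled cacheon/syncoff
zfs create -o sync=always cacheoff/syncon
zfs create -o sync=disabled cacheoff/syncoff

The 1M-block dd runs against each dataset, along with the corresponding gstat output: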
Code:
root@freenas:/mnt/cacheon/syncon # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 33.871098 secs (31700827 bytes/sec)

 L(q)  ops/s	r/s   kBps   ms/r	w/s   kBps   ms/w   %busy Name
	6	306	  0	  0	0.0	276  31577	2.9   95.9| da2


root@freenas:/mnt/cacheon/syncoff # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.594406 secs (1806411267 bytes/sec)

 L(q)  ops/s	r/s   kBps   ms/r	w/s   kBps   ms/w   %busy Name
   0   1543	  0	  0	0.0   1543 197442	2.8   91.8| da2


root@freenas:/mnt/cacheoff/syncon # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 27.064734 secs (39673097 bytes/sec)

 L(q)  ops/s	r/s   kBps   ms/r	w/s   kBps   ms/w   %busy Name
5	458	  0	  0	0.0	416  40820   14.0   96.7| da3


root@freenas:/mnt/cacheoff/syncoff # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.723659 secs (1483767229 bytes/sec)

 L(q)  ops/s	r/s   kBps   ms/r	w/s   kBps   ms/w   %busy Name
  8	550	  0	  0	0.0	550  70345   11.9   97.1| da3



The first thing I noticed was that ZFS is smart enough to mitigate the awful raw speed, which is brilliant. With the disk cache disabled and sync=always, it can still do ~2.5x the raw speed by queuing up commands; with sync=disabled, everything gets written to RAM very quickly, and in the background the data is piped to disk at an even faster (>4x) speed.

Now with the disk cache enabled, things get more interesting. With sync=disabled, data again gets dumped into RAM at lightspeed, and in the background the disk writes at its full potential, thanks to the cache. This could be a benefit if ZFS is subject to heavy sustained sequential writes, like video surveillance? With sync=always, surprisingly, it is ~20% slower compared to the disk cache disabled. I can only assume this is because ZFS has to constantly tell the disk to flush its cache, which takes time.

This is more pronounced if we move to a smaller block size. I didn't bother to include sync=disabled because (I believe) if you are facing lots of small IOs, more likely than not the use case calls for sync=always (like VMs or databases). I also didn't include gstat results, since, as you can see above, they agree with dd pretty well when sync=always.
Code:
root@freenas:/mnt/cacheon/syncon # dd if=/dev/zero of=ddtest bs=16K count=8K
8192+0 records in
8192+0 records out
134217728 bytes transferred in 89.989479 secs (1491482 bytes/sec)

root@freenas:/mnt/cacheon/syncon # dd if=/dev/zero of=ddtest bs=4K count=8K
8192+0 records in
8192+0 records out
33554432 bytes transferred in 82.307550 secs (407671 bytes/sec)

root@freenas:/mnt/cacheoff/syncon # dd if=/dev/zero of=ddtest bs=16K count=8K
8192+0 records in
8192+0 records out
134217728 bytes transferred in 27.925497 secs (4806279 bytes/sec)

root@freenas:/mnt/cacheoff/syncon # dd if=/dev/zero of=ddtest bs=4K count=8K
8192+0 records in
8192+0 records out
33554432 bytes transferred in 23.053595 secs (1455497 bytes/sec)

We are seeing more than a 3x speedup just by disabling the on-disk cache! This may be counter-intuitive, but it makes sense if you think about the overhead of flushing the cache after every small-ish IO.
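
To put the 4K numbers another way: 8192 writes in 82.3 s is roughly 100 synchronous writes per second, i.e. about 10 ms per 4K write with the cache on, versus 8192 writes in 23.1 s, roughly 355 per second or about 2.8 ms each, with it off. That ~10 ms per IO looks consistent with paying a cache flush plus rotational latency on every small sync write, though that interpretation is only my guess.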


Now some thoughts:

1. Is it safe to enable the on-disk cache? Obviously, if data integrity is at risk, performance is meaningless. And AFAIK the cache on an HDD is volatile, which means data stored in it is gone in the event of a power failure. Given that ZFS is clearly flushing the cache for sync writes, I can only imagine it will also flush the cache before considering a transaction group committed to safe storage. However, I don't have any reference for this.
2. Should one disable the on-disk cache to get better performance with smaller IOs? The oversimplified test above may suggest so. But anyone in this situation (sync=always, small IOs) might be better off getting a proper SLOG. My understanding is that this effectively puts the HDDs into the same situation as sync=disabled, where data gets written to them in transaction groups, and the on-disk cache certainly helps with that.
3. About increasing the max IO size: setting it larger than 128K should be beneficial in many cases. I believe this is set with MAXPHYS; unfortunately, it cannot be changed short of recompiling the kernel: http://freebsd.1045724.x6.nabble.com/Time-to-increase-MAXPHYS-td6189400.html. I am not going to do it for fear of breaking something, but it should be doable; a sketch of what seems to be involved is below.
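
From my (limited) reading of that thread, the change appears to amount to bumping this definition in sys/sys/param.h and rebuilding the kernel. I have not tried it, so treat this purely as a sketch of what it seems to involve:
Code:
/* sys/sys/param.h (FreeBSD 11, as I understand it) -- stock value is 128K */
#ifndef MAXPHYS
#define MAXPHYS     (128 * 1024)    /* max raw I/O transfer size */
#endif
/* the linked thread discusses raising this, e.g. to (1024 * 1024) */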



I hope you find this interesting. Any input is appreciated; even a simple confirmation of my thoughts would help me build confidence as a n00b in FreeNAS/BSD/ZFS.
 

kdragon75

Wizard
First I would like to say thank you for taking the time to perform these tests and post the results.

The first thing I noticed was that ZFS is smart enough to mitigate the awful raw speed, which is brilliant. With the disk cache disabled and sync=always, it can still do ~2.5x the raw speed by queuing up commands; with sync=disabled, everything gets written to RAM very quickly, and in the background the data is piped to disk at an even faster (>4x) speed.
It's still not writing to RAM, but it will queue the IO. Sync writes are never acknowledged as done until they are safe on disk (or the SLOG).
Now with the disk cache enabled, things get more interesting. With sync=disabled, data again gets dumped into RAM at lightspeed, and in the background the disk writes at its full potential, thanks to the cache. This could be a benefit if ZFS is subject to heavy sustained sequential writes, like video surveillance? With sync=always, surprisingly, it is ~20% slower compared to the disk cache disabled. I can only assume this is because ZFS has to constantly tell the disk to flush its cache, which takes time.
As long as the disk queue (the write heads; disks have a number of internal queues and buffers that are all but hidden) has the next write before the previous one is done, sequential writes won't gain much from a large drive cache.
We are seeing more than a 3x speedup just by disabling the on-disk cache! This may be counter-intuitive, but it makes sense if you think about the overhead of flushing the cache after every small-ish IO.
This sounds like the disk is thrashing: the buffer fills, and once full it needs to flush and pause long enough to make room for new writes; it takes a batch to fill the cache, rinse and repeat.
1. Is it safe to enable the on-disk cache? Obviously, if data integrity is at risk, performance is meaningless. And AFAIK the cache on an HDD is volatile, which means data stored in it is gone in the event of a power failure. Given that ZFS is clearly flushing the cache for sync writes, I can only imagine it will also flush the cache before considering a transaction group committed to safe storage. However, I don't have any reference for this.
This is correct, but it is usually deemed an acceptable risk. Keep in mind that the disk cache is not only flushed when full but also at a set time interval. This interval will not be a long (5+ second) period of time, but it all depends on the manufacturer. It's all about balancing risk with performance for an application. This is the precise reason that, when using a SLOG, it is important to use a PLP (Power Loss Protected) drive.
Should one disable the on-disk cache to get better performance with smaller IOs? The oversimplified test above may suggest so. But anyone in this situation (sync=always, small IOs) might be better off getting a proper SLOG. My understanding is that this effectively puts the HDDs into the same situation as sync=disabled, where data gets written to them in transaction groups, and the on-disk cache certainly helps with that.
Generally no. There may be some highly specific cases where there are some gains, but you will almost never see such a fixed IO profile; it's usually much more mixed. As for the SLOG, if you run sync writes and care about performance you will get a SLOG. Even then, sync+SLOG will ALWAYS be slower than the same pool with sync=disabled.
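For completeness, adding a dedicated SLOG device to an existing pool is a one-liner; this is just a sketch, with "tank" and "nvd0" as placeholder names, and as noted above the device should have power loss protection:
Code:
# attach a separate log vdev (SLOG) to the pool named tank
zpool add tank log nvd0
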
3. About increasing the max IO size: setting it larger than 128K should be beneficial in many cases. I believe this is set with MAXPHYS; unfortunately, it cannot be changed short of recompiling the kernel: http://freebsd.1045724.x6.nabble.com/Time-to-increase-MAXPHYS-td6189400.html. I am not going to do it for fear of breaking something, but it should be doable.
FreeBSD is generally regarded as having a solid-performing IO stack. Also, if I remember correctly, ZFS has a max record size of 128k, so that's a perfect match!

If you find any errors in my comments, please point them out and cite sources if you are able. I don't know everything, and my understandings are not facts but interpretations of things that I have read.
 

Ender117

Patron
Hi, thanks for reading my post, although I find I have to disagree for the most part. I could be totally wrong, but these are my logical deductions to the best of my knowledge:

First I would like to say thank you for taking the time to perform these tests and post the results.


It's still not writing to RAM, but it will queue the IO. Sync writes are never acknowledged as done until they are safe on disk (or the SLOG).
OK, maybe I used the wrong terminology, but since the disk can never achieve a speed even close to ~1 GB/s, the data has to be stored in RAM temporarily until the HDD can catch up.

As long as the disk queue (the write heads; disks have a number of internal queues and buffers that are all but hidden) has the next write before the previous one is done, sequential writes won't gain much from a large drive cache.
Now look at the gstat output at the end of each dd test: when sync=disabled, the disk is writing at ~70 MB/s and ~190 MB/s with the disk cache off and on, respectively. So in the latter case it will take longer to fill the same amount of RAM with sustained sequential writes.

This sounds like the disk is thrashing: the buffer fills, and once full it needs to flush and pause long enough to make room for new writes; it takes a batch to fill the cache, rinse and repeat.
I don't think the cache is being filled; again, according to gstat the disk itself is writing at ~40 MB/s, which is about 1/5 of what it is capable of. I lean towards ZFS actively instructing the disk to flush the cache because of the sync writes: https://docs.oracle.com/cd/E26505_01/html/E37386/chapterzfs-6.html
Code:
ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and so on). The completion of this type of flush is waited upon by the application and impacts performance.

This is the Oracle ZFS documentation, but I think it applies to OpenZFS as well.
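
If I'm reading the FreeBSD ZFS tunables right, there is even a sysctl that controls exactly this behavior, which suggests the flush-on-sync logic carried over to OpenZFS. I mention it only as evidence, not as something to change; at its default of 0, flushes are issued:
Code:
# 0 (default) = ZFS issues cache flush commands to the disks
sysctl vfs.zfs.cache_flush_disable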

This is correct, but it is usually deemed an acceptable risk. Keep in mind that the disk cache is not only flushed when full but also at a set time interval. This interval will not be a long (5+ second) period of time, but it all depends on the manufacturer.
You are right about the interval, but I believe ZFS will actively flush the disk cache regardless; see the reference above.

Thanks again for reading; I would like to hear what you think.
 

kdragon75

Wizard
OK, maybe I used the wrong terminology, but since the disk can never achieve a speed even close to ~1 GB/s, the data has to be stored in RAM temporarily until the HDD can catch up.
Considering you did not supply any information about your system, I cannot speculate on expected performance, but I can say that sync writes do not get acknowledged until they are safe on disk (or the SLOG). That's the defining trait of synchronous writes.
Now look at the gstat output at the end of each dd test: when sync=disabled, the disk is writing at ~70 MB/s and ~190 MB/s with the disk cache off and on, respectively. So in the latter case it will take longer to fill the same amount of RAM with sustained sequential writes.
This may have to do with the internal queue being quite short and the cache being there to keep it filled. I don't know enough about the internal IO path/stack of hard drives.
I don't think the cache is being filled
Perhaps not filled, but it could still cause "micro bursting" inside the drive's write path and in turn thrash some internal buffer/queue, causing the reduced performance.
You are right about the interval, but I believe ZFS will actively flush the disk cache regardless; see the reference above.
Yes quite true, especially for sync writes.

I am by no means an expert and only intend to share my interpretations and understandings.
 

Ender117

Patron
Considering you did not supply any information about your system, I cannot speculate on expected performance, but I can say that sync writes do not get acknowledged until they are safe on disk (or the SLOG). That's the defining trait of synchronous writes.
Good catch, I will include my hardware in my sig. But I said writes go to RAM only when interpreting the results from sync=disabled, so I guess we are in agreement here :)

Perhaps not filled, but it could still cause "micro bursting" inside the drive's write path and in turn thrash some internal buffer/queue, causing the reduced performance.
I guess we are ultimately saying the same thing: since ZFS is asking for acknowledgement that the data is on the platter, the disk cannot use its cache to queue writes as efficiently.
 