ZFS Breathing/Write Stalls

Status
Not open for further replies.

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
As the title suggests, this post is about the ZFS breathing/write stalls many of us have seen. I was experiencing it myself, though to a lesser degree than some. I decided to run some benchmarks and see what I could come up with.

My current system is:
FreeNAS-8.2.0-BETA3-x64
ASUS F1A75-V Pro
AMD A6-3500 Llano CPU
8GB DDR3 RAM
2 x Seagate ST2000DM001 2TB
Intel 9301 CT NIC


Relevant Tuneables:
vfs.zfs.arc_max: 4539433088 (generated by autotune)
vm.kmem_size: 4490539929 (generated by autotune)
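If you want to confirm what autotune actually applied on a running box, the live values can be read back from the shell (read-only; the names match the tunables above):

```shell
# Check the currently active ARC and kernel memory limits.
sysctl vfs.zfs.arc_max
sysctl vm.kmem_size
```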


Client:
Win XP SP3 FileZilla 3.5.3
6.12GB, 10GB, 15.3GB, 50GB files


I'm using FTP to benchmark as I get very consistent and fast transfers. I rebooted after every new array setup. I began by blowing away my mirror and creating a UFS mirror as a control. UFS performed as I knew it would.

NOTE: I'm using Bandwidth Monitor to create the graphs. It has an inaccuracy in that you get a false dip as often as every 35 or so seconds. If you see a small blue bar at the very bottom, ignore that dip. It didn't actually happen.

NOTE2: I know the graphs are too tall, but I'm not redoing them.

I then proceeded to create a ZFS stripe. It appeared to manage well with an empty pool and the auto-tuned settings for an 8 GB NAS.



One thing to keep in mind is that an empty pool performs better than one that's been in use for a while.

I proceeded to recreate my ZFS mirror and ran into the above. I then loaded up the mirror with 300+ GB of data and transferred & deleted things over a few days. My problem returned, though not as severe as I remembered.




Finally, on to some tuning. I first set vfs.zfs.txg.timeout to a value of 5, the default as of ZFS v28. Very noticeable improvements with my setup.
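In FreeNAS this kind of tunable is normally added through the GUI (System -> Tunables/Sysctls); from a shell it would look something like this (sketch only; the sysctl name is the one discussed above, the rest is standard FreeBSD practice):

```shell
# Set the txg sync interval to 5 seconds for the running system.
sysctl vfs.zfs.txg.timeout=5

# To persist it across reboots outside the GUI, the equivalent line
# would go in /etc/sysctl.conf:
#   vfs.zfs.txg.timeout=5
```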


This is a CIFS transfer, by the way.

I then calculated the max raw write speed of my faster drive as 188 MB/s, which is why I set vfs.zfs.txg.write_limit_override to 197132288.
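The conversion from 188 MB/s to that tunable value is just 188 expressed in bytes, treating MB as MiB:

```shell
# 188 MB/s drive speed expressed in bytes (MB treated as MiB,
# matching the value used above: 188 * 1024 * 1024 = 197132288).
limit=$((188 * 1024 * 1024))
echo "$limit"
```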



Looks like around 80MB/s is all the mirror can sustain with that value.

I next tried 2x 188 MB/s = 394264576, which just happens to be about 3x 120 MB/s.



I then tested vfs.zfs.txg.write_limit_override at 1073741824, often recommended for 8GB systems, and at 591396864, 3x 188 MB/s. Both are too high for my system and I get too much latency during ZFS write flushing.

I'm currently running at 492830720, 2.5x.
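For reference, all the write_limit_override values tried above derive from the same 188 MiB base:

```shell
# write_limit_override candidates as multiples of the 188 MiB base value.
base=$((188 * 1024 * 1024))        # 1.0x = 197132288
twox=$((base * 2))                 # 2.0x = 394264576
twofivex=$((base * 5 / 2))         # 2.5x = 492830720
threex=$((base * 3))               # 3.0x = 591396864
echo "$twox $twofivex $threex"
```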




492830720 seems a great match for the mirror I'm running. I think it's slightly better than 394264576, but I didn't run with that value long enough to be sure.

Finally, I reran the dd benchmark with my current settings and it came out to 120 MB/s. I've definitely tuned my NAS for something, and it's never run better, but in the end I'm not entirely sure why. :confused:
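For anyone wanting to repeat that kind of dd test, a minimal version looks something like this (the output path and sizes here are placeholders, not the exact command used above; for a real benchmark, point `of=` at a file on the pool and use a file several times your RAM size so ARC caching doesn't skew the result):

```shell
# Sequential write test with dd. bs=1048576 is 1 MiB; count=64 keeps this
# example small -- scale count up well past RAM size for a meaningful run.
dd if=/dev/zero of=/tmp/zfs_dd_test bs=1048576 count=64
```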

The 1073741824 for 8GB systems never made sense to me; with my limited understanding, the write performance of your zpool should be the limiting factor, though network speed plays its part.

I'm interested if anyone else has tried any ZFS tuning and what the results were.
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I did some playing around and kept a spreadsheet of the values and results, but in the end it got too confusing and didn't make any sense. I have quite a long tutorial I started writing but haven't posted because I need to double-check things, but I also found that vfs.zfs.txg.write_limit_override at 1073741824 wasn't optimal. I'd need to log in to my NAS to check what I have currently.

Anyway, nice write up, thanks for taking time to post all of that!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You're not likely to make any meaningful sense of it. The underlying issue is that ZFS appears to use a poor strategy for the txg mechanism. To make it brief, you can maybe find a write_limit_override that works pretty well for writing sequential files on your particular hardware. This is mainly a function of number of drives, RAID level, and drive write speeds. For a four-drive array in RAIDZ2, it appears that the maximum sequential write speed typically winds up being about a quarter to a half of the maximum write speed of the individual components. HOWEVER! Throw any other traffic in the mix, and you totally hose it. That's because disks take substantially longer to seek and write than they do to just write. So calculating a limit based on just one special case write speed is broken.

What it appears that ZFS really ought to base the txg size on is the number of discrete zones being written to, where "zone" is defined as the locality on a disk that incurs no seek penalty. That may also not be completely ideal, but it'd be a darn sight better than what exists now. You can really screw with a ZFS pool by seeking randomly and writing 512-byte blocks.

There will not be a magic write_limit_override that works well in general. Setting it very low severely reduces ZFS's blocking-during-txg-flush, but also severely reduces throughput.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
To make it brief, you can maybe find a write_limit_override that works pretty well for writing sequential files on your particular hardware. This is mainly a function of number of drives, RAID level, and drive write speeds.
Hmm, yes this is what I was thinking it might be.

For a four-drive array in RAIDZ2, it appears that the maximum sequential write speed typically winds up being about a quarter to a half of the maximum write speed of the individual components.
And that's much worse than I would have thought.

What it appears that ZFS really ought to base the txg size on is the number of discrete zones being written to, where "zone" is defined as the locality on a disk that incurs no seek penalty. That may also not be completely ideal, but it'd be a darn sight better than what exists now. You can really screw with a ZFS pool by seeking randomly and writing 512-byte blocks.
I half remember an article that talked about not filling a zone 100%, as that reduced performance when it was almost full, which again was caused by the seeking, if I remember right. A new ZFS version is the only real way to address the problem.

There will not be a magic write_limit_override that works well in general. Setting it very low severely reduces ZFS's blocking-during-txg-flush, but also severely reduces throughput.
I'm happier with what I arrived at than what I started with. It hasn't seemed to reduce my throughput much if any. It no doubt helps that I have a small number of clients and I'm not using FreeNAS as backend storage for VMs.

Thanks for the insight jgreco.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And that's much worse than I would have thought.

That's really not too awful if you consider the typical performance loss of a conventional RAID 5 setup. The additional resiliency of the setup makes it a fairly reasonable price to pay.

I half remember an article that talked about not filling a zone 100%, as that reduced performance when it was almost full, which again was caused by the seeking, if I remember right. A new ZFS version is the only real way to address the problem.

I'm not sure what you saw, but ZFS performance degrades substantially (as does UFS/FFS) as the pool reaches capacity. That's not what I was talking about. The point I was making is that any operation that requires a drive to seek SUBSTANTIALLY reduces the I/O capacity of a drive. There is almost no difference in speed between reading one disk block and a thousand contiguous disk blocks. However, reading three randomly placed disk blocks will (on average) take a huge amount of time.
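A rough back-of-the-envelope run illustrates the gap (the 10 ms seek and 150 MB/s sequential rate are assumed figures typical of a 7200 RPM drive, not measurements from this thread):

```shell
# Time to read 1000 x 4 KiB blocks, contiguous vs. randomly placed.
# Assumed: 150 MB/s sequential throughput, 10 ms average seek+rotation.
seq_ms=$(( 1000 * 4096 * 1000 / 150000000 ))   # ~27 ms, initial seek ignored
rand_ms=$(( 1000 * 10 ))                       # 10000 ms: one seek per block
echo "${seq_ms} ${rand_ms}"
```

Roughly 27 ms versus 10 seconds for the same amount of data, which is why a byte-count-based flush limit breaks down under random I/O.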

The ZFS mechanism, as far as I can tell, flushes based on the amount of data written, so if you write a thousand random blocks, that's going to cause a problem. I haven't determined the exact nature of this problem just yet. In theory, with ZFS being a copy-on-write filesystem, the actual writes might get streamed out to disk relatively sequentially, but if so, then the file's been fragmented so that read speeds will be impacted. Otherwise you're hosed during writes. Either way it may be suboptimal.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
That's really not too awful if you consider the typical performance loss of a conventional RAID 5 setup. The additional resiliency of the setup makes it a fairly reasonable price to pay.
I hadn't thought about it in that light comparing to conventional RAID 5.

I'm not sure what you saw, but ZFS performance degrades substantially (as does UFS/FFS) as the pool reaches capacity. That's not what I was talking about.
It's not what I was talking about either. The term is metaslab, not zone, which helps when you're searching for it because you didn't bookmark it. This is the article I was thinking about. The performance improvement came from "more write aggregation".
 