Need to quadruple or better my ZIL performance ...

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
Good evening TrueNAS aficionados,

This may have been approached from another angle, but if so I didn't see it.

I have TrueNAS 12 configured on a chassis with an E5 processor, 128 GB RAM, 9 SAS drives in a RAIDZ3, and a 500 GB SSD for the ZIL (SLOG), backing a production virtual machine cluster on a dedicated 10 Gbit storage network.

With one VM, I can push about 1.3 GB/second across the wire ... for about 4.5 seconds, then the VM crashes with a drive fault.

As near as I can tell, I am pushing data across the wire about three times as fast as the ZIL SSD can absorb it, so the RAM cache fills up as the ZIL operations queue up, and the VM chokes. I observe the drive spiking to 100% busy under that load (no kidding).

What are some options for speeding up the ZIL? I aim to eventually hit 40 Gbit throughput, but even a moderate (4x) increase in ZIL performance, to 2 GB/s for sustained writes, would be great.

I have seen some SSDs that advertise write speeds up to 3 GB/s. Would this give me the sustained write throughput that I am looking for? Or are there other bottlenecks I need to be looking at first?
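
For reference, here is the rough back-of-the-envelope math behind my "three times as fast" claim, as a quick Python sketch. The ~0.5 GB/s sustained figure for the current SATA SSD is my assumption, not a measured number:

```python
# Rough sanity check: wire speed vs. assumed SLOG SSD write speed.
# The 0.5 GB/s sustained write figure for the 500 GB SATA SSD is an
# assumption, not a measurement.

wire_rate_gbs = 1.3      # GB/s observed across the 10 Gbit link
slog_rate_gbs = 0.5      # GB/s assumed sustained write speed of the SLOG SSD
burst_seconds = 4.5      # how long the VM survives before faulting

backlog_gb = (wire_rate_gbs - slog_rate_gbs) * burst_seconds
print(f"Wire is {wire_rate_gbs / slog_rate_gbs:.1f}x faster than the SLOG")
print(f"Backlog after a {burst_seconds}s burst: ~{backlog_gb:.1f} GB queued in RAM")
```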
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
SLOG won't solve your issue - the underlying disks have to be able to absorb the writes, which I am certain can't happen right now (40 Gbps is about 5 GB/s, written to 9 spindles ...).

Haven't had my coffee yet, but I'm 99% certain that the 4.5 seconds is the length of your transaction group, which then has to be flushed to disk.

You can use gstat to see how busy your disks are.
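
Rough numbers, as a quick Python sketch - the ~150-200 MB/s per-spindle sequential figure is just a ballpark for SAS spinners, not a measurement:

```python
# Back-of-the-envelope: what each data drive would have to sustain to absorb
# a given wire rate. Assumes RAIDZ3 leaves 6 of the 9 drives carrying data
# and ~150-200 MB/s sequential per spindle (ballpark, not measured).

def required_per_data_drive_mb_s(wire_gbps, data_drives):
    """MB/s each data drive must sustain to absorb a given wire rate."""
    wire_mb_s = wire_gbps * 1000 / 8   # Gbit/s -> MB/s (decimal units)
    return wire_mb_s / data_drives

data_drives = 9 - 3                    # RAIDZ3: three drives' worth of parity
for target_gbps in (10, 40):
    need = required_per_data_drive_mb_s(target_gbps, data_drives)
    print(f"{target_gbps} Gbps -> ~{need:.0f} MB/s per data drive, "
          f"vs. ~150-200 MB/s a spindle can do sequentially")
```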
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
9 drives in a single RAID-Z3 vdev will only support about 500-1000 IOPS.
A typical VM cluster has I/Os that are 32K-64K in size on average.
So the system is only capable of about 16-64 MB/s.
The laws of physics are against you meeting the goal.
4 mirrors would give you about 6x more performance.
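
The arithmetic behind those numbers, as a quick Python sketch - the IOPS figures are rough planning estimates, not measurements:

```python
# Throughput estimate: random IOPS of the vdev layout times the average I/O size.
# The IOPS figures are rough planning estimates, not measurements.

def throughput_mb_s(iops, io_size_kb):
    return iops * io_size_kb / 1024          # KB/s -> MB/s

# A single 9-drive RAID-Z3 vdev: roughly 500-1000 random IOPS.
for iops in (500, 1000):
    for io_kb in (32, 64):
        print(f"RAID-Z3 @ {iops} IOPS, {io_kb}K I/O -> "
              f"~{throughput_mb_s(iops, io_kb):.0f} MB/s")

# Four mirror vdevs: call it ~6x the random performance of the single RAID-Z3 vdev.
low = throughput_mb_s(500 * 6, 32)
high = throughput_mb_s(1000 * 6, 64)
print(f"4 mirrors (~6x): roughly {low:.0f} - {high:.0f} MB/s")
```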
 

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
Good info! So the sustained 600 IOPS with occasional spikes to 800 that I have observed is actually the max I can expect on this system as built, due to the spindle layout. A faster ZIL/SLOG might improve the short-term surge capability, but would not affect the sustainable IOPS.

A refactoring to 4 mirrors would potentially raise my sustainable throughput to 128-256 MB/s, based on actual current system performance. That's a non-trivial refactor, but worth understanding.
 

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
Based on your feedback so far, a little bit of internet sleuthing, and a high-level understanding of my particular business need:

For optimum performance with the application software, there would be a general-purpose vdev (the current spindle set) and four high-speed working vdevs, which can be wiped and reloaded periodically.

My thought for refactoring the configuration is, for each high-speed vdev, to add an NVMe card and two NVMe drives (2,500+ MB/s, ??0,000 IOPS) as mirrors. The additional hardware here is a lot cheaper than my team's time.

Would this configuration achieve my goals?

If so, the next question will need to be posted in the hardware section.
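
For what it's worth, my back-of-the-envelope math on the NVMe option, as a Python sketch - the per-drive write speed is a spec-sheet number, and I'm assuming a mirror sustains roughly one drive's write speed since both copies have to be written:

```python
# Does an NVMe mirror vdev meet the ~2 GB/s sustained write target?
# Assumptions: ~2.5 GB/s sequential write per drive (spec-sheet figure; real
# sustained writes will be lower), and a mirror writes both copies, so it
# sustains roughly the speed of a single drive rather than the sum.

per_drive_gbs = 2.5            # GB/s advertised sequential write
mirror_vdevs = 4               # planned high-speed mirror vdevs

per_mirror_gbs = per_drive_gbs                 # mirror ~ one drive for writes
aggregate_gbs = per_mirror_gbs * mirror_vdevs  # ceiling if striped together

print(f"Per mirror vdev: ~{per_mirror_gbs} GB/s vs. the 2 GB/s sustained target")
print(f"All {mirror_vdevs} mirrors together: ~{aggregate_gbs} GB/s theoretical ceiling")
```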
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
SLOG won't solve your issue - the underlying disks have to be able to absorb the writes, which I am certain can't happen right now (40 Gbps is about 5 GB/s, written to 9 spindles ...).

This is the closest thing to the correct answer here; the best-case scenario breaks down like this: 9 drives with RAIDZ3 parity gives you six drives' worth of theoretical bandwidth, with modern drives being able to manage about 2 Gbit/s max on the outer tracks IF all your traffic is sequential. There is literally no way for your setup to be able to write more than 12 Gbit/s in the current design. I don't know where @morganL gets those optimistic "500-1000 IOPS" numbers. I'd give it 300-400, possibly faster when the pool is young. IOPS scale poorly on RAIDZ.

If you had 8 drives in mirror pairs, the number of potential IOPS increases, but the theoretical maximum write speed drops further, down to about 8 Gbit/s, because you have four vdevs, each able to write at a maximum of 2 Gbit/s. However, each vdev is capable of perhaps 150 IOPS, so 150 * 4 = 600 IOPS.

What exactly is your goal? You're really not going to get 40 Gbps with hard disks without a massive number of disks. However, if you don't have an actual need for SLOG, you should be able to get very good speeds with modern NVMe SSDs. Sync writes are always slower than async writes, so if you really want fast, contemplate if you can stage stuff to a fast NVMe pool without SLOG, and then migrate stuff to slower disk as it isn't needed.
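
To put the two layouts side by side, a quick Python sketch using the same best-case ballparks as above (~2 Gbit/s per drive sequential, and rough per-vdev IOPS guesses):

```python
# Best-case sequential ceilings and rough random IOPS for the two layouts.
# The ~2 Gbit/s per drive figure is an outer-track, purely sequential ballpark.

per_drive_gbit = 2

# 9-wide RAIDZ3: six drives' worth of data bandwidth, poor random IOPS.
raidz3_gbit = 6 * per_drive_gbit
raidz3_iops = "300-400"                 # rough guess for the whole vdev

# 8 drives as four mirror pairs: four vdevs' worth of bandwidth and IOPS.
mirror_gbit = 4 * per_drive_gbit
mirror_iops = 4 * 150                   # ~150 random IOPS per mirror vdev

print(f"RAIDZ3, 9 drives: ~{raidz3_gbit} Gbit/s sequential, ~{raidz3_iops} IOPS")
print(f"4 mirrors, 8 drives: ~{mirror_gbit} Gbit/s sequential, ~{mirror_iops} IOPS")
```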
 

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
snip ...

if you really want fast, contemplate if you can stage stuff to a fast NVMe pool without SLOG, and then migrate stuff to slower disk as it isn't needed.

Thanks for the confirmation.

I had come to pretty much the same conclusion. My latest post pretty much outlines the implementation details I am considering (won't repost it). Not everything needs insanely fast I/O, only specific database replatforming processes. 40 Gbit is a long-term goal; not hanging the VM (apparently a KVM bug) on 10 Gbit bursts of more than a few seconds is the short-term need.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I don't know where @morganL gets those optimistic "500-1000 IOPS" numbers. I'd give it 300-400, possibly faster when the pool is young. IOPS scale poorly on RAIDZ.

I was assuming some caching benefits in a virtualization workload (typ. 32K I/O size). The actual measured performance of 600-800 IOPS probably includes some cached IOPS (ARC) and some ZFS write aggregation.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I was assuming some caching benefits in a virtualization workload (typ. 32K I/O size). The actual measured performance of 600-800 IOPS probably includes some cached IOPS (ARC) and some ZFS write aggregation.

We're talking about the hypervisor write workload. I'm not aware of any way in which ARC or L2ARC would do much for you there, and datastore storage is also one of the most pessimistic write environments due to fragmentation effects, so "ZFS write aggregation" won't buy you much either.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
We're talking about the hypervisor write workload. I'm not aware of any way in which ARC or L2ARC would do much for you there, and datastore storage is also one of the most pessimistic write environments due to fragmentation effects, so "ZFS write aggregation" won't buy you much either.

I was talking about general VM I/O performance, including reads. However, we typically see write aggregation of 3-4x in VMware environments.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
A typical VM cluster has I/Os that are 32K-64K in size on average.

I assume you're talking about writes to the underlying ZFS pool after aggregation here, not writes from the perspective of the client or even the SLOG? General-purpose remote VM workloads over NFS/iSCSI that I've observed are almost exclusively 4K/8K in size, with VMware's svMotion clocking in at 64K chunks where possible.

ZFS can consolidate them into larger records (and also apply compression) but you don't want to make them too much larger or you end up with read-modify-write behaviour if you have to update a smaller chunk of that record. 32K zvols have been the Goldilocks value in my experience, treading the fine line between "big enough to compress" and "small enough to deliver good random performance."
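
As a rough illustration of that trade-off, here is a Python sketch of the read-modify-write cost of a small guest write landing inside a larger zvol block - it assumes the block isn't already cached and ignores compression, so it's a worst-case picture:

```python
# Worst-case read-modify-write cost when a 4K guest write lands inside a
# larger zvol block: ZFS reads the whole block, modifies 4K of it, and writes
# the whole block back. Assumes no cache hit and no compression.

guest_write_kb = 4

for volblocksize_kb in (8, 16, 32, 64, 128):
    read_kb = volblocksize_kb             # read the existing block
    write_kb = volblocksize_kb            # write the updated block
    amplification = (read_kb + write_kb) / guest_write_kb
    print(f"{volblocksize_kb:>3}K volblocksize: {read_kb}K read + {write_kb}K write "
          f"for a {guest_write_kb}K update (~{amplification:.0f}x amplification)")
```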
 