Need to quadruple or better my ZIL performance ...

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
Good evening TrueNAS aficionados,

This may have been approached from another angle, but if so I didn't see it.

I have TrueNAS 12 configured on a chassis with an E5 processor, 128 GB RAM, 9 SAS drives in a RAIDZ3, and a 500 GB SSD for the ZIL (SLOG), backing a production virtual machine cluster on a dedicated 10 Gbit storage network.

With one VM, I can push about 1.3 GB/second across the wire ... for about 4.5 seconds, then the VM crashes with a drive fault.

As near as I can tell, I am pushing data across the wire about three times as fast as the ZIL SSD can absorb it, so the RAM cache fills up as the ZIL operations queue up, and the VM chokes. I observe the drive spiking to 100% busy under that load (no kidding).

What are some options for speeding up the ZIL? I aim to eventually hit 40 Gbit throughput, but even a moderate (4x) increase in ZIL performance, to 2 GB/s for sustained writes, would be great.

I have seen some SSDs that advertise write speeds up to 3 GB/s. Would this give me the sustained write throughput that I am looking for? Or are there other bottlenecks I need to be looking at first?
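
For reference, here is the rough back-of-the-envelope math behind my "three times as fast" claim, as a quick Python sketch. The ~0.5 GB/s sustained figure for the current SATA SSD is my assumption, not a measured number:

```python
# Rough sanity check: wire speed vs. assumed SLOG SSD write speed.
# The 0.5 GB/s sustained write figure for the 500 GB SATA SSD is an
# assumption, not a measurement.

wire_rate_gbs = 1.3      # GB/s observed across the 10 Gbit link
slog_rate_gbs = 0.5      # GB/s assumed sustained write speed of the SLOG SSD
burst_seconds = 4.5      # how long the VM survives before faulting

backlog_gb = (wire_rate_gbs - slog_rate_gbs) * burst_seconds
print(f"Wire is {wire_rate_gbs / slog_rate_gbs:.1f}x faster than the SLOG")
print(f"Backlog after a {burst_seconds}s burst: ~{backlog_gb:.1f} GB queued in RAM")
```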
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
SLOG won't solve your issue - the underlying disks have to be able to absorb the writes, which I am certain can't happen right now (40 Gbps is about 5 GB/s, written to 9 spindles ...).

Haven't had my coffee yet, but I'm 99% certain that the 4.5 seconds is the length of your transaction group, which then has to be flushed to disk.

You can use gstat to see how busy your disks are.
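
Rough numbers, as a quick Python sketch - the ~150-200 MB/s per-spindle sequential figure is just a ballpark for SAS spinners, not a measurement:

```python
# Back-of-the-envelope: what each data drive would have to sustain to absorb
# a given wire rate. Assumes RAIDZ3 leaves 6 of the 9 drives carrying data
# and ~150-200 MB/s sequential per spindle (ballpark, not measured).

def required_per_data_drive_mb_s(wire_gbps, data_drives):
    """MB/s each data drive must sustain to absorb a given wire rate."""
    wire_mb_s = wire_gbps * 1000 / 8   # Gbit/s -> MB/s (decimal units)
    return wire_mb_s / data_drives

data_drives = 9 - 3                    # RAIDZ3: three drives' worth of parity
for target_gbps in (10, 40):
    need = required_per_data_drive_mb_s(target_gbps, data_drives)
    print(f"{target_gbps} Gbps -> ~{need:.0f} MB/s per data drive, "
          f"vs. ~150-200 MB/s a spindle can do sequentially")
```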
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
9 drives in a single RAID-Z3 vdev will only support about 500-1000 IOPS.
A typical VM cluster has I/Os that are 32K-64K in size on average.
So the system is only capable of about 16-64 MB/s.
The laws of physics are against you meeting the goal.
4 mirrors would give you about 6x more performance.
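
The arithmetic behind those numbers, as a quick Python sketch - the IOPS figures are rough planning estimates, not measurements:

```python
# Throughput estimate: random IOPS of the vdev layout times the average I/O size.
# The IOPS figures are rough planning estimates, not measurements.

def throughput_mb_s(iops, io_size_kb):
    return iops * io_size_kb / 1024          # KB/s -> MB/s

# A single 9-drive RAID-Z3 vdev: roughly 500-1000 random IOPS.
for iops in (500, 1000):
    for io_kb in (32, 64):
        print(f"RAID-Z3 @ {iops} IOPS, {io_kb}K I/O -> "
              f"~{throughput_mb_s(iops, io_kb):.0f} MB/s")

# Four mirror vdevs: call it ~6x the random performance of the single RAID-Z3 vdev.
low = throughput_mb_s(500 * 6, 32)
high = throughput_mb_s(1000 * 6, 64)
print(f"4 mirrors (~6x): roughly {low:.0f} - {high:.0f} MB/s")
```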
 

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
Good info! So the sustained 600 IOPS with occasional spikes to 800 that I have observed is actually the max I can expect on this system as built, due to the spindle layout. A faster ZIL/SLOG might improve the short-term surge capability, but would not affect the sustainable IOPS.

A refactoring to 4 mirrors would potentially raise my sustainable throughput to 128-256 MB/s, based on actual current system performance. That's a non-trivial refactor, but worth understanding.
 

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
Based on your feedback so far, a little bit of internet sleuthing, and a high-level understanding of my particular business need:

For optimum performance with the application software, there would be a general-purpose vdev (the current spindle set) and four high-speed working vdevs, which can be wiped and reloaded periodically.

My thought for refactoring the configuration is, for each high-speed vdev, to add an NVMe card and two NVMe drives (2,500+ MB/s, ??0,000 IOPS) as mirrors. The additional hardware here is a lot cheaper than my team's time.

Would this configuration achieve my goals?

If so, the next question will need to be posted in the hardware section.
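
For what it's worth, my back-of-the-envelope math on the NVMe option, as a Python sketch - the per-drive write speed is a spec-sheet number, and I'm assuming a mirror sustains roughly one drive's write speed since both copies have to be written:

```python
# Does an NVMe mirror vdev meet the ~2 GB/s sustained write target?
# Assumptions: ~2.5 GB/s sequential write per drive (spec-sheet figure; real
# sustained writes will be lower), and a mirror writes both copies, so it
# sustains roughly the speed of a single drive rather than the sum.

per_drive_gbs = 2.5            # GB/s advertised sequential write
mirror_vdevs = 4               # planned high-speed mirror vdevs

per_mirror_gbs = per_drive_gbs                 # mirror ~ one drive for writes
aggregate_gbs = per_mirror_gbs * mirror_vdevs  # ceiling if striped together

print(f"Per mirror vdev: ~{per_mirror_gbs} GB/s vs. the 2 GB/s sustained target")
print(f"All {mirror_vdevs} mirrors together: ~{aggregate_gbs} GB/s theoretical ceiling")
```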
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
SLOG won't solve your issue - the underlying disks have to be able to absorb the writes, which I am certain can't happen right now (40 Gbps is about 5 GB/s, written to 9 spindles ...).

This is the closest thing to the correct answer here; the best-case scenario breaks down like this: 9 drives with RAIDZ3 parity gives you six drives' worth of theoretical bandwidth, with modern drives being able to manage about 2 Gbit/s max on the outer tracks IF all your traffic is sequential. There is literally no way for your setup to be able to write more than 12 Gbit/s in the current design. I don't know where @morganL gets those optimistic "500-1000 IOPS" numbers. I'd give it 300-400, possibly faster when the pool is young. IOPS scale poorly on RAIDZ.

If you had 8 drives in mirror pairs, the number of potential IOPS increases, but the theoretical maximum write speed drops further, down to about 8 Gbit/s, because you have four vdevs, each able to write at a maximum of 2 Gbit/s. However, each vdev is capable of perhaps 150 IOPS, so 150 * 4 = 600 IOPS.

What exactly is your goal? You're really not going to get 40 Gbps with hard disks without a massive number of disks. However, if you don't have an actual need for SLOG, you should be able to get very good speeds with modern NVMe SSDs. Sync writes are always slower than async writes, so if you really want fast, contemplate if you can stage stuff to a fast NVMe pool without SLOG, and then migrate stuff to slower disk as it isn't needed.
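
To put the two layouts side by side, a quick Python sketch using the same best-case ballparks as above (~2 Gbit/s per drive sequential, and rough per-vdev IOPS guesses):

```python
# Best-case sequential ceilings and rough random IOPS for the two layouts.
# The ~2 Gbit/s per drive figure is an outer-track, purely sequential ballpark.

per_drive_gbit = 2

# 9-wide RAIDZ3: six drives' worth of data bandwidth, poor random IOPS.
raidz3_gbit = 6 * per_drive_gbit
raidz3_iops = "300-400"                 # rough guess for the whole vdev

# 8 drives as four mirror pairs: four vdevs' worth of bandwidth and IOPS.
mirror_gbit = 4 * per_drive_gbit
mirror_iops = 4 * 150                   # ~150 random IOPS per mirror vdev

print(f"RAIDZ3, 9 drives: ~{raidz3_gbit} Gbit/s sequential, ~{raidz3_iops} IOPS")
print(f"4 mirrors, 8 drives: ~{mirror_gbit} Gbit/s sequential, ~{mirror_iops} IOPS")
```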
 

swift99

Cadet
Joined
Nov 10, 2019
Messages
9
snip ...

if you really want fast, contemplate if you can stage stuff to a fast NVMe pool without SLOG, and then migrate stuff to slower disk as it isn't needed.

Thanks for the confirmation.

I had come to pretty much the same conclusion. My latest post pretty much outlines the implementation details I am considering (won't repost it). Not everything needs insanely fast I/O, only specific database replatforming processes. 40 Gbit is a long-term goal; not hanging the VM (apparently a KVM bug) on 10 Gbit bursts of more than a few seconds is the short-term need.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I don't know where @morganL gets those optimistic "500-1000 IOPS" numbers. I'd give it 300-400, possibly faster when the pool is young. IOPS scale poorly on RAIDZ.

I was assuming some caching benefits in a virtualization workload (typ. 32K I/O size). The actual measured performance of 600-800 IOPS probably includes some cached IOPS (ARC) and some ZFS write aggregation.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I was assuming some caching benefits in a virtualization workload (typ. 32K I/O size). The actual measured performance of 600-800 IOPS probably includes some cached IOPS (ARC) and some ZFS write aggregation.

We're talking about the hypervisor write workload. I'm not aware of any way in which ARC or L2ARC would do much for you there, and datastore storage is also one of the most pessimistic write environments due to fragmentation effects, so "ZFS write aggregation" won't buy you much either.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
We're talking about the hypervisor write workload. I'm not aware of any way in which ARC or L2ARC would do much for you there, and datastore storage is also one of the most pessimistic write environments due to fragmentation effects, so "ZFS write aggregation" won't buy you much either.

I was talking about general VM I/O performance, including reads. However, we typically see write aggregation of 3-4x in VMware environments.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
A typical VM cluster has I/Os that are 32K-64K in size on average.

I assume you're talking about writes to the underlying ZFS pool after aggregation here, not writes from the perspective of the client or even the SLOG? General-purpose remote VM workloads over NFS/iSCSI that I've observed are almost exclusively 4K/8K in size, with VMware's svMotion clocking in at 64K chunks where possible.

ZFS can consolidate them into larger records (and also apply compression) but you don't want to make them too much larger or you end up with read-modify-write behaviour if you have to update a smaller chunk of that record. 32K zvols have been the Goldilocks value in my experience, treading the fine line between "big enough to compress" and "small enough to deliver good random performance."
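
As a rough illustration of that trade-off, here is a Python sketch of the read-modify-write cost of a small guest write landing inside a larger zvol block - it assumes the block isn't already cached and ignores compression, so it's a worst-case picture:

```python
# Worst-case read-modify-write cost when a 4K guest write lands inside a
# larger zvol block: ZFS reads the whole block, modifies 4K of it, and writes
# the whole block back. Assumes no cache hit and no compression.

guest_write_kb = 4

for volblocksize_kb in (8, 16, 32, 64, 128):
    read_kb = volblocksize_kb             # read the existing block
    write_kb = volblocksize_kb            # write the updated block
    amplification = (read_kb + write_kb) / guest_write_kb
    print(f"{volblocksize_kb:>3}K volblocksize: {read_kb}K read + {write_kb}K write "
          f"for a {guest_write_kb}K update (~{amplification:.0f}x amplification)")
```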
 