right Optane configuration to get superb IOPS for ZIL

PaxonSK

Cadet
Joined
Jan 24, 2019
Messages
4
Hi,

I have an HP system with two Optane 900p drives for a mirrored ZIL (SLOG), but I always get at most 5k-10k IOPS with 4k block size sync writes - tested every time with fio.

If I test writes directly to nvd0 or to the slice nvd0p1, I get more than 200k IOPS, but when I run the write test on UFS, or on the drives as mirrored ZIL devices (whole device or a separate slice), I still don't get more than 5k-10k IOPS with sync writes.

Partitions aligned to 128 x 512 B:

# gpart show nvd0
=>        40  547002208  nvd0  GPT  (261G)
          40         88        - free -  (44K)
         128    4194304     1  freebsd-swap  (2.0G)
     4194432  542807808     2  freebsd-zfs  (259G)
   547002240          8        - free -  (4.0K)
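
For reference, a minimal sketch of how a layout like this can be created (assuming a blank scratch device; these are not necessarily the exact commands used on this box, and gpart destroy wipes the disk):

# gpart destroy -F nvd0
# gpart create -s gpt nvd0
# gpart add -a 64k -s 2G -t freebsd-swap nvd0      (64 KiB alignment matches the 128-sector start shown above)
# gpart add -a 64k -t freebsd-zfs nvd0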


Where could the problem be?

FreeNAS-11.1-U7, 512 GB RAM, 4x Intel(R) Xeon(R) CPU X7560 @ 2.27 GHz, HP DL580 G7 (yes, I know it only has PCIe v2, but even so, fewer than 10k sync-write IOPS from an Optane is a terrible result for this configuration).
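
Since the PCIe v2 slots are a concern, one quick sanity check (a sketch; exact device names will differ per system) is to confirm that both controllers are visible and what link each Optane actually negotiated:

# nvmecontrol devlist
# pciconf -lvc | grep -A6 nvme      (check the PCI-Express capability: link speed and width)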


Thank you
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Perhaps I overlooked it, but I don't see anything about the disk configuration that backs this up.
but when I run the write test on UFS
How are you getting UFS in FreeNAS?
Please give your full system configuration; you can find some guidance on what we are looking for in this post:

Updated Forum Rules 12/5/18
https://forums.freenas.org/index.php?threads/updated-forum-rules-12-5-18.45124/

Based on your use of terms (ZIL, for example), I think it might help to ensure we are all using the same words if you review these guides:

Slideshow explaining VDev, zpool, ZIL and L2ARC
https://forums.freenas.org/index.ph...ning-vdev-zpool-zil-and-l2arc-for-noobs.7775/

Terminology and Abbreviations Primer
https://forums.freenas.org/index.php?threads/terminology-and-abbreviations-primer.28174/
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Something a lot of people misunderstand is that you are not creating some kind of cache between the write and the disk pool. You are still writing to disk, so the speed of the disk pool matters.
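
One way to see this for yourself (just a sketch of a diagnostic using stock FreeBSD tools) is to watch the physical providers while a sync-write test runs; the log devices take the ZIL traffic, but the data disks still receive the full write stream at every transaction group commit:

# gstat -p -I 1s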
 

PaxonSK

Cadet
Joined
Jan 24, 2019
Messages
4
Hi,

sorry for the fast & direct question :) so, again:

I have been working with storage systems for a long time; ZFS is a challenge, and these days it is not only for high-priced enterprise gear ;)

All the tests I run use the fio tool, changing the parameters shown in bold:
$ fio --filename=test_fio_file --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-te --size=100M
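
The per-vdev numbers quoted below come from watching the pool with zpool iostat while each run is going (the one-second interval here is just an example; pool names are local_nvme and sas1_pool):

# zpool iostat -v local_nvme 1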

How are you getting UFS in FreeNAS?
I know FreeNAS has a nice GUI, but if you want to do a deeper engineering investigation, the console is a must ;) and the newfs command is still part of FreeNAS, with UFS support.

I ran the tests on:
- a mirror pool with only the 2x Optane 900p 280 GB drives - ashift 12 and 13 (see the ashift check below)
- 2x striped RAIDZ2 of 12x 4 TB HDDs (6x Toshiba, 6x HGST) over an LSI SAS2308 in IT mode connected to an HP D2600, with a mirrored Optane SLOG - ashift 12 and 13
- a pool of mirrors (6x 2-way) of the same 12x 4 TB HDDs (6x Toshiba, 6x HGST) over the LSI SAS2308 in IT mode connected to the HP D2600, with a mirrored Optane SLOG - ashift 12 and 13
- UFS directly on an Optane (newfs & mount, no problem - TRIM disabled/enabled with tunefs, mounted with/without noatime, filesystem with 4/8/128 kB block size)
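
The effective ashift of each test pool can be confirmed like this (a sketch; /data/zfs/zpool.cache is the usual FreeNAS cachefile location, adjust if yours differs):

# zdb -U /data/zfs/zpool.cache -C local_nvme | grep ashift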

Optane partitions are aligned to 128 x 512 B, see the first post.

On the mirrored Optane pool: --direct=1 --sync=1 --rw=write --bs=4k --numjobs=100
Jobs: 100 (f=100): [W(100)][16.4%][r=0KiB/s,w=16.6MiB/s][r=0,w=4259 IOPS][eta 00m:00s]

On the mirrored Optane pool: --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1
Jobs: 1 (f=1): [W(1)][18.0%][r=0KiB/s,w=20.0MiB/s][r=0,w=5370 IOPS][eta 00m:00s]

Every time, zpool iostat shows at most 7-8k IOPS:

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
local_nvme 1.07G 257G 0 6.79K 0 113M
mirror 1.07G 257G 0 6.79K 0 113M
gptid/51ddec4d-3460-11e9-926a-6805ca8ce59a - - 0 6.79K 0 113M
gptid/52ff0cf0-3460-11e9-926a-6805ca8ce59a - - 0 6.79K 0 113M

Based on your use of terms (ZIL, for example), I think it might help to ensure we are all using the same words if you review these guides:
I am only looking at the SLOG and good sync writes; async writes are handled perfectly, see the end of this post.
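
As a further isolation step (a diagnostic sketch only - do not leave this set on data you care about; the dataset name here is just an example), temporarily disabling sync on a test dataset shows whether the ceiling really is the sync/ZIL path:

# zfs get sync,logbias local_nvme/test
# zfs set sync=disabled local_nvme/test     (re-run fio: if IOPS jump, the limit is the sync path, not the pool)
# zfs set sync=standard local_nvme/test     (put it back)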

On the 2x striped RAIDZ2 pool with a mirrored NVMe SLOG:
--direct=1 --sync=1 --rw=write --bs=4k --numjobs=100
Jobs: 100 (f=100): [W(100)][31.1%][r=0KiB/s,w=21.1MiB/s][r=0,w=5403 IOPS][eta 00m:00s]

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
sas1_pool 67.8G 43.4T 0 2.97K 0 119M
raidz2 33.9G 21.7T 0 0 0 0
gptid/2e8da3f4-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/30400089-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/32b78c12-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/3b7accf8-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/3d52a37e-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/4054cdec-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
raidz2 33.9G 21.7T 0 0 0 0
gptid/349e8c21-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/36e6f8b0-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/388513fa-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/42451792-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/450d9e70-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/4706ff42-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
logs - - - - - -
mirror 1.10G 257G 0 2.96K 0 118M
gptid/a3618eaf-3463-11e9-926a-6805ca8ce59a - - 0 2.96K 0 118M
gptid/a45f1ba2-3463-11e9-926a-6805ca8ce59a - - 0 2.96K 0 118M
-------------------------------------- ----- ----- ----- ----- ----- -----


On the 2x striped RAIDZ2 pool with a mirrored NVMe SLOG:
--direct=1 --sync=1 --rw=write --bs=4k --numjobs=1
Jobs: 1 (f=1): [W(1)][32.8%][r=0KiB/s,w=20.0MiB/s][r=0,w=5364 IOPS][eta 00m:00s]

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
sas1_pool 68.0G 43.4T 0 6.06K 0 96.9M
raidz2 34.0G 21.7T 0 0 0 0
gptid/2e8da3f4-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/30400089-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/32b78c12-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/3b7accf8-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/3d52a37e-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/4054cdec-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
raidz2 34.0G 21.7T 0 0 0 0
gptid/349e8c21-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/36e6f8b0-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/388513fa-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/42451792-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/450d9e70-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
gptid/4706ff42-3109-11e9-bff0-6805ca8ce59a - - 0 0 0 0
logs - - - - - -
mirror 505M 258G 0 6.05K 0 96.9M
gptid/a3618eaf-3463-11e9-926a-6805ca8ce59a - - 0 6.05K 0 96.9M
gptid/a45f1ba2-3463-11e9-926a-6805ca8ce59a - - 0 6.05K 0 96.9M
-------------------------------------- ----- ----- ----- ----- ----- -----


Now UFS:

$ newfs -b 4096 gptid/a3618eaf-3463-11e9-926a-6805ca8ce59a
gptid/a3618eaf-3463-11e9-926a-6805ca8ce59a: 265042.9MB (542807808 sectors) block size 4096, fragment size 4096
using 5508 cylinder groups of 48.12MB, 12320 blks, 6160 inodes.

--direct=1 --sync=1 --rw=write --bs=4k --numjobs=1
Jobs: 1 (f=1): [W(1)][34.4%][r=0KiB/s,w=55.9MiB/s][r=0,w=14.3k IOPS][eta 00m:00s]

--direct=1 --sync=1 --rw=write --bs=4k --numjobs=100
Jobs: 100 (f=100): [W(100)][31.1%][r=0KiB/s,w=15.6MiB/s][r=0,w=3984 IOPS][eta 00m:00s]




$ newfs gptid/a3618eaf-3463-11e9-926a-6805ca8ce59a
gptid/a3618eaf-3463-11e9-926a-6805ca8ce59a: 265042.9MB (542807808 sectors) block size 32768, fragment size 4096
using 424 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.


--direct=1 --sync=1 --rw=write --bs=4k --numjobs=100
Jobs: 100 (f=100): [W(100)][20.0%][r=0KiB/s,w=11.6MiB/s][r=0,w=2979 IOPS][eta 00m:00s]

--direct=1 --sync=1 --rw=write --bs=4k --numjobs=1
Jobs: 1 (f=1): [W(1)][19.7%][r=0KiB/s,w=20.0MiB/s][r=0,w=5126 IOPS][eta 00m:00s]


Now, direct writes to the partition:

$ fio --filename=/dev/gptid/a3618eaf-3463-11e9-926a-6805ca8ce59a --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-te --size=100M
Jobs: 1 (f=1): [W(1)][23.3%][r=0KiB/s,w=114MiB/s][r=0,w=29.1k IOPS][eta 00m:00s] >> 29.1k IOPS

Jobs: 10 (f=10): [W(10)][15.0%][r=0KiB/s,w=803MiB/s][r=0,w=206k IOPS][eta 00m:00s] >> 206k IOPS


Jobs: 100 (f=100): [W(100)][26.2%][r=0KiB/s,w=1246MiB/s][r=0,w=319k IOPS][eta 00m:00s] >> 319k IOPS

Yes, I know that UFS, ZFS, the SLOG, or whatever else adds some overhead, but if we compare 100 jobs of direct writes to the device against the SLOG, UFS, or the mirror pool managing only about 5k sync IOPS, it is horrible.
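
For what it is worth, a raw sync-write latency test of the SLOG device itself would separate device latency from the filesystem layers. A sketch (the -S sync-write test of diskinfo is not available on every FreeBSD/FreeNAS release, and -w allows destructive writes, so only run this against an unused device):

# diskinfo -wS /dev/nvd0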

===========================================================
For comparison, async direct writes:

On the mirrored NVMe pool: --direct=1 --sync=0 --rw=write --bs=1M --numjobs=100
Jobs: 100 (f=100): [W(100)][24.6%][r=0KiB/s,w=3516MiB/s][r=0,w=3516 IOPS][eta 00m:00s]

On the mirrored NVMe pool: --direct=1 --sync=0 --rw=write --bs=1M --numjobs=1
Jobs: 1 (f=1): [W(1)][16.7%][r=0KiB/s,w=1801MiB/s][r=0,w=1800 IOPS][eta 00m:00s]

On the 2x striped RAIDZ2 pool without NVMe SLOG: --direct=1 --sync=0 --rw=write --bs=1M --numjobs=1
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2108MiB/s][r=0,w=2108 IOPS][eta 00m:00s] - zpool iostat shows approx. 200 MB/s written to the pool

On the 2x striped RAIDZ2 pool without NVMe SLOG: --direct=1 --sync=0 --rw=write --bs=1M --numjobs=100
Jobs: 100 (f=100): [W(100)][90.0%][r=0KiB/s,w=5757MiB/s][r=0,w=5757 IOPS][eta 00m:00s] - zpool iostat shows approx. 60 MB/s written to the pool - on a 512 GB RAM system, no problem ;)



This async write performance saturates 10 Gb NFS without any problem, which is what we want :)
But sync writes over NFS reach only approx. 2-3k IOPS per connection; multiple NFS connections in total reach at most 6-7k IOPS, sometimes hitting 10k IOPS.
NFS is exported from the ZFS pools (the HDD pool with/without SLOG, or the mirrored NVMe pool).
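
If it helps to rule things out, these are the dataset properties worth confirming on the NFS export (a sketch; sas1_pool/nfs is an example dataset name, not necessarily the real one):

# zfs get sync,logbias,recordsize sas1_pool/nfs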
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hi,

please, some tips?


thank you :)

You're unlikely to get any tips. What you're experimenting with is outside the FreeNAS framework. Most of the guys here are enthusiasts ("hobbyists") and don't have spare gear to mess with. A small number of contributors do work with FreeNAS in an enterprise or commercial environment, but tend to stay on-script. An even smaller number of us do general hacking for commercial, or sometimes entertainment, purposes, but I know I'm not very active here anymore because I have a lot of other things I'm working on, and I don't have any Optane stuff anyways.

I will note that you're off in left field here, because you're focusing on a subsystem that isn't even used in the FreeNAS framework. You were already told that and didn't seem to care. This isn't a great recipe for getting answers.

I haven't looked extensively at Optane and I don't have a real good feel for what the latency and parallelization behaviours are like; your question feels strongly related.

You have to remember that the POSIX sync write requires a guarantee that the data has been committed to stable storage. This can be actually written to disk, or to an intermediate cache of some sort, but once written, the hardware and operating system are guaranteeing that it will be retrievable in the written format even under adverse conditions such as power loss. This is inherently going to be a hell of a lot slower, meaning lots fewer IOPS, than if you just queue up write commands without sync.
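
If you want to see that guarantee from the benchmark side: fio can also issue an explicit flush after every write instead of opening the file with O_SYNC, and both paths end up waiting on stable storage (a sketch, reusing the parameters quoted earlier in the thread):

$ fio --filename=test_fio_file --fsync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-te --size=100M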

Many people are stunned that their "capable of a billion IOPS" device works out to a few thousand (or even just high hundreds) in practice, but there are so many layers to go through.

https://forums.freenas.org/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

Relevant:

Laaaaaaaaatency. Low is better.

The SLOG is all about latency. The "sync" part of "sync write" means that the client is requesting that the current data block be confirmed as written to disk before the write() system call returns to the client. Without sync writes, a client is free to just stack up a bunch of write requests and then they can send over a slowish channel, and they arrive when they can. Look at the layers:

Client initiates a write syscall
Client filesystem processes request
Filesystem hands this off to the network stack as NFS or iSCSI
Network stack hands this packet off to network silicon
Silicon transmits to switch
Switch transmits to NAS network silicon
NAS network silicon throws an interrupt
NAS network stack processes packet
Kernel identifies this as a NFS or iSCSI request and passes to appropriate kernel thread
Kernel thread passes request off to ZFS
ZFS sees "sync request", sees an available SLOG device
ZFS pushes the request to the SAS device driver
Device driver pushes to LSI SAS silicon
LSI SAS chipset serializes the request and passes it over the SAS topology
SAS or SATA SSD deserializes the request
SSD controller processes the request and queues for commit to flash
SSD controller confirms request
SSD serializes the response and passes it back over the SAS topology
LSI SAS chipset receives the response and throws an interrupt
SAS device driver gets the acknowledgment and passes it up to ZFS
ZFS passes acknowledgement back to kernel NFS/iSCSI thread
NFS/iSCSI thread generates an acknowledgement packet and passes it to the network silicon
NAS network silicon transmits to switch
Switch transmits to client network silicon
Client network silicon throws an interrupt
Client network stack receives acknowledgement packet and hands it off to filesystem
Filesystem says "yay, finally, what took you so long" and releases the syscall, allowing the client program to move on.

That's what happens for EACH sync write request. So on a NAS there's not a hell of a lot you can do to make this all better. However, you CAN do things like substituting in low-latency NVMe in place of SAS, and upgrade from gigabit to ten gigabit ethernet.
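
To put a rough number on it (illustrative figures, not measurements from this system): if that whole round trip costs on the order of 150 microseconds per sync write, a single outstanding writer cannot exceed about 1 s / 150 us = roughly 6,700 IOPS, no matter how fast the SLOG device is in isolation. The only ways up are cutting per-hop latency or keeping many sync writes in flight at once.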
 

PaxonSK

Cadet
Joined
Jan 24, 2019
Messages
4
You're unlikely to get any tips. What you're experimenting with is outside the FreeNAS framework. Most of the guys here are enthusiasts ("hobbyists") and don't have spare gear to mess with. A small number of contributors do work with FreeNAS in an enterprise or commercial environment, but tend to stay on-script. An even smaller number of us do general hacking for commercial, or sometimes entertainment, purposes, but I know I'm not very active here any more because I have a lot of other things I'm working on, and I don't have any Optane stuff anyways.

I will note that you're off in left field here, because you're focusing on a subsystem that isn't even used in the FreeNAS framework. You were already told that and didn't seem to care. This isn't a great recipe for getting answers.

I haven't looked extensively at Optane and I don't have a real good feel for what the latency and parallelization behaviours are like; your question feels strongly related.

This explains the whole problem, and okay :) I am not skilled enough to dig deeper into FreeBSD and change this behaviour, or to find where it falls down, at the developer/engineering level.

For comparison, on Linux with the same two Optanes, ext4 gets 4k sync writes at 60k IOPS per write job (single Optane), but on the same machine a mirrored Optane ZFS pool (ZoL 0.7.8) gets at most 15k IOPS.

So optimizing ZFS to get higher IOPS with Optane and similar hardware is what is needed here. But still, thank you ;)
 