Help getting the most out of NVMe SLOG

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I'm at a loss to work out why I can't get good SLOG write performance with an Intel Optane 900p. I have no tunables set.

SMB dataset, sync off. Great results.
[Attachment: Sync_Off.PNG]

SMB dataset, sync on. Not good.
[Attachment: Sync_On.PNG]

gstat backs this up. The Optane is running in an x8 PCIe 3.0 slot at x4 speed, direct to the CPU rather than through the PCH. I have tried a different x8 slot with no change.

Code:
root@nas1:~ # diskinfo -wS /dev/nvd0
/dev/nvd0
        512             # sectorsize
        280065171456    # mediasize in bytes (261G)
        547002288       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        INTEL SSDPED1D280GA     # Disk descr.
        PHMB742301K1280CGN      # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
         0.5 kbytes:     16.7 usec/IO =     29.2 Mbytes/s
           1 kbytes:     16.8 usec/IO =     58.3 Mbytes/s
           2 kbytes:     17.1 usec/IO =    114.1 Mbytes/s
           4 kbytes:     14.5 usec/IO =    268.7 Mbytes/s
           8 kbytes:     16.7 usec/IO =    468.6 Mbytes/s
          16 kbytes:     21.4 usec/IO =    729.7 Mbytes/s
          32 kbytes:     30.2 usec/IO =   1035.4 Mbytes/s
          64 kbytes:     47.7 usec/IO =   1309.7 Mbytes/s
         128 kbytes:     83.4 usec/IO =   1499.0 Mbytes/s
         256 kbytes:    151.2 usec/IO =   1653.9 Mbytes/s
         512 kbytes:    282.9 usec/IO =   1767.1 Mbytes/s
        1024 kbytes:    546.1 usec/IO =   1831.1 Mbytes/s
        2048 kbytes:   1075.3 usec/IO =   1860.0 Mbytes/s
        4096 kbytes:   2112.9 usec/IO =   1893.1 Mbytes/s
        8192 kbytes:   4192.1 usec/IO =   1908.4 Mbytes/s
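
The sync on/off being compared above is the dataset's sync property, which can be toggled from the shell as well as the GUI (the dataset name below is just a placeholder):
Code:
# Check the current setting on the SMB dataset
zfs get sync tank/smb

# sync=standard honours application sync requests, sync=always forces every
# write through the ZIL/SLOG, sync=disabled skips the ZIL entirely (unsafe)
zfs set sync=always tank/smb
zfs set sync=disabled tank/smb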


Any suggestions are welcome.
 

redbull666

Cadet
Joined
Dec 2, 2016
Messages
4
Did you ever fix the issue with the Optane 900p? Seems like a bit of a waste of 400 Euro otherwise!
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Sorry for the late reply; I have been busy. I have done some more testing on both the 8Gb FC SAN and 10GbE storage.

I have some more changes to make to my 10GbE network. Below are the results with an MTU of 1500 and storage pool 1 at 61% used, with the pool in active use.
[Attachment: 10GbE.JPG]


Below is my FC SAN storage pool 2, at 12% used with 27 VMs powered on. I get 24.8K IOPS. Hosts are connected via dual 8Gb FC in round robin. The top VM disk latency is 3.07ms on a loaded VM; the rest are less than 1ms.
[Attachment: SAN.PNG]


I have the Optane split into multiple partitions, 4 x 16GB SLOGs and 1 x L2ARC, which will impact performance. But for a home server this makes more economic sense than buying multiple Optanes. Depending on your use case the Intel 900p Optane can make a big difference and is well worth it if you are running a setup like mine. I'm limited by the spinning disks in my storage pools. To sustain 10GbE I would need more spindles than I have. If I copy a 5GB file it will copy at 1100MB/s, but an 8GB file will drop back to 650-730 MB/s, and I believe this is because the pool can't keep up with the SLOG. At the end of the day the sustained write speed will only be as fast as your pool.
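
A rough sketch of how a single Optane can be carved up like that from the shell (device name, partition sizes and pool name below are hypothetical, and sharing one device between several SLOGs and an L2ARC does trade away some performance and redundancy):
Code:
# Partition the Optane: four 16GB SLOG partitions plus the remainder for L2ARC
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -a 1m -s 16g -l slog1 nvd0
gpart add -t freebsd-zfs -a 1m -s 16g -l slog2 nvd0
gpart add -t freebsd-zfs -a 1m -s 16g -l slog3 nvd0
gpart add -t freebsd-zfs -a 1m -s 16g -l slog4 nvd0
gpart add -t freebsd-zfs -a 1m -l l2arc nvd0

# Attach one partition as SLOG and the large partition as L2ARC to a pool
zpool add pool1 log gpt/slog1
zpool add pool1 cache gpt/l2arc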

Kind Regards,
Simon
 

SMnasMAN

Contributor
Joined
Dec 2, 2018
Messages
177

Hey, thanks for all the info and details you have posted throughout your thread; I have read it two or three times now. However, I'm a bit confused by your most recent/final posts. Am I correct to assume that your ultimate conclusion was that your slower-than-expected performance (with sync=on and the 900p as SLOG) ended up being due to 10G Ethernet networking issues/slowness (i.e. the slow ~691 MB/s sequential on 10G Ethernet vs. the faster ~935 MB/s sequential speeds simply by changing to Fibre Channel)? In other words, by using your dual 8Gb FC networking, the performance was much better?

I.e., in the two CrystalDiskMark screenshots I quoted above, the only real hardware difference between the two results is that the second (faster) result comes from using FC instead of Ethernet? (Right?)

(Or am I reading the last page of your thread incorrectly, and you still do *not* know why you are seeing slower-than-expected performance with your Optane SLOG and sync writes?)

thanks!
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I think the main difference in performance is the pools. The 10G Ethernet test is going to pool 1 (8 x 4TB WD Red SATA 6Gb/s drives in RAIDZ2) and the second pic is pool 2 (7 x 2-way mirrors of 600GB 10K 6Gb/s SAS drives). In conclusion, you cannot write faster than the pool can sequentially write. The SLOG has to flush every 5 seconds to the pool. Hope this helps.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
In conclusion, you cannot write faster than the pool can sequentially write. The SLOG has to flush every 5 seconds to the pool. Hope this helps.

SLOG does not flush to the pool. SLOG is not a cache. The current transaction group is what flushes to the pool.

SLOG is basically write-only during normal operations, and the only time it is read is upon pool import. It's a log.

You can actually write faster than the pool can write (your word "sequentially" is at best superfluous and at worst wrong) but only for the length of time it takes to fill up two transaction groups. After that point, you are blocked because the system needs to finish flushing the first transaction group to the pool.
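
One way to watch this from the host is to monitor per-vdev throughput while a sync-heavy workload runs; the log vdev takes the sync writes immediately, while the data vdevs take the periodic transaction-group flushes (the pool name here is hypothetical):
Code:
# Per-vdev I/O, refreshed every second; the "logs" section shows SLOG writes,
# the data vdevs show the transaction groups being flushed out
zpool iostat -v tank 1

# Transaction group timeout, which defaults to 5 seconds
sysctl vfs.zfs.txg.timeout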

More information on what SLOG is and how it works is available at

https://www.ixsystems.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

Misunderstanding how this stuff works is going to make it harder for you to squeeze maximum performance out of it.
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
OK, I'm reviving this thread. I have a pool with 6 x 2-way mirrors of HGST HUSML4040ASS600 400GB SAS SSDs, with an Intel 900p SLOG. I'm using NFS as the protocol for VM disk storage over a 10G LAG. From within a VM, with sync off I'm consistently getting 943 MB/s, but with sync=always I get 870 MB/s. I can't understand how it could be slower than the pool, as the Optane should be faster than the pool, so I would think the sync writes should be as fast as the non-sync writes.

As you can see below, IOPS are better with sync disabled as well. This test was done on the TrueNAS host, not over the network.

Sync always:
Code:
root@nas1:/mnt/vol3/test # fio --randrepeat=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=2G --readwrite=randwrite --ramp_time=4
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=256
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [w(1)][88.4%][w=65.2MiB/s][w=16.7k IOPS][eta 00m:05s]
test: (groupid=0, jobs=1): err= 0: pid=4593: Fri Sep 25 20:29:55 2020
  write: IOPS=13.8k, BW=53.0MiB/s (56.6MB/s)(1842MiB/34113msec)
   bw (  KiB/s): min=50679, max=72596, per=99.60%, avg=55071.55, stdev=4466.22, samples=65
   iops        : min=12669, max=18149, avg=13767.48, stdev=1116.54, samples=65
  cpu          : usr=4.14%, sys=53.42%, ctx=956412, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,471540,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
  WRITE: bw=53.0MiB/s (56.6MB/s), 53.0MiB/s-53.0MiB/s (56.6MB/s-56.6MB/s), io=1842MiB (1931MB), run=34113-34113msec


Sync disabled:
Code:
root@nas1:/mnt/vol3/test # fio --randrepeat=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=2G --readwrite=randwrite --ramp_time=4
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=256
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [w(1)][73.7%][w=173MiB/s][w=44.2k IOPS][eta 00m:05s]
test: (groupid=0, jobs=1): err= 0: pid=4658: Fri Sep 25 20:31:09 2020
  write: IOPS=38.3k, BW=150MiB/s (157MB/s)(1510MiB/10093msec)
   bw (  KiB/s): min=128066, max=268015, per=99.02%, avg=151669.89, stdev=31094.35, samples=19
   iops        : min=32016, max=67003, avg=37917.00, stdev=7773.56, samples=19
  cpu          : usr=7.01%, sys=73.43%, ctx=12837, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,386474,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
  WRITE: bw=150MiB/s (157MB/s), 150MiB/s-150MiB/s (157MB/s-157MB/s), io=1510MiB (1583MB), run=10093-10093msec
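
(One caveat with these fio runs: the default psync engine is synchronous, so --iodepth=256 effectively runs at a queue depth of 1. Keeping I/Os genuinely queued would need an asynchronous engine; something along these lines, untested here:)
Code:
fio --randrepeat=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=2G --readwrite=randwrite --ramp_time=4 --ioengine=posixaio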
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
OK, I'm reviving this thread. I have a pool with 6 x 2-way mirrors of HGST HUSML4040ASS600 400GB SAS SSDs, with an Intel 900p SLOG. I'm using NFS as the protocol for VM disk storage over a 10G LAG. From within a VM, with sync off I'm consistently getting 943 MB/s, but with sync=always I get 870 MB/s. I can't understand how it could be slower than the pool, as the Optane should be faster than the pool, so I would think the sync writes should be as fast as the non-sync writes.

SLOG is not a cache. The writes to the SLOG happen alongside the writes to the pool. The pool writes are flushed as a transaction group periodically, while the SLOG is the thing that is providing the sync guarantee.

Please do read:

https://www.ixsystems.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

Misunderstanding how this stuff works is going to make it harder for you to squeeze maximum performance out of it.

Sync writes will ALWAYS be slower than standard writes. The best you can do is to get it to a point where it is only marginally slower. You appear to already be in that realm (870/943 is 92%, or 8% speed lost, which is awesome).
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Thanks, I just re-read that. I guess I thought I should be getting more; I'm greedy. Does the SLOG get written to at a queue depth of 1, or more than 1?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
"What's a Q depth"?

The nature of a sync write is that when you ask for a block of data to be written O_SYNC, the promise made by the OS is that it has been committed to stable storage when the write() call returns. This doesn't necessarily mean disk. It *can* be just battery-backed RAM, or flash memory, and the OS might shuffle it somewhere else later on. But once that write() call succeeds, even if the host crashes or power goes out in the next microsecond, you are going to be able to get the written data back later with a read().

The concept of "Q depth" is not compatible with sync writes. By definition, each block written follows the stuff that happens in "Laaaaaatency" in the post I linked to. In that same section, the sentence "Without sync writes, a client is free to just stack up a bunch of write requests and then they can send over a slowish channel, and they arrive when they can" is the answer to "What's a Q depth".

All of this is sort of a pedantic roundabout discussion in order to get you to understand why your question is sorta silly. :smile:

Your concept of "Q depth", however, comes from a low level view of storage hardware. And there's ANOTHER answer hidden in the details:

Queue depth is the number of outstanding I/O operations that might be pending. An I/O operation is generally a 512-byte or 4K-byte sector. You might have a device with a 32-deep NCQ capability. That works out to 16KB or 128KB of outstanding I/O, and ... here's the kicker ... a ZFS block may easily be those sizes, meaning a single ZFS block can saturate the device's NCQ with a single block write.

So the thing is, ZFS doesn't limit the queue size at the hardware level. It writes multiple sectors to the SLOG as fast as it can go, but it is only writing the sectors for the current block it is trying to commit synchronously to the pool. It *must* wait for all sectors to be written to the SLOG before it acknowledges the ZFS block as having been written. Any hardware NCQ-type thing will be momentarily empty while that response is sent through the NFS protocol back to the client, and another block is sent for write.

Understanding this can help you make decisions about how to design block sizes for your NFS VM datastore.
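
For example, on an NFS datastore the knob in question is the dataset recordsize, which can be checked and changed per dataset (the dataset name and value below are hypothetical, and a new recordsize only applies to newly written blocks):
Code:
zfs get recordsize tank/vmstore
zfs set recordsize=16K tank/vmstore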

Either way, you should note that it was very common to lose ~50%+ of your speed in the old days, using high-quality SAS SLOG SSDs. You're only losing 8%. That's great, in my opinion.
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Thanks for your insight. I will re-read this a few times to make sure I've not missed anything.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
For the purposes of "expected SLOG performance" versus "numbers on a spec sheet" - yes, you can look at the "Queue Depth 1" figures for a rough estimate of SLOG performance.

As identified by @jgreco, the size of a record write from ZFS can be many times larger than the sector size that the device accepts. A 128K I/O from ZFS being sent to a device that accepts 4Kn sectors will need to be broken into 32 device I/Os.

If the device is 512e, then you're going to take each of the 4K writes from ZFS (assuming ashift=12) and have to break it into 8 512-byte pieces to pass through the "logical barrier" to the device expecting 512 bytes at a time. The device firmware then reassembles them back into a single 4K.

But all of this back-end work just gets reduced by Marketing down to a single number that's probably titled "128K writes at QD1".
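
For reference, a quick way to see what sector sizes the OS and the pool are actually using (device and pool names are hypothetical; on FreeNAS/TrueNAS, zdb usually has to be pointed at the system's cachefile):
Code:
# Logical sector size ("sectorsize") and reported physical size ("stripesize")
diskinfo -v /dev/nvd0

# Per-vdev ashift recorded in the cached pool configuration
zdb -U /data/zfs/zpool.cache -C tank | grep ashift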
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
I've been tinkering with 32K and 64K block sizes on my FreeNAS boxes. This starts to explain why we see better IOPS with smaller blocks, but it seems to be a trade-off when doing svMotions: small blocks are great for day-to-day production, but once we need to move a VM, the smaller the block, the slower the move. 32K and 64K seem to be a nice balance... but throw metadata vdevs into the mix and 32K seems to make more sense, as you get more out of the metadata drive? Does 32K take better advantage of queue depths? Metadata? Compression too? Easier on queue depths?

At some point with ZFS, does it make sense to aim for block alignment? I.e., running a 32K block size on ZFS and making all our Windows boxes use 32K NTFS clusters as well, instead of the stock NTFS cluster size?
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Performance on TrueNAS 12 had been below 11.3, but with release 12.0-U2 performance has improved to the point where I'm CPU bound. I will put better CPUs in and re-test soon. A pool made up of 6 x 2-way SAS SSD mirrors outperforms a single Intel Optane SSD. I need another Optane to create a mirrored SLOG and test again. All in all I'm happier with the performance of this update.
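
For what it's worth, adding the second Optane as a mirrored SLOG should be a couple of commands along these lines (pool and partition names here are hypothetical):
Code:
# Remove the existing single log device, then re-add it mirrored with the new one
zpool remove tank nvd0p1
zpool add tank log mirror nvd0p1 nvd1p1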

The pool under test has 16 running VMs on it, and the Windows test VM shares a datastore with 7 running VMs.

Sync Enabled
Code:
------------------------------------------------------------------------------
CrystalDiskMark 8.0.1 x64 (UWP) (C) 2007-2021 hiyohiyo
                                  Crystal Dew World: https://crystalmark.info/
------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
  SEQ    1MiB (Q=  8, T= 1):  1764.936 MB/s [   1683.2 IOPS] <  4729.32 us>
  SEQ    1MiB (Q=  1, T= 1):   745.217 MB/s [    710.7 IOPS] <  1404.37 us>
  RND    4KiB (Q= 32, T= 1):    92.611 MB/s [  22610.1 IOPS] <  1371.00 us>
  RND    4KiB (Q=  1, T= 1):    21.556 MB/s [   5262.7 IOPS] <   188.98 us>

[Write]
  SEQ    1MiB (Q=  8, T= 1):   920.601 MB/s [    878.0 IOPS] <  8974.71 us>
  SEQ    1MiB (Q=  1, T= 1):   563.093 MB/s [    537.0 IOPS] <  1857.51 us>
  RND    4KiB (Q= 32, T= 1):    82.793 MB/s [  20213.1 IOPS] <  1558.61 us>
  RND    4KiB (Q=  1, T= 1):    15.111 MB/s [   3689.2 IOPS] <   269.91 us>

Profile: Default
   Test: 2 GiB (x5) [C: 40% (101/249GiB)]
   Mode:
   Time: Measure 5 sec / Interval 5 sec
   Date: 2021/02/13 20:51:08
     OS: Windows 10 Professional [10.0 Build 19042] (x64)
Comment: ESXi 6.7 2 x 10GbE iSCSI RR --> TrueNAS 12.0-U2 6x2 SSD


Sync Disabled
Code:
------------------------------------------------------------------------------
CrystalDiskMark 8.0.1 x64 (UWP) (C) 2007-2021 hiyohiyo
                                  Crystal Dew World: https://crystalmark.info/
------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
  SEQ    1MiB (Q=  8, T= 1):  1731.543 MB/s [   1651.3 IOPS] <  4832.94 us>
  SEQ    1MiB (Q=  1, T= 1):   765.896 MB/s [    730.4 IOPS] <  1366.89 us>
  RND    4KiB (Q= 32, T= 1):    91.497 MB/s [  22338.1 IOPS] <  1390.04 us>
  RND    4KiB (Q=  1, T= 1):    21.070 MB/s [   5144.0 IOPS] <   193.14 us>

[Write]
  SEQ    1MiB (Q=  8, T= 1):  1149.638 MB/s [   1096.4 IOPS] <  7248.69 us>
  SEQ    1MiB (Q=  1, T= 1):   703.703 MB/s [    671.1 IOPS] <  1483.57 us>
  RND    4KiB (Q= 32, T= 1):    83.092 MB/s [  20286.1 IOPS] <  1564.44 us>
  RND    4KiB (Q=  1, T= 1):    18.201 MB/s [   4443.6 IOPS] <   223.52 us>

Profile: Default
   Test: 2 GiB (x5) [C: 40% (101/249GiB)]
   Mode:
   Time: Measure 5 sec / Interval 5 sec
   Date: 2021/02/13 20:43:20
     OS: Windows 10 Professional [10.0 Build 19042] (x64)
Comment: ESXi 6.7 2 x 10GbE iSCSI RR --> TrueNAS 12.0-U2 6x2 SSD
 