Help getting the most out of NVMe SLOG

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Hi all. I was hoping someone could help me with slow write performance to my NVMe SLOG. I am only getting 230MB/s writing to the drive as a SLOG, but I get between 696MB/s and 1164MB/s when testing the drive locally. I have 10Gb networking and get the expected speeds with the MTU set to 1518, just over 9.8Gb/s. When I write to the pool with sync off I get 545MB/s. Below are the test results for the NVMe drive.

Code:
root@nas1:~ # diskinfo -t /dev/nvd0
/dev/nvd0
	   512			 # sectorsize
	   480103981056	# mediasize in bytes (447G)
	   937703088	   # mediasize in sectors
	   0			   # stripesize
	   0			   # stripeoffset
	   SAMSUNG MZ1LV480HCHP-000MU	  # Disk descr.
	   S2C1NAAH600033  # Disk ident.
	   Yes			 # TRIM/UNMAP support
	   0			   # Rotation rate in RPM

Seek times:
	   Full stroke:	  250 iter in   0.012374 sec =	0.049 msec
	   Half stroke:	  250 iter in   0.011606 sec =	0.046 msec
	   Quarter stroke:   500 iter in   0.018414 sec =	0.037 msec
	   Short forward:	400 iter in   0.015944 sec =	0.040 msec
	   Short backward:   400 iter in   0.017356 sec =	0.043 msec
	   Seq outer:	   2048 iter in   0.058376 sec =	0.029 msec
	   Seq inner:	   2048 iter in   0.062954 sec =	0.031 msec

Transfer rates:
	   outside:	   102400 kbytes in   0.093565 sec =  1094426 kbytes/sec
	   middle:		102400 kbytes in   0.080247 sec =  1276060 kbytes/sec
	   inside:		102400 kbytes in   0.167414 sec =   611657 kbytes/sec


Any help in solving this would be greatly appreciated.
Simon

edit: add code tags
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
SLOG writes happen at a queue depth of 1, so I'm going to suggest this is "performance as expected," although you may be able to tune a bit more depending on your dataset. What are the recordsize and the nature of the dataset/zvol being serviced?
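Something like the following would show those settings (the pool/dataset names are just placeholders):

Code:
# relevant properties for a filesystem dataset
zfs get recordsize,sync,compression vol1/yourdataset
# for a zvol, volblocksize plays the role of recordsize
zfs get volblocksize,sync vol1/yourzvol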

If you can briefly remove the drive from the pool (the fact that you've tested sync=disabled suggests you can) and run the command at the top of the SLOG thread, that would be helpful. Also, please post the output using CODE tags for readability.

Command (warning: destructive writes to device):
diskinfo -wS /dev/XXX

Thread:
https://forums.freenas.org/index.php?threads/slog-benchmarking-and-finding-the-best-slog.63521/
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Thanks for your reply. The recordsize is 128K. The datasets are mainly used as storage for audio editing/archiving over SMB shares. I have set the dataset to sync=always to push writes to the NVMe SLOG instead of the spinning disks. I get 900+ MB/s to a non-sync dataset, but when the transaction log is flushed from RAM the file transfer really slows down, then speeds up, then slows again, and repeats. I was hoping to get 500MB/s to the NVMe SLOG I have and, if that works well, upgrade to an Intel 900P.
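For reference, that was set with something along the lines of the command below (the dataset name is just a placeholder; the same option is also available per dataset in the GUI):

Code:
# force all writes on this dataset through the ZIL, and therefore the SLOG
zfs set sync=always vol1/audio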

Code:
root@nas1:~ # nvmecontrol identify nvme0
Controller Capabilities/Features
================================
Vendor ID:				  144d
Subsystem Vendor ID:		144d
Serial Number:			  S2C1NAAH600033
Model Number:			   SAMSUNG MZ1LV480HCHP-000MU
Firmware Version:		   BXV76M8Q
Recommended Arb Burst:	  2
IEEE OUI Identifier:		00 25 38
Multi-Interface Cap:		00
Max Data Transfer Size:	 Unlimited
Controller ID:			  0x01

Admin Command Set Attributes
============================
Security Send/Receive:	   Not Supported
Format NVM:				  Supported
Firmware Activate/Download:  Supported
Namespace Managment:		 Supported
Abort Command Limit:		 8
Async Event Request Limit:   4
Number of Firmware Slots:	3
Firmware Slot 1 Read-Only:   Yes
Per-Namespace SMART Log:	 Yes
Error Log Page Entries:	  64
Number of Power States:	  1

NVM Command Set Attributes
==========================
Submission Queue Entry Size
  Max:					   64
  Min:					   64
Completion Queue Entry Size
  Max:					   16
  Min:					   16
Number of Namespaces:		1
Compare Command:			 Supported
Write Uncorrectable Command: Supported
Dataset Management Command:  Supported
Volatile Write Cache:		Not Present

Namespace Drive Attributes
==========================
NVM total cap:			   0
NVM unallocated cap:		 0


Code:
root@nas1:~ # smartctl -a /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:					   SAMSUNG MZ1LV480HCHP-000MU
Serial Number:					  S2C1NAAH600033
Firmware Version:				   BXV76M8Q
PCI Vendor/Subsystem ID:			0x144d
IEEE OUI Identifier:				0x382500
Controller ID:					  1
Number of Namespaces:			   1
Namespace 1 Size/Capacity:		  480,103,981,056 [480 GB]
Namespace 1 Utilization:			3,177,054,208 [3.17 GB]
Namespace 1 Formatted LBA Size:	 512
Namespace 1 IEEE EUI-64:			002538 2266200033
Local Time is:					  Mon Sep 17 21:22:15 2018 AWST
Firmware Updates (0x07):			3 Slots, Slot 1 R/O
Optional Admin Commands (0x000e):   Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x005f):	 Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp

Supported Power States
St Op	 Max   Active	 Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +	 8.00W	   -		-	0  0  0  0	   30	  30

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +	 512	   0		 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:				   0x00
Temperature:						27 Celsius
Available Spare:					100%
Available Spare Threshold:		  10%
Percentage Used:					0%
Data Units Read:					12,872 [6.59 GB]
Data Units Written:				 67,696,497 [34.6 TB]
Host Read Commands:				 188,906
Host Write Commands:				934,361,473
Controller Busy Time:			   533
Power Cycles:					   37
Power On Hours:					 4,466
Unsafe Shutdowns:				   21
Media and Data Integrity Errors:	0
Error Information Log Entries:	  2
Warning  Comp. Temperature Time:	95
Critical Comp. Temperature Time:	69

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc		  LBA  NSID	VS
  0		  2	 0  0x001b  0x4004  0x000			0	 0	 -
  1		  1	 0  0x001b  0x4004  0x000			0	 0	 -


Code:
root@nas1:~ # diskinfo -wS /dev/nvd0
/dev/nvd0
		512			 # sectorsize
		480103981056	# mediasize in bytes (447G)
		937703088	   # mediasize in sectors
		0			   # stripesize
		0			   # stripeoffset
		SAMSUNG MZ1LV480HCHP-000MU	  # Disk descr.
		S2C1NAAH600033  # Disk ident.
		Yes			 # TRIM/UNMAP support
		0			   # Rotation rate in RPM

Synchronous random writes:
		 0.5 kbytes:	 24.9 usec/IO =	 19.6 Mbytes/s
		   1 kbytes:	 25.0 usec/IO =	 39.1 Mbytes/s
		   2 kbytes:	 26.5 usec/IO =	 73.8 Mbytes/s
		   4 kbytes:	 26.7 usec/IO =	146.5 Mbytes/s
		   8 kbytes:	 28.1 usec/IO =	278.0 Mbytes/s
		  16 kbytes:	 32.7 usec/IO =	477.4 Mbytes/s
		  32 kbytes:	 57.6 usec/IO =	542.5 Mbytes/s
		  64 kbytes:	115.0 usec/IO =	543.3 Mbytes/s
		 128 kbytes:	229.3 usec/IO =	545.2 Mbytes/s
		 256 kbytes:	458.9 usec/IO =	544.8 Mbytes/s
		 512 kbytes:	918.2 usec/IO =	544.5 Mbytes/s
		1024 kbytes:   1837.8 usec/IO =	544.1 Mbytes/s
		2048 kbytes:   3681.4 usec/IO =	543.3 Mbytes/s
		4096 kbytes:   7345.4 usec/IO =	544.6 Mbytes/s
		8192 kbytes:  14784.1 usec/IO =	541.1 Mbytes/s
 
Elliot Dierksen

Joined
Dec 29, 2014
Messages
1,135
First let me say that I am the happy owner of an Optane 900P that I use as an SLOG. I am quite happy to get sustained periods of 4G+ writes, and I get sustained periods of 8G reads (which don't involve the SLOG). You have to keep in mind that performance tuning has elements of whack-a-mole. :) There is always something that is the slowest component; by changing/tuning things, you just move around what that is. Ideally you get to a point where the slowest component is up to the level that you want to achieve. At some point during sustained writes, your SLOG is going to be waiting on your ZFS pool. How is it configured? zpool list -v
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
@Brezlord thanks for the info. Being able to write >500MB/s async would seem to imply that your pool isn't the bottleneck here, at least not at its current capacity.

recordsize is an upper boundary, so even if it's set to the default 128K your application may be sending smaller writes (8K, judging by the 230MB/s limit vs. your drive benchmark?)
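Pulling the relevant QD1 rows out of your diskinfo -wS results above:

Code:
   4 kbytes:	 26.7 usec/IO =	146.5 Mbytes/s
   8 kbytes:	 28.1 usec/IO =	278.0 Mbytes/s
  16 kbytes:	 32.7 usec/IO =	477.4 Mbytes/s

A sustained ~230MB/s sits between the 4K and 8K figures, which would be consistent with the SMB client issuing sync writes well below the 128K recordsize.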

I noticed in your config (in your sig) that you appear to be running a virtualized FreeNAS - what's the hypervisor? I wonder if there are some NVMe shenanigans or BIOS/EFI settings that need changing to get better performance. Samsung only seems to provide a Windows NVMe driver for that drive.
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
@Elliot Dierksen my config as requested.

Code:
root@nas1:~ # zpool list -v
NAME									 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
freenas-boot							 111G  3.35G   108G		-		 -	  -	 3%  1.00x  ONLINE  -
  mirror								 111G  3.35G   108G		-		 -	  -	 3%
   ada0p2								  -	  -	  -		-		 -	  -	  -
   ada1p2								  -	  -	  -		-		 -	  -	  -
vol1									  29T  16.7T  12.3T		-		 -	10%	57%  1.00x  ONLINE  /mnt
  raidz2								  29T  16.7T  12.3T		-		 -	10%	57%
   gptid/87bbb0b8-b0de-11e7-b013-0050568664c9	  -	  -	  -		-		 -	  -	  -
   gptid/88efee3c-b0de-11e7-b013-0050568664c9	  -	  -	  -		-		 -	  -	  -
   gptid/8a69819d-b0de-11e7-b013-0050568664c9	  -	  -	  -		-		 -	  -	  -
   gptid/8bc946ca-b0de-11e7-b013-0050568664c9	  -	  -	  -		-		 -	  -	  -
   gptid/5dd04cc7-c60a-11e7-bf36-000c29765df9	  -	  -	  -		-		 -	  -	  -
   gptid/8e98dafa-b0de-11e7-b013-0050568664c9	  -	  -	  -		-		 -	  -	  -
   gptid/8f8facd4-b0de-11e7-b013-0050568664c9	  -	  -	  -		-		 -	  -	  -
   gptid/90a072c4-b0de-11e7-b013-0050568664c9	  -	  -	  -		-		 -	  -	  -
log										 -	  -	  -		 -	  -	  -
  gptid/376c6c48-ba7c-11e8-b5d8-90e2ba3a89c4   444G  1.41M   444G		-		 -	 0%	 0%
vol2									1.81T  1.99M  1.81T		-		 -	 0%	 0%  1.00x  ONLINE  /mnt
  gptid/3d190491-b8d1-11e8-9fc9-90e2ba3a89c4   928G   980K   928G		-		 -	 0%	 0%
  gptid/3e74bc3a-b8d1-11e8-9fc9-90e2ba3a89c4   928G  1.04M   928G		-		 -	 0%	 0%
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
@HoneyBadger I was running it as a VM on ESXi 6.7 with direct passthrough, but I have since installed it on a bare-metal Dell T320 as per my signature, which I updated last night. The NVMe is in a PCIe 3.0 slot with an x4 link direct to the CPU, so that should be more than enough. I have thought about swapping slots with the HBA as a trial to see what would happen. A workmate sent me this link to look over, as it has a lot of tuning for OPNsense, which runs on BSD; there are many ZFS parameters, but I don't have much knowledge of what they do. https://calomel.org/freebsd_network_tuning.html
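From skimming it, the network side of that guide seems to come down to a handful of TCP buffer sysctls along these lines (copied here as examples only; I don't know yet whether they make sense on this box, and on FreeNAS they would go in under System > Tunables):

Code:
# larger socket buffer limits for 10GbE (example values from that style of guide)
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216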
 
Elliot Dierksen

Joined
Dec 29, 2014
Messages
1,135
I am assuming that "vol1" is the pool you are using for these tests. That pool has a single RAIDZ2 vdev with 8 drives, which is within general guidelines (6-8 drives per RAIDZ2 vdev is the general consensus). vol2 looks like a striped (aka RAID0) vdev with two devices; I hope you aren't storing anything critical there, because losing either drive will kill the pool. Do you get the same results for files on vol2 as you do on vol1? You could also do a local read test on each pool, e.g.:
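(some_big_file below is just a stand-in for any large file already on each pool.)

Code:
# read a large file from each pool and discard the data, to compare raw read speed
dd bs=1m if=/mnt/vol1/some_big_file of=/dev/null
dd bs=1m if=/mnt/vol2/some_big_file of=/dev/null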
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
@HoneyBadger I was running it as a VM on ESXi 6.7 with direct passthrough, but I have since installed it on a bare-metal Dell T320 as per my signature, which I updated last night. The NVMe is in a PCIe 3.0 slot with an x4 link direct to the CPU, so that should be more than enough. I have thought about swapping slots with the HBA as a trial to see what would happen. A workmate sent me this link to look over, as it has a lot of tuning for OPNsense, which runs on BSD; there are many ZFS parameters, but I don't have much knowledge of what they do. https://calomel.org/freebsd_network_tuning.html
Writing with sync off gives you higher speeds, so I'm not sure that network tuning is the answer here.

The fact that your NVMe drive is hitting a very flat limit of ~545MB/s makes me curious - it's not even slowly tapering off, it's just hitting that number and basically staying within 1MB/s of it - coincidentally that's right around the bandwidth cap of a PCIe 2.0 x1 slot.

Just to double-check:

dell_t320.jpg


Which physical slot is your PM953 in, numbered from the top down?
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
@Elliot Dierksen vol2 is only used for testing, nothing important on that pool.

@HoneyBadger I have PCIe cards installed in the following order.

Slot 6 HBA x4
Slot 5 PM953 NVMe
Slot 3 Intel X540 x16
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Assuming you meant slot 4 for the NVMe because the board-labeled "slot 5" doesn't exist; either way, all three lower slots are connected directly to the CPU and not the PCH, so it shouldn't be bottlenecking there.

If this is an M.2 on an adapter card, it's possible that your adapter card is somehow faulty or only wired for a single PCIe lane.
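One quick way to check the negotiated link width from the FreeNAS shell is pciconf; the PCI-Express capability line for the controller should show something like "link x4(x4)" if it has trained at full width (the grep below just pulls out the nvme device and the lines following it):

Code:
# list PCI devices with capability details; check the NVMe controller's "link xN" entry
pciconf -lc | grep -A6 '^nvme'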

The other side of the equation is ServeTheHome's SLOG benchmark graph:

Intel-Optane-NAND-NVMe-SATA-SAS-diskinfo-ZFS-ZIL-SLOG-Pattern-MB.jpg


This shows a Samsung PM953 M.2 480GB hitting that same flat wall at around the same 550MB/s ... so it may actually be the limit of the drive.

(But notice the Optane 900p flying way, way above it.)
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I've read that, and yes, the 900p smashes everything. The PM953 gets ~545MB/s in testing, but in use as an SLOG it only manages ~330MB/s. Is this normal, or do I have tuning issues with FreeNAS?

windows.JPG
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The workload of a regular write vs. an SLOG write is different; a regular write is more akin to "write this and get back to me when you can," versus an SLOG write, which is "write this RIGHT NOW and tell me as soon as you're finished."

Decreased performance is expected even on a drive with PLP, because there is still controller and CPU overhead from the extra confirmation, but that decrease would be measured against the higher (~1GB/s) numbers from your opening post.

At the default 128K recordsize your real-world transfers should still be closer to the tested 545MB/s - maybe network tuning is part of the solution here.
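If you want to take the network out of the picture entirely, one rough local test is to push a few gigabytes into a throwaway sync=always dataset and note the reported speed. The dataset name below is just a placeholder, and compression is disabled so the zeros aren't simply compressed away:

Code:
# scratch dataset that forces every write through the ZIL (and the SLOG)
zfs create -o compression=off -o sync=always vol1/synctest
# stream ~5GiB into it and note the throughput dd reports
dd if=/dev/zero of=/mnt/vol1/synctest/testfile bs=128k count=40960
# clean up
zfs destroy -r vol1/synctest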
 
Joined
May 10, 2017
Messages
838
The fact that your NVMe drive is hitting a very flat limit of ~545MB/s makes me curious - it's not even slowly tapering off, it's just hitting that number and basically staying within 1MB/s of it - coincidentally that's right around the bandwidth cap of a PCIe 2.0 x1 slot.

It couldn't be an x1 slot; the theoretical max is 500MB/s, but the maximum usable is around 400-420MB/s.
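For reference on the math: a PCIe 2.0 lane runs at 5 GT/s with 8b/10b encoding, so 5 GT/s x 8/10 = 4 Gbit/s ≈ 500 MB/s of raw bandwidth per lane, and packet/protocol overhead brings the usable figure down to roughly 400-420 MB/s. A hard wall at ~545 MB/s therefore points away from an x1 link.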
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I have an Intel Optane 900p 280G on order and will hopefully get it next week. We'll see how that performs. Is anyone able to point me to some info on tuning FreeNAS for 10GbE?
 

mjt5282

Contributor
Joined
Mar 19, 2013
Messages
139
IMHO the standard "out of the box" config for 10GbE in 11.2-RC1/2 is pretty good.
I got a speed "bump" for rsync when I upgraded the source and target to 11.2-RC1/2.
PS: I should mention that many of the high-speed Ethernet optimizations were added in FreeBSD 11.2.
 

_bolek_

Cadet
Joined
Aug 29, 2016
Messages
7
@Brezlord The result you have with the Samsung NVMe is accurate. The problem isn't in your setup or optimization, but in the drive's memory structure. If you read about Optane's memory structure you will see that Optane performs almost the same no matter what queue depth you have, so once you use an Optane drive you will see the results you expect.

I have tested Optane drives so far in many different applications (SQL, cache, base drive) and they always show better results than other types of memory; even the latest Samsung enterprise SSDs with over 700k IOPS lost to the Optane drive in almost every test.
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
Just for reference, this is what I get with 8 x WDC WD40EFRX in a RAIDZ2 with the Samsung MZ1LV480HCHP NVMe SLOG. Hopefully the Optane 900p will improve this. Any tips for 10GbE tuning?

Sync off.JPG
Sync on.JPG
 

Brezlord

Contributor
Joined
Jan 7, 2017
Messages
189
I have 29 VMs, with 23 powered on at the moment, on a 3-host cluster with 2 of the hosts powered down most of the time. Most of the VMs aren't doing much at all, as this is just my home/lab setup that I use for learning/playing. I plan on creating a new pool with 10K SAS drives just for a VM iSCSI datastore, and I would like to split the Optane into two partitions so that both pools can have a SLOG. Is someone able to point me to some info on partitioning on FreeBSD?
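For reference, the usual CLI approach on FreeBSD looks something like the sketch below. The device name (nvd1), the labels, the pool name vol3 for the new SAS pool, and the partition sizes are all placeholders, and a SLOG only needs to hold a few seconds of writes, so small partitions are fine. Note that partitions and log vdevs added by hand from the shell aren't created through the FreeNAS GUI, so check how your version handles CLI-added vdevs before relying on this.

Code:
# destroy any existing partition table on the Optane and create a fresh GPT scheme
gpart destroy -F nvd1
gpart create -s gpt nvd1

# two small log partitions, one per pool, aligned to 1MB
gpart add -t freebsd-zfs -a 1m -s 20G -l slog-vol1 nvd1
gpart add -t freebsd-zfs -a 1m -s 20G -l slog-vol3 nvd1

# attach each partition to its pool as a log vdev
zpool add vol1 log gpt/slog-vol1
zpool add vol3 log gpt/slog-vol3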
 