Beefy 9.2.x box not performing as expected... ESXi 5.5 host

Status
Not open for further replies.

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
This is my first post on the forums. I've spent countless hours reading over the past several weeks and tried so many settings that I'm finally asking for help, because I just can't get the results I expect from this box.

Hardware setup:
Supermicro X9SRH-7TF w/ Intel X540 10GbE NICs
Xeon E5-1650 v2 3.5GHz hyper-threaded 6-core CPU
LSI 2308-based HBA (using the "mps" driver)
128GB of RAM

There are 18 drives in the machine: 10 2TB SAS drives, 6 SSDs for a second volume, and 2 SSDs for the ZIL/SLOG.

2 Different ZFS volumes are configured:

Volume1: 6 SSDs (I tried RAIDZ2 and RAID 1+0)
Volume2: 10 2TB SAS drives (I tried RAIDZ2 and RAID 1+0), with the 2 SSDs mirrored for ZIL/SLOG.

I ran all of jgreco's tests that he recommended in this thread:
http://forums.freenas.org/index.php?threads/write-performance-issues-mid-level-zfs-setup.13372/

Running those tests from the FreeNAS shell, I get amazing results.

For dd writes on the SAS array I get:
1373164797952 bytes transferred in 300.424637 secs (4570746299 bytes/sec)

For dd reads on the SAS array I get:
1373164797952 bytes transferred in 151.147971 secs (9084903935 bytes/sec)

I have lz4 (default) compression on.

Watching iostat while running dd with a 262144 block size, each drive maintains a fairly steady 148-155MB/s on the SAS volume and 250-255MB/s on the SSD volume. iostat on the SSD volume shows even higher read/write rates for the dd test.

Running individual drive tests yields the same results, so there is no bottleneck on the bus or HBA.
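The individual-drive checks were along these lines (device names are illustrative, and the dd reads from the raw device so nothing gets overwritten):

# read ~10GB straight off one raw drive
dd if=/dev/da0 of=/dev/null bs=262144 count=40960
# in a second shell, watch per-device throughput at 1-second intervals
iostat -x -w 1 da0 da1 da2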

iPerf tests between the FreeNAS box and my host yield a steady 9.4-9.6Gbit/s, so there are no issues at the network layer. The physical configuration is the FreeNAS box and the ESXi host plugged into an isolated 10GbE switch; I even tried a 10G crossover cable connecting the host directly to the FreeNAS box, eliminating the switch from the equation. No change in the results.
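The iperf runs were nothing fancy, roughly this (the IP is just an example):

# on the FreeNAS box
iperf -s
# from the other end: 30-second run with a larger TCP window
iperf -c 192.168.10.10 -t 30 -w 256K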

I've run with autotune off and autotune on, and I've tried net.inet.tcp.delayed_ack=0 and kern.ipc.nmbclusters at various levels, with no change in my results.

Autotune set these sysctls for me:

kern.ipc.maxsockbuf = 2097152
net.inet.tcp.recvbuf_max = 2097152
net.inet.tcp.sendbuf_max = 2097152

Tunables set by Autotune:

vfs.zfs.arc_max = 97174518588
vm.kmem_size = 107971687321
vm.kmem_size_max = 134964609152
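For anyone comparing, the sysctls can be checked straight from the shell; the vm./vfs.zfs. values are boot-time tunables, so changing them only sticks via the GUI's Tunables tab and a reboot:

# verify what autotune applied (runtime sysctls)
sysctl kern.ipc.maxsockbuf net.inet.tcp.recvbuf_max net.inet.tcp.sendbuf_max
# the loader tunables are visible the same way, but only take effect at boot
sysctl vfs.zfs.arc_max vm.kmem_size vm.kmem_size_max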

I've tried both NFS- and iSCSI-based reads/writes, and when I run dd tests from the host with the FreeNAS box mounted, I average 500-515MB/s writes (with 1M and 4M block sizes) and 330-340MB/s reads. (Which makes no sense; reads should be faster...)

I've tried with and without the SLOG and see no difference in performance. Memory usage holds fairly steady at 30GB. CPU has never gone above 20%.

Over iSCSI with sync=standard (the default), I see the results above. If I set sync=always, my write speeds drop to 31MB/s. My NFS speeds are similar in default mode (even with the SLOG), and if I disable sync, NFS performs similarly to iSCSI in normal mode.
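For the sync comparison I'm just flipping the dataset property back and forth, something like this (the dataset name is illustrative):

# force all writes to be synchronous (worst case, exercises the SLOG)
zfs set sync=always tank/iscsi
# back to the default behaviour
zfs set sync=standard tank/iscsi
zfs get sync tank/iscsi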

I'm truly at a loss here. On the FreeNAS box's own command line the drives perform at incredible rates, but when connecting to the box via NFS/iSCSI, performance is 1/10th to 1/25th of the on-box levels.

So my question is... how bad is the ESXi overhead, and are there any additional settings/tweaks I can make, or tests I can run, to try to isolate the issue?

Should the variance between on-box and off-box performance be this extreme? I've had standard hardware RAID setups perform far better than this, so I feel something just isn't right.

Any assistance is appreciated.

Thanks much!

-F
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
I think the current recommendation is to turn Hyperthreading off... at least for the E3 v3; I'm not sure whether that applies only to those specific processors, but it might be worth a shot.
Then the question is: which SSDs are you using?
If the speed is good with sync=disabled and bad with sync=always, then the disks can't service the sync writes fast enough. One option is to increase IOPS by using a set of mirrored vdevs (5x 2-way in your case) and, of course, a dedicated ZIL (SLOG).
Depending on the type of SSD, the sync speed variance can be quite extreme; my tests yielded write speeds from 20 to 150MB/s depending on SSD type.
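In raw zpool terms that layout for your 10 SAS drives would look roughly like this (FreeNAS normally builds it through the volume manager GUI; the device names are just placeholders):

# 5 x 2-way mirrors for IOPS, plus a mirrored SLOG on the two SSDs
zpool create tank \
    mirror da0 da1  mirror da2 da3  mirror da4 da5 \
    mirror da6 da7  mirror da8 da9 \
    log mirror ada0 ada1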
 

TheSmoker

Patron
Joined
Sep 19, 2012
Messages
225
Because you use compression, the all-zero dd tests you ran are simply irrelevant. Unless you used /dev/random as your data source instead of /dev/zero... and maybe not even then.
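If you do want to keep compression on, one quick-and-dirty option is to stage some random data first and use that as the dd source, roughly (paths are just examples):

# stage an incompressible test file once (slow; /dev/random can't keep up with the pool)
dd if=/dev/random of=/mnt/tank/random.bin bs=1048576 count=8192
# reuse it as the write source; on re-runs the source largely comes from ARC,
# but reading and writing the same pool still muddies the numbers a bit,
# so simply disabling compression on a test dataset is usually cleaner
dd if=/mnt/tank/random.bin of=/mnt/tank/testfile bs=1048576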
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
TheSmoker: I woke up this morning and realized the same thing, so I'm going to re-run the tests with compression turned off. That said, Rand: I can disable Hyperthreading, but I don't think that's going to change anything. The concerning issue is the read speeds; for an array this size, 300MB/s is abysmal at best. And I'm not convinced yet that the ZIL is the issue, because I get the same results with and without it.

I'll disable compression and re-run all my tests today.
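(For reference, the toggle is just the dataset property; the dataset name below is illustrative:)

# disable compression on the test dataset and confirm before re-running dd
zfs set compression=off tank/test
zfs get compression tank/test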
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
Here are my results with compression disabled:

Command used for all tests
dd if=/dev/zero of=/mnt/<volume>/testfile bs=1048576
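The read tests were just the inverse of that, reading the file back out to /dev/null, roughly:

dd if=/mnt/<volume>/testfile of=/dev/null bs=1048576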

Ran ON the FreeNAS local command line (shell):

SAS ZFS Volume -
Writes:
282403536896 bytes transferred in 303.451011 secs (930639631 bytes/sec) 908MB/s

Reads:
282404585472 bytes transferred in 294.875119 secs (957709102 bytes/sec) 935MB/s

SSD ZFS Volume -
Writes:
108106088448 bytes transferred in 70.191884 secs (1540150830 bytes/sec) 1.5GB/s

Reads:
108107137024 bytes transferred in 52.733197 secs (2050077432 bytes/sec) 2.02GB/s

--

Ran via iSCSI mount, directly on the ESXi host command line (shell):

SAS ZFS Volume -
Writes:
49753882624 bytes transferred in 301.315101 secs (165846275 bytes/sec) 161MB/s

Reads:
49753882624 bytes transferred in 247.254604 secs (201432723 bytes/sec) 196MB/s

SSD ZFS Volume -
Writes:
66076016640 bytes transferred in 303.147502 secs (217971949 bytes/sec) 212MB/s

Reads:
66076016640 bytes transferred in 302.660123 secs (218317639 bytes/sec) 213MB/s
 

eraser

Contributor
Joined
Jan 4, 2013
Messages
147
... directly on the ESXi host command line (shell):

You can't really trust performance numbers generated directly from the ESXi shell (I assume this is from either the console login or an SSH login). The shell automatically has resource limits applied so that commands run from it don't affect running guests. (Under ESXi 5.1 the limits are under "VI Client -> Configuration -> System Resource Allocation" and should not be changed.)

You'll want to run performance tests from inside a Guest VM instead.
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
I should have been more clear.

My 300MB/s reads and 500MB/s writes (with compression) were from inside a guest VM.

This most recent test was run directly from the ESXi shell.

I will run the guest tests again with compression off for comparison, but previously it was around 280MB/s.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Well, there are some 10Gb optimizations you can try as well, primarily enlarging the tx/rx buffers, thread distribution and so on; they resulted in higher speeds for me (though, weirdly, still higher write speeds than read speeds; I haven't gotten to the bottom of that one).

Don't forget the adapter settings: have you set lro, tso4, txcsum and rxcsum in your ifconfig options? (And whatever else the X540 supports; those are from my X520.)
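On my box those are plain ifconfig flags, roughly (ix0 is my interface name, yours may differ; I persist them via the NIC's Options field in the GUI):

# enable the offload features, then display the interface to confirm they took
ifconfig ix0 lro tso4 txcsum rxcsum
ifconfig ix0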
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
Rand: Would you mind sharing your setup, both the hardware and your settings?

I'd love to have a comparison setup to see where mine is falling short.

Thanks much
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Well, I'm on a system far less powerful than yours, so I suppose you should be able to reach at least my numbers (except maybe for sync writes, not knowing your SSDs).
And I'm not sure these are really suitable settings; it's just what I use at the moment, without having finished evaluating their stability/optimization potential.

I haven't run iSCSI tests; from a Windows guest VM (via NFS) I get ~300MB/s reads and ~150MB/s writes.
The reads are in line with what I usually get from my array (no idea why reads come out worse than writes; maybe one of those settings, but I couldn't find which one). The writes are limited by the ZIL: different ZILs, different write speeds, cross-checked with sync=always and CIFS from a regular Windows client.

E3-1230 v3, 16GB, 8x 3TB Toshiba 5900rpm (SATA3), SSD400, Intel X520-T with lro, tso4, txcsum, rxcsum, MTU 9014.

Tunable:
hw.intr_storm_threshold=10000

Sysctls: see the attached image.
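(That one can be applied at runtime for a quick test before making it permanent as a tunable in the GUI, e.g.:)

# interrupt storm threshold is a runtime-writable sysctl on FreeBSD
sysctl hw.intr_storm_threshold=10000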

Btw, I used this as a starting point.
 

Attachments

  • Sysctls.png (15.2 KB)

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
Well, I'm at a loss. No matter what I do, I just can't seem to get above 330MB/s over iSCSI, even though on the FreeNAS command line I get 1.5GB/s writes and 2.0GB/s reads on my SSD volume. (Compression is off on my ZFS volumes.)

I've confirmed all of the NIC settings are good (LRO, TSO4, TXCSUM, RXCSUM and a few others the X540 supports) and even verified at the switch level that jumbo frames at 9000 MTU are working correctly end to end.
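(The end-to-end jumbo-frame check was just a don't-fragment ping sized to the MTU; the IP is an example:)

# 8972-byte payload = 9000 MTU - 20 (IP header) - 8 (ICMP header); -D sets don't-fragment
ping -D -s 8972 192.168.10.20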

I'm fairly confident this isn't a network-related issue but some sort of iSCSI-related throughput issue. iPerf tests at 9.4-9.6Gb/s on both NICs, both through the switch and directly connected. Port stats on the switch confirm it's hitting 9.5+Gb/s during the iPerf tests.

I also re-tested RAID 1+0 and RAIDZ2 with and without the SLOG (not that it should matter a whole lot since I'm using iSCSI, not NFS), and going without the SLOG actually proved slightly better than with it.

I tried all of your sysctl settings and the tunable, and no change.

Hopefully someone with a bit more FreeNAS iSCSI knowledge can jump in and offer some next steps for me.
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
I suggest you use multipath with two NICs, and don't forget to set the IO path-switching policy for the volumes via the vSphere CLI.
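On the ESXi 5.5 CLI that's roughly the following (the naa ID is a placeholder for your iSCSI LUN):

# set the path selection policy for the LUN to round robin
esxcli storage nmp device set --device naa.XXXXXXXX --psp VMW_PSP_RR
# switch paths every I/O instead of the default 1000 I/Os
esxcli storage nmp psp roundrobin deviceconfig set --device naa.XXXXXXXX --type iops --iops 1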
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
I can try that, but I don't see how multipath will help; I'm not even exceeding 3.2Gb/s on the NIC as it is.

I ran another test with two hosts simultaneously running a benchmark, and the combined total of both hosts equalled that of a single host, which makes me think this may be the limit of this setup.

I will try multipath today. Thanks for the suggestion.
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
I set up multipathing; no change in performance. I tried all the variations within ESXi. Round robin worked as expected, 50% to one NIC and 50% to the other, but still 300MB/s total. I verified transfer rates on the switch and within ESXi's reporting for all tests, and traffic was flowing correctly.

I'm open to other suggestions. Just as a baseline, do any of you have systems performing better than 300MB/s on an SSD volume via iSCSI?
 

eraser

Contributor
Joined
Jan 4, 2013
Messages
147
I would think that it takes quite a bit of CPU to move that much data. What does the CPU utilization on your ESXi host (and FreeNAS system) look like during your tests?
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
The ESXi host is a 16-core box; the guest OS is configured with 4 cores, running at 3.4GHz on v2-generation Xeons. The ESXi host has 128GB of RAM and the guest has 8GB allocated to it. The only guests on this box are for these tests; nothing else exists on the ESXi hosts (I have two identical units), and the FreeNAS box only holds this test data.

A single dd takes about 85% of one CPU thread on the ESXi side, and FreeNAS reports 12% total CPU (however, it's a 6-core box, so ~16.7% would be 100% of a single core, meaning roughly 70-80% of one core on FreeNAS, which makes me wonder whether the FreeNAS iSCSI target is multi-threaded and can use multiple cores).

I can run two dd's in parallel and each runs at 85% CPU on its own thread. If I run 2 threads, or even 2 separate guests at the same time, the cumulative results are the same as a single guest running by itself; that is, 2 machines, 2 different ESXi hosts, 2 different guests hitting the same FreeNAS box still total around 300MB/s. If I run 1 ESXi host, 1 FreeNAS, 1 guest, it alone also hits 300MB/s.

This leads me to believe it's all on the FreeNAS side, as multipathing made no difference; it just split the traffic into 150MB/s per NIC instead of 300MB/s on one, and my results are the same no matter what I do: a cap of around 300MB/s.

Does anyone have a sense of the theoretical overhead between running dd locally on the FreeNAS box and running it via iSCSI, so I know what my expectation "should" be?

I've got an all-SSD array of 6 drives that I've tried RAID 1+0 and RAIDZ2 on, and the results are all very similar: 1.5-2.0GB/s on the local box, 300MB/s via iSCSI.

My sync setting is standard, which may be another question: are my local tests skewed by how the data is written locally versus over iSCSI?
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
As a follow-up, the disk I/O graphs on FreeNAS show each individual disk hitting 200+MB/s when I run my tests in the local shell, but when I run the iSCSI test they hit around 60MB/s (which, if you do the math, lines up with the local vs. iSCSI throughput I'm seeing).

So I know these drives are capable of hitting those speeds. I should be able to max out the 10Gb link fairly easily via iSCSI, since the link would be the bottleneck compared to what the local shell is reporting.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You're kind of out there in lightly explored territory, sorry to say. You appear to be doing the right things and seem to have a handle on it all. My main thought is that we've always kind of known that istgt will crap out at some point with all the user/kernel syscall foo, and one thing to try might be to see if you can manually configure the kernel iscsi target subsystem and see if it does any better. I don't have any specific guidance for you there, but based on how thorough you appear to have been trying to solve this so far, I'm guessing you'll figure it out.

It would be interesting for you to experiment with NFS as well, "just to see."
 

Flash2k6

Dabbler
Joined
Mar 22, 2014
Messages
21
Thanks jgreco. I did try NFS in a rough-cut test and it didn't do so great, but that said, it was also a by-the-book attempt and I didn't bother tracking the results since it wasn't my longer-term goal. I'm in the process of setting that up again for a comparable test. I'll let you know what happens.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Hm,
just ran an iSCSI write test (dd) from an ESXi guest: ~280MB/s.
Same volume, iSCSI with an NTFS-formatted disk on a Win7 workstation: ~470MB/s sequential on CrystalDiskMark (but only ~270MB/s read).
 