Beefy 9.2.x box not performing as expected... ESXi 5.5 host

Flash2k6 · Mar 24, 2014

I'm a little perplexed, I just did a slightly different test. Using the SSD volume the dd still yields 300MB/s type results. (This volume does 1.5GB/s writes and 2.02GB/s reads on the FreeNAS shell).

I ran this same test on my SAS volume and attained 151MB/s writes. (This volume does 990-1.02GB/s writes and similar reads).

I find it odd that proportionally it's almost identical, that is 300MB/s is to 2.0GB as 150MB/s is to 1.0GB on the other volume. When the expected result would have been 300MB/s on the SAS volume as well since it exceeds 1GB/s on the FreeNAS shell. Makes me think there is some sort of ratio based issue going on here.

I'll give NFS a try soon.

jgreco · Mar 24, 2014

That is kind of suggesting to me that there's some sort of overall latency issue. Increased I/O subsystem performance results in increased speeds, but overall there's some fixed amount of overhead.

Flash2k6 · Mar 24, 2014

I agree however the hardware is all the same. The volumes are in the same path, just different disks.

So I would expect similar results between them via iSCSI if they both perform higher locally.

Keep in mind I have two of everything, and the results are the same on both FreeNAS boxes and both hosts.

Flash2k6 · Mar 25, 2014

I was able to get some time this morning to re-configure the FreeNAS box for NFS mode.

Here are my results:

With sync=standard, 33MB/s writes (I expected this to be slow)

With sync=disabled 1.01GB/s writes... (This is what I expected from iSCSI, woo hoo finally full speed performance!).

So there you have it .. over NFS I can max the 10GB NIC and write the data... of course at the risk of having sync=disabled... using a SLOG was only slightly better than the sync=standard, mostly because the SLOG isn't much better than the entire array which is all SSD to begin with and it was a test SLOG device not a performance grade SLC device.

I did however dig a bit into some research about istgt and found that others have issues with it in high performance situations, largely attributed to it being single threaded. So I have a test I'd like to run but I don't know enough about the under-workings of FreeNAS to know the right way to do this test, so I was hoping you could assist jgreco.

I want to launch multiple instances of the istgt driver, with different config files. I want to break my array up into 4 LUNs if you will, and assign one istgt to each LUN, then stripe the LUNs via ESXi. This way if a single istgt can only handle about 300MB, if I broke it into 4 instances, I could in theory hit 1.2GB which is the max of the 10GB link and let the initiator in ESXi manage the data across the iSCSI LUNs.

How would I go about setting this up in FreeNAS? (I ask this because a few times I've tried tweaking settings via the shell, they were overwritten upon reboot, and I haven't dug deep into FreeBSD from the OS/build level to find the source files, so maybe you could point me in the right direction to save me a little time, and make sure that it persists after a reboot?)

Thanks for your help thus far, at least we have some decent answers now on the cause.

HoneyBadger · Mar 25, 2014

I admit it's been awhile since I mucked about with istgt, but if I recall correctly "one thread per LUN" is the default behavior. You might be able to get the desired results by splitting it your data into four LUNs and need no further config tuning. It might be "one thread per connection" though.

It's certainly worth the first step of splitting it into four LUNs for testing though.

(I'm a filthy heathen who uses Solaris/COMSTAR for his performance ZFS boxes. Sorry for not being more help in a high-perf iSCSI environment.)

Flash2k6 · Mar 25, 2014

I just ran the test HoneyBadger, with 4 LUNs, and the results were no different than with 1 LUN, however going on your idea of it being one thread "per connection" I then took those 4 LUNs and multipathed them so they would round-robin and guess what happened? I was able to hit just under 600MB/s, as I forced 2 connections via the multipath.

So I think you're right, and it seems my limiting factor is the single threaded nature of istgt. So now I just need to either figure out how to run multiple instances, or figure out a creative way to force each LUN to be its own "connection".

HoneyBadger · Mar 25, 2014

Thoughts here, can you post your /usr/local/etc/istgt/istgt.conf in codeblocks?

I recall reading that if you set a queue depth manually (ie: under [LogicalUnit1] for your export, set QueueDepth 32 or any non-zero value) it may shift to "one thread per LUN" behavior.

Also for my own curiosity, I didn't see it posted yet, but what kind of SSDs are you using? Seems like very solid performance potential here. High-end SLC I presume.

Flash2k6 · Mar 25, 2014

The SSDs are just standard Samsung 840 Pros, 512GB, 6 of them in the volume. Nothing fancy, and not SLC. My QueueDepth was already 128 (I tried 32, 64, and 128), no change using either.

Here is my config, with the IPs removed. /mnt/tier1 = 6 SSDs /mnt/tier2 = 10 SAS drives

Code:

[Global]
  NodeBase "iqn.2011-03.org.example.istgt"
  PidFile "/var/run/istgt.pid"
  AuthFile "/usr/local/etc/istgt/auth.conf"
  MediaDirectory /mnt
  Timeout 30
  NopInInterval 20
  MaxR2T 32
  DiscoveryAuthMethod Auto
  MaxSessions 16
  MaxConnections 8
  FirstBurstLength 65536
  MaxBurstLength 262144
  MaxRecvDataSegmentLength 262144
  MaxOutstandingR2T 16
  DefaultTime2Wait 2
  DefaultTime2Retain 60
 
[UnitControl]
 
# PortalGroup section
[PortalGroup1]
  Portal DA1 <IP Removed>:3260
 
# InitiatorGroup section
[InitiatorGroup1]
  InitiatorName "ALL"
  Netmask <IP Removed>/24
 
# LogicalUnit section
[LogicalUnit1]
  TargetName "san1target"
  Mapping PortalGroup1 InitiatorGroup1
  AuthMethod Auto
  UseDigest Auto
  ReadOnly No
  UnitType Disk
  UnitInquiry "FreeBSD" "iSCSI Disk" "0123" "002590e7873400"
  UnitOnline yes
  BlockLength 512
  QueueDepth 128
  LUN0 Storage /dev/zvol/tier2/tier2zvol auto
  LUN0 Option Serial 002590e78734000
  LUN1 Storage /mnt/tier1/tier1extent1 250GB
  LUN1 Option Serial 002590e78734001
  LUN2 Storage /mnt/tier1/tier1extent2 250GB
  LUN2 Option Serial 002590e78734002
  LUN3 Storage /mnt/tier1/tier1extent3 250GB
  LUN3 Option Serial 002590e78734003
  LUN4 Storage /mnt/tier1/tier1extent4 250GB
  LUN4 Option Serial 002590e78734004

HoneyBadger · Mar 25, 2014

Grasping at straws here, I see a

Code:

BlockLength 512

in there, which means iSCSI is expecting 512-byte chunks. Maybe a change to

Code:

BlockLength 4096

especially if you've created this zpool with 9.2+ and it's defaulted to ashift=12 (4K aligned) would help.

Have you tried blasting data at it from two concurrent physical hosts? Just wondering if it will scale up linearly with the connections here.

Also you're on 10GbE; I assume you've got jumbo frames enabled across your storage network?

cyberjock · Mar 25, 2014

You need to realize that when you start using your box there's more important aspects than throughput for your pool. As such I've not answered this thread because you're really focusing on problems that aren't going to be problems when you start using the pool. Latency is going to be what pisses you off(and potentially causes a loss of data). And pools are generally designed for throughput OR latency. Not both(unless you've got a $20k+ budget).

Flash2k6 · Mar 25, 2014

Yep, Jumbo frames enabled and confirmed working end to end, and ESXi won't support 4096 block size, it won't mount, it has to be 512. I did do a dual-host blast and the variance was under 20MB/s.

The splitting across multiple LUNs seems to be the biggest improvement to performance thus far. I removed the multipathing and still achieved 500ish MB/s, with multipathing I get closer to 600MB/s.

cyberjock: I'm aware of the differences, but right now I'm just trying to go from one extreme to the other to see what the hardware and FreeNAS can do, document it all, then dial it back into settings that give me the balance I need for my setup. I'm a bit of a tweaker when it comes to maximizing every last ounce out of my setups, and so I need to know the range of what it can/can't do.

I'd love to hear if you have any suggestions on areas to improve the current challenge.

jgreco · Mar 25, 2014

Mmm, jumbo, sometimes there are problems with jumbo because there is usually a different code path handling the frames, sometimes with a different buffer allocator. I'm too lazy to go finding my last set of test results but you might try disabling jumbo as one of your variables.

Flash2k6 · Mar 25, 2014

I did it with and without Jumbo early in my tests, no change.

Keep in mind I'm able to hit line rate with NFS with and without Jumbo.

I'm running another set of tests currently that may prove interesting. I'll circle back when it's done.

Flash2k6 · Mar 25, 2014

Good news... we've solved it. This is a CPU bounding limitation of istgt. The solution was to clone the istgt.conf and run multiple instances on different ports so that each connection would go to a different CPU core. I am now achieving full line rate performance via iSCSI to my array by using multipathing in ESXi so that data is spread across multiple ports. This is with compression turned off.

Hyperthreading in this instance is also a bad thing, because spawning a second istgt instance we found more often than not it was riding the hyperthreaded core of the CPU bound initial instance. So we split it into 4 instances for testing and then we were able to hit full line rate consistently with no istgt instance going over 70% cpu. Prior to splitting it out into 4, we found often that we were maxing 1 at 100% and a second at about 40-60%, digging further that's when we noticed the 2nd one was usually on the hyperthreaded portion.

In the end, we're able to hit full line rate with 1500 MTU AND 9000 MTU Jumbo (it made no difference), and 3 instances of istgt running across 3 ports, with no Hyperthreading.

This array is screaming fast, and very responsive. We're going to do some pretty in depth load testing as our next phase, but at least at the moment, we've got a solid performing foundation.

Thanks for the suggestions folks, you kept us pushing on the path to resolution. We'll post some more in depth test results of file/IOPS etc in the near future.

jgreco · Mar 25, 2014

Awesome! I called it right to begin with, istgt crapping out. :)

vitek · Mar 26, 2014

Flash2k6 said:
Good news... we've solved it. This is a CPU bounding limitation of istgt. The solution was to clone the istgt.conf and run multiple instances on different ports so that each connection would go to a different CPU core. I am now achieving full line rate performance via iSCSI to my array by using multipathing in ESXi so that data is spread across multiple ports. This is with compression turned off.

Hyperthreading in this instance is also a bad thing, because spawning a second istgt instance we found more often than not it was riding the hyperthreaded core of the CPU bound initial instance. So we split it into 4 instances for testing and then we were able to hit full line rate consistently with no istgt instance going over 70% cpu. Prior to splitting it out into 4, we found often that we were maxing 1 at 100% and a second at about 40-60%, digging further that's when we noticed the 2nd one was usually on the hyperthreaded portion.

In the end, we're able to hit full line rate with 1500 MTU AND 9000 MTU Jumbo (it made no difference), and 3 instances of istgt running across 3 ports, with no Hyperthreading.

This array is screaming fast, and very responsive. We're going to do some pretty in depth load testing as our next phase, but at least at the moment, we've got a solid performing foundation.

Thanks for the suggestions folks, you kept us pushing on the path to resolution. We'll post some more in depth test results of file/IOPS etc in the near future.

Can you provide a small guide to how you can setup the use of multiple istgt instances? I for one have the same issue with the cpu bound problems with the istgt.

cyberjock · Mar 28, 2014

Putting in a ticket at bugs.freenas.org might be useful for future users too.

mikebest66 · Aug 11, 2014

Flash2k6 have you put this into production? Are you happy with the results?

Flash2k6 · Aug 11, 2014

Yes, the box has been in production since mid-April and it's been performing quite well thus far. We're running just over 30 VMs using this NAS as the storage point, 3 of which are fairly decent transaction volume DB VMs. No major issues thus far, but we're data paranoid so we have the NAS replicating to another location and we're also backing up the VMs on the filesystem level (within the VMs OS that is), and will continue to do that until we have a hardware failure to see how reliable it is and how easy it is to recover from upon failure.

So far so good though...

diehard · Aug 12, 2014

Flash2k6 said:
Yes, the box has been in production since mid-April and it's been performing quite well thus far. We're running just over 30 VMs using this NAS as the storage point, 3 of which are fairly decent transaction volume DB VMs. No major issues thus far, but we're data paranoid so we have the NAS replicating to another location and we're also backing up the VMs on the filesystem level (within the VMs OS that is), and will continue to do that until we have a hardware failure to see how reliable it is and how easy it is to recover from upon failure.

So far so good though...

Are you still running without sync on? What SSD are you using as an SLOG?

Important Announcement for the TrueNAS Community.

Beefy 9.2.x box not performing as expected... ESXi 5.5 host

Dabbler

Resident Grinch

Dabbler

Dabbler

actually does care

Dabbler

actually does care

Dabbler

actually does care

Inactive Account

Dabbler

Resident Grinch

Dabbler

Dabbler

Resident Grinch

Dabbler

Inactive Account

Cadet

Dabbler

Contributor

Similar threads