UPDATE: Poor software iSCSI and NFS latencies on Chelsio T420-CR

Jason Keller · Apr 6, 2015

Darren Myers said:
So i'm using 2 ESXi hosts one with a Qlogic 10Gb card and the other with the same chelsio T420-CR card, i installed the drivers for the chelsio card upon installation of ESXi 5.5 u2 because the card wasnt seen what so ever. Are you saying youre getting better performance from your T420 when using the T520 iSCSI driver instead? I have yet to test the speed of either of my cards but i might just carve out some time to do that tomorrow.

EDIT: the "Chelsio Full Offload-iSCSI Initiator Driver v1.0.0.0 for ESXi5.5 " driver?

Correct (note that I am also using FreeNAS with a Chelsio T420-CR in the storage head). I am using both the Chelsio network driver (latest cxgb4) and the csiostor (Chelsio Full Offload-iSCSI Initiator) driver. This will make it appear both as a network adapter, and hardware independent iSCSI HBA under storage adapters.

Darren Myers · Apr 6, 2015

Yea i am using a T420-CR card in my FreeNAS and one of my ESXi servers. I'll install the iSCSI initiator next

Jason Keller · Apr 8, 2015

Just got this from Chelsio...

Greetings Jason,

It was nice talking to you.
The latency related issue faced by you while writing data from ESXi to FreeNAS seems like an expected behaviour.

Please check if this VMware KB helps:
http://kb.vmware.com/selfservice/mi...nguage=en_US&cmd=displayKC&externalId=1002598

Please let us know if you need any other information.

Thanks and Regards,
Ashutosh
[Chelsio Support Team]

jgreco · Apr 8, 2015

That mostly seems like a blowoff to bounce the issue back to your lap. Of course you can configure around it somehow.

cyberjock · Apr 8, 2015

A 64k block size for the zvol is going to be potentially crippling for ZFS. I'd leave it at the default (16k) unless you are 100% sure that your workloads and writes will be big. Hint: The file system updates in the VMs will still be very small (<1KB), so you are pretty much guaranteed to cripple the performance).

Also, writes *will* be bursty. They are cached in RAM, so any writes go to RAM and are acknowledged immediately. So until you actually fill the write cache in RAM, you can basically "write" to the zpool as fast as your network allows.

But I bet if you change your zvol to 16KB blocks you'll find things are back to normal. There's a reason why we choose 16KB as the default. It works extremely well for 99% of situations. ;)

Jason Keller · Apr 21, 2015

I did actually switch back to 16KB blocks. I'm playing with different blocksizes as there are a litany of different recommendations out there (like Oracle's 64KB block size recommendation, likely to help stave off flooding L2ARC references for their mammoth 1.6TB L2ARC devices). So far 16KB does seem to be working well, though I appear to be disk capped (I have hand-me-down crappy disks, and I see several of my flock with millions/billions of delayed ECCs in SMART). I will likely need to go purely solid state to see what it will really do.

I have also received a couple additional follow up emails from Chelsio (slowly). One of them turned me on to the fact that both drivers (cxgb4 and csiostor) are attempting to dynamically flash firmware on the card by default, the network firmware being older (and this is the firmware the card ends up on during operation). They suggested that I disable cxgb4 and check throughput and behavior of csiostor with it disabled. Upon doing so, as I expected the firmware is now at parity with what csiostor was trying to flash it to.

For others to check theirs...

Code:

cat /proc/scsi/csiostor/<<scsi#>>

and to make sure cxgb4 doesn't flash firmware dynamically on load...and check the result

Code:

esxcli system module parameters set -m cxgb4 -p t4_fw_install=0
esxcli system module parameters list --module cxgb4

Jason Keller · May 31, 2015

I know it's been a while since I've posted, but a lot has happened/changed in recent times. I've gone 100% solid state (4x Samsung 850 EVO 1TB in 2 x 2 formation) and all my spinners are gone out of the main pool. I've also received some additional follow-up from Chelsio about once every other week or so, after a couple months of back and forth like this we're finally headed toward a screen sharing session. But I have some really interesting data to throw out there as I still seem to have a vexing issue...write latency. Before anyone says it...(sync=disabled).

The following is from an Ubuntu VM, PVSCSI controller to an RDM of a memdisk passed directly to CTL as an extent for the iSCSI tests. NFS I used a VMDK on the NFS datastore (again memory backed, but had ZFS on top of it with all caching disabled but LZ4 left on).

These throughput tests were done with a single DD thread copying 1MB blocks from /dev/zero to the target. I ran iozone -a against them as well, with equally shocking DAVG results.

20G RAMdisk

NFS
write throughput = 925MB/s
read throughput = 540MB/s
DAVG/write = ~96ms

iSCSI (ESXi Software)
write throughput = 1.1GB/s
read throughput = 550MB/s
DAVG/write = ~96ms

iSCSI (Hardware offload driver)
write throughput = 1.1GB/s
read throughput = 640MB/s
DAVG/write = ~28ms
DAVG/read = ~0.28ms

So, after all that I can only say WTF. The DAVG/read is roughly in line with what pings look like between host/storage. We have read throughput half of write throughput, and write latency is 100x what the read latency is. I've even tried multiple threads of DD on read/write and it makes zero difference. CPU never passes 12% on the FreeNAS box.

I'm hoping this sparks something out there. As there are no doubt many people using the ESXi Software iSCSI adapter and NFS in 10Gbit environments with great success, I'm hard pressed as to what is going on here. At least I have concrete, specific data now directly out of DRAM (so there can be no question of whether or not the disks are interfering).

Jason Keller · Jun 2, 2015

And more interesting results from tonight's gigabit trials...mind you this is all from a 20GB /dev/md1 mem disk on the FreeNAS side pumped directly into CTL as a block extent, passed via PVSCSI PRDM into an Ubuntu 14 VM. I've re-run and included the 10G numbers for comparison.

BNX NC382i Hardware Dependent iSCSI

Reads:
DAVG/cmd 2.01
MBREAD/s 111
READS/s 880-890

Writes:
DAVG/cmd 285.26
MBWRITE/s 112
WRITES/s 224

T420-CR HW iSCSI

Reads:
DAVG/cmd 0.28
MBREAD/s 600
READS/s 4850

Writes:
DAVG/cmd 27.3
MBWRITE/s 1090
WRITES/s 2183

I have never in all my life witnessed such horrendous iSCSI latencies from a ramdisk. This is readily reproducible on both hosts, and I did not have this issue on FreeNAS via 2x BNX NC382i links when I was using the G6 as storage, leading me to believe something is seriously amiss on the current FreeNAS head (an IBM X3650 M3). Anyone with ideas - shout out please.

zambanini · Jun 2, 2015

I suggest you post a bug entry

jgreco · Jun 2, 2015

So, that's what, Westmere-EP kit? Hm. Your results seem kinda baddish, yeah. Wonder if @mav@ would have a comment.

Jason Keller · Jun 3, 2015

Just a heads up that I've now tried these tests using my HP DL380 G6 as the target, with exactly the same results. I'm currently writing up a bug report.

jgreco · Jun 4, 2015

Please post the bug number here when you do.

Jason Keller · Jun 4, 2015

Bug #10049

jgreco · Jun 4, 2015

Also feel free to keep this thread updated with anything you learn or any responses to the bug report.

Ericloewe · Jun 4, 2015

Jason Keller said:
Bug #10049

We passed 10 000 bugs already? Phew.

jgreco · Jun 4, 2015

we always knew that it was a buggy product, somebody should tell jkh [emoji16]

cyberjock · Jun 4, 2015

So if you are on the FreeNAS server and you ping the IP of the ESXi host, what latency do you get? What about vice versa?

What about if, from the FreeNAS server you ping a VMs IP? What about vice versa?

Jason Keller · Jun 4, 2015

Both ways the round trip latency is the same - hovers anywhere between 0.17ms to 0.34ms between the VMkernel interface and the FreeNAS interface.

Jason Keller · Jun 7, 2015

I've gone ahead and disabled modules cxgb4 and csiostor on one of my ESXi hosts and linked it again over gigabit...

DEVICE PATH/WORLD/PARTITION DQLEN WQLEN ACTV QUED %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
naa.6589cfc00000013cb74d8a749d31f222 - 32 - 2 0 6 0.06 683.45 1.92 594.79 0.06 86.43 3.55 0.00 3.55 0.00

naa.6589cfc00000013cb74d8a749d31f222 - 64 - 2 0 3 0.03 893.17 892.69 0.00 111.59 0.00 2.09 0.00 2.10 0.00

While the throughput is a bit lower on writes, the latency is a much, much better 3.55ms writes / 2.09ms reads (and still pegging the wire on reads at 111MB/s). I also noticed that DQLEN is solid at 32 under load now in esxtop.

And also...

~ #esxcli storage core device list
naa.6589cfc00000013cb74d8a749d31f222
Display Name: FreeBSD iSCSI Disk (naa.6589cfc00000013cb74d8a749d31f222)
Has Settable Display Name: true
Size: 2097152
Device Type: Direct-Access
Multipath Plugin: NMP
Devfs Path: /vmfs/devices/disks/naa.6589cfc00000013cb74d8a749d31f222
Vendor: FreeBSD
Model: iSCSI Disk
Revision: 0123
SCSI Level: 6
Is Pseudo: false
Status: degraded
Is RDM Capable: true
Is Local: false
Is Removable: false
Is SSD: true
Is Offline: false
Is Perennially Reserved: false
Queue Full Sample Size: 0
Queue Full Threshold: 0
Thin Provisioning Status: yes
Attached Filters:
VAAI Status: supported
Other UIDs: vml.0100000000356366336663346331366330303030695343534920
Is Local SAS Device: false
Is USB: false
Is Boot USB Device: false
No of outstanding IOs with competing worlds: 32

This output is the same on both hosts. However, on the one with the Chelsio card drivers still installed...

This is the first write pass...
naa.6589cfc00000013cb74d8a749d31f222 - 32 - 3 0 9 0.09 4195.69 5.72 3660.68 0.18 529.27 0.58 0.01 0.58 0.00

Notice that it only goes around 530MB/s with a latency of 0.58ms. But now on the second pass...
naa.6589cfc00000013cb74d8a749d31f222 - 128 - 64 0 50 0.50 2022.27 0.00 2022.27 0.00 1011.13 29.97 0.00 29.98 0.00

See what happened with DQLEN? It didn't throttle down to 32 (it stays at 128). And on gigabit, to start with DQLEN was 64. Bandwidth goes up to almost 1GB/s and latencies went way, way up to 30ms. So I rerun my gigabit test with my now zero-full disk and...

naa.6589cfc00000013cb74d8a749d31f222 - 64 - 64 1 100 1.02 224.90 0.00 224.42 0.00 112.21 284.61 0.60 285.21 0.57

Low and behold, DQLEN sticks, throughput goes up and write latency skyrockets.

So, in sum this issue appears to be happening whenever VMware is overwriting a block of data in a VMDK that is already allocated. Strangely I do not see this occur on local storage. SIOC is not licensed and is consequently not enabled on any LUNs.

Jason Keller · Jun 7, 2015

After additional testing and number crunching, this appears to be an effect of Ubuntu (which apparently, no matter what I seem to do, only seems to like writing in 1MB blocks to the PVSCSI driver) and the random nature in which VMware casually handles their I/O queuing.

For new writes to a thin provisioned volume, vSphere Hypervisor appears to throttle the writes to the LUN to a queue depth of 32 (but only about 3 in flight; resulting, as you see above, in lower throughput but sweet latency). However, on overwrites of a block in VMFS, vSphere isn't throttling anything by default at all, instantly filling the queue up with 1MB writes (which at a default depth of 64 on PVSCSI, creates around the 27-30ms of latency I am seeing, which if my math is correct when all done in parallel would equate to around that much latency across a 10 gigabit link). Same goes for my gigabit results (which explains why they so nicely line up at a power of 10 of each other). Adding a second link in multipath muted about 10ms of latency (with IOPS=1 RR tuning; due to having an additional link this spread the 64 queue into two 32s, which lowered throughput but decreased latency; bringing me to 19-20ms) and added about 200MB/s worth of additional write bandwidth, but that was all. If I unleashed the PVSCSI default queue in the guest OS to fill the 128 the driver is set for I'm pretty sure I might have gotten closer to 2GB/s but would still hit the wall at 28ms for latency; or around 50ms if you fill the 128 deep queue on a single link.

After all that, read bandwidth I can only surmise is that low due to for some reasons Linux, even with multiple DD threads running and on metal, is potentially only issuing the reads to the array at QD=1. In ESXTOP, I never see any queuing at all on reads which makes me awfully suspicious of this.

Now, by all means I'm not saying there isn't a problem (random 48ms write latency spikes from vCenter while your environment is idle is difficult to explain, even in the face of all this, as is the 96ms+ NFS and software iSCSI latency just attempting to boot and use a single low-rent Windows VM). But one moral of the story here I think is that when someone suggests you run tests of this nature to see what results you get during troubleshooting, you may be extremely surprised that even going all solid-state for storage, plus the suggested 10G cards, and all the DRAM and CPU in the world can't help your latency when you fill up your transport.

Important Announcement for the TrueNAS Community.

UPDATE: Poor software iSCSI and NFS latencies on Chelsio T420-CR

Explorer

Guru

Explorer

Resident Grinch

Inactive Account

Explorer

Explorer

Explorer

Patron

Resident Grinch

Explorer

Resident Grinch

Explorer

Resident Grinch

Server Wrangler

Resident Grinch

Inactive Account

Explorer

Explorer

Explorer

Similar threads