UPDATE: Poor software iSCSI and NFS latencies on Chelsio T420-CR

Jason Keller

After a little more analysis, I think I know what's going on with this particular test instance (and this dives a bit deeper into the VMware vSphere Hypervisor than I intended to get into here).

See this first write pass on the thin provisioned disk?
DEVICE PATH/WORLD/PARTITION DQLEN WQLEN ACTV QUED %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
naa.6589cfc00000013cb74d8a749d31f222 - 32 - 3 0 9 0.09 4195.69 5.72 3660.68 0.18 529.27 0.58 0.01 0.58 0.00

So DQLEN is at 32 (the max queue depth to the LUN), ACTV is at 3 (so only 3 I/Os in flight), and MBWRTN/s is about 530MB/s at a latency of 0.58ms, seen at the storage adapter device (DAVG).
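For anyone who wants to capture the same counters, this is roughly how I pull them - interactive esxtop with the disk device view, or batch mode to a CSV (flags from memory, so double-check them against your ESXi build):

esxtop                                            # press 'u' for the disk device view, 'f' to toggle fields
esxtop -b -d 5 -n 60 > /tmp/esxtop-devices.csv    # batch mode: 5 second samples, 60 iterations, all counters to CSV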

I couldn't explain what VMware was doing here before. However, looking back at this setting on the LUN...
No of outstanding IOs with competing worlds: 32

Apparently, if the disk is thin provisioned, writing to new sectors invokes a system world process that acts against the disk (which, per the setting above, will lower DQLEN to that 32 while two worlds are both writing to it, ostensibly to avoid latency bursts). However, you can see that process can't seem to muster more than 3 I/Os in flight, which keeps both latency and bandwidth low.
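For reference, that per-device limit can be checked and raised from the ESXi shell. A rough sketch, assuming 5.5+ where this became a per-device esxcli option instead of the old global Disk.SchedNumReqOutstanding advanced setting - verify the exact syntax on your build:

esxcli storage core device list -d naa.6589cfc00000013cb74d8a749d31f222       # shows "No of outstanding IOs with competing worlds"
esxcli storage core device set -d naa.6589cfc00000013cb74d8a749d31f222 -O 64  # raise the limit for this one device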

Now, when the disk is essentially all provisioned out (all the VMFS block pointers are present and allocated, i.e. Thick Lazy Zeroed or Thick Eager Zeroed), the system world process no longer has to act against the disk when writes occur. So this happens...

DEVICE PATH/WORLD/PARTITION DQLEN WQLEN ACTV QUED %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
naa.6589cfc00000013cb74d8a749d31f222 - 128 - 64 0 50 0.50 2022.27 0.00 2022.27 0.00 1011.13 29.97 0.00 29.98 0.00

DQLEN now sits at the same value as AQLEN, 128, and ACTV shows 64 I/Os in flight (and at a 1MB block size, ouch!) at a bandwidth of almost 1GB/s and a latency of about 30ms at the adapter (DAVG). But why did it stop at 64 when DQLEN is 128, you might ask? The answer was simplicity itself - the default queue depth of a PVSCSI adapter inside a guest OS is 64. Cranking it up to 128 with kernel boot arguments produced the expected result of 128 I/Os in flight, with the latency consequence of about 50ms.
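In case anyone wants to replicate the queue depth bump: on a Linux guest it comes down to module parameters for the vmw_pvscsi driver on the kernel command line. The values below are what I would try, not gospel - check VMware's PVSCSI queue depth KB for your kernel:

# /etc/default/grub in the guest - append the PVSCSI parameters to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet vmw_pvscsi.cmd_per_lun=128 vmw_pvscsi.ring_pages=32"

# regenerate the grub config (update-grub on Debian/Ubuntu, grub2-mkconfig on RHEL/SUSE), reboot,
# then confirm the per-device queue depth took effect:
cat /sys/block/sda/device/queue_depth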

So after all this, I guess I'm back to square one with Chelsio, trying to get some answers as to why their adapters have such terrible latency with software iSCSI and NFS (even something as innocuous as booting a VM incurs 96ms+ latencies). Or I might just forget about it and caution against using any of their adapters with ESXi unless you're running the full acceleration driver, which I've already switched to.
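If you want to check which driver path your Chelsio card is actually on before deciding, something along these lines works from the ESXi shell (exact driver and VIB names will depend on which bundle you installed):

esxcli network nic list                       # the driver column shows what each vmnic is bound to
esxcfg-scsidevs -a                            # storage adapters; the full offload iSCSI adapter shows up here with its driver
esxcli software vib list | grep -i chelsio    # which Chelsio VIBs are installed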
 