10Gb NFS vs iSCSI benchmarks. ESXi 6.5 with 5 IOAnalyzer workers. Any tuning suggestions?

beezel

Dabbler
Joined
Nov 7, 2016
Messages
33
I've recently set up a 12x6TB SAS pool in RAID 10 (striped mirrors) with an Intel P3520 PCIe NVMe SLOG on a new Supermicro chassis. Ethernet is Intel X740 and X540 PCIe cards, into a Quanta LB6M switch.

In the past we used iSCSI for hosts to connect to FreeNAS because we had 1Gb hardware and wanted round-robin multipathing, etc. Now that we're moving to 10Gb, we decided to test NFS vs iSCSI and see exactly what came of it. Our workload is a mixture of business VMs: AD, file server, Exchange, Vendor App A, etc. There is no real ZFS tuning, only a few NIC options (as noted).

Interestingly, iSCSI performs best without jumbo frames, while NFS seems to perform best with them enabled.
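
(For anyone reproducing this: jumbo frames only help if the MTU matches end to end. On the FreeNAS side that's just the interface MTU, something like

ifconfig ix0 mtu 9000

with the switch ports and the ESXi vSwitch/VMkernel ports also set to 9000. Interface name ix0 is only an example; check yours with ifconfig.)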

From my untuned results, it looks as though iSCSI is still the way to go for us from a latency and random-performance perspective. Since latency is so important to us, I think that's where we'll end up.

Some questions this brought up: is there any way to reduce latency in general? Things REALLY spike under full load, and even the more random workloads show worryingly high latencies of 20+ ms.

Does anyone have suggestions for tuning either iSCSI or NFS? I'd also be happy to use different tests; I'm just using a few from VMware's IOAnalyzer fling, without any real design to it.

We did initially test NFS with no tuning options, as well as iSCSI, but they quickly fell behind the other results that used "rxcsum txcsum tso4 lro", so they were abandoned.
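
(For clarity, those are standard FreeBSD ifconfig offload flags applied to the ix interfaces; on FreeNAS we set them in the interface's Options field, which amounts to something like

ifconfig ix0 rxcsum txcsum tso4 lro

i.e. checksum offload in both directions, TCP segmentation offload for IPv4, and large receive offload. Again, ix0 is just our interface name.)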

https://docs.google.com/spreadsheets/d/1J15gXMUIIYfI0xaOP7coELHh9-2CcFZv3VMUPmSlQc8/edit?usp=sharing

Thanks!


EDIT: I discovered my initial latency figures were SUMs, not AVGs. I've corrected the data. Still some worrying outliers, but no longer the 700 ms times that scared me.
 
Last edited:

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Some questions this brought up: is there any way to reduce latency in general? Things REALLY spike under full load, and even the more random workloads show worryingly high latencies of 20+ ms.

Does anyone have suggestions for tuning either iSCSI or NFS? I'd also be happy to use different tests; I'm just using a few from VMware's IOAnalyzer fling, without any real design to it.

Interesting data..

I'm curious: what was the number of NFS servers set to in the NFS settings on the FreeNAS side?
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
VMware forces sync writes on NFS datastores; this is what increases latency. ARC works better with NFS, so you might use a really nice SLOG like an Optane SSD. Just for testing, you could set

zfs set sync=disabled yourpoolname/dataset

The forum offers more information about this topic.
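
and to go back to the default behavior after testing:

zfs set sync=standard yourpoolname/dataset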
 

beezel

Dabbler
Joined
Nov 7, 2016
Messages
33
VMware forces sync writes on NFS datastores; this is what increases latency. ARC works better with NFS, so you might use a really nice SLOG like an Optane SSD. Just for testing, you could set

zfs set sync=disabled yourpoolname/dataset

The forum offers more information about this topic.

Thanks for the reply, but you're a bit off: we want sync enabled, and during our iSCSI tests we used sync=always, so there is no difference between iSCSI and NFS in that regard.

The P3520 isn't quite an Optane, but it's still damn fast over NVMe.

We don't want to bother testing with sync disabled since we'll never run that workload.
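
For reference, we force sync on the iSCSI side with the mirror image of the command above (the zvol name here is just an example):

zfs set sync=always yourpoolname/iscsi-zvol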
 

beezel

Dabbler
Joined
Nov 7, 2016
Messages
33
I suspect it will if you have a lot of random I/O.

What CPU are you running in your box? We typically use 24-48 on E5-1620 v2s.

Thanks. I had no idea it should scale that high; the documentation says to use between 4 and 6. This is why this forum is so valuable!

We have dual six-core E5-2603 v4s with HT enabled.

I'll rerun some of the benches with those new values. I'm guessing I'll need to restart the NFS server, but do you think I'll need a full reboot?
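
(If FreeNAS behaves like stock FreeBSD underneath, I'd expect something like

service nfsd restart

to pick up the new thread count without a full reboot, but I'll just toggle the NFS service in the GUI and see.)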
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
beezel, why bother if you are not able to provide the details in your post? Such a waste of time.
 

beezel

Dabbler
Joined
Nov 7, 2016
Messages
33
beezel, why bother if you are not able to provide the details in your post? Such a waste of time.
The details were provided in the linked data. Not my fault you didn't read it but still felt the need to chime in.
 

beezel

Dabbler
Joined
Nov 7, 2016
Messages
33
I suspect it will if you have a lot of random I/O.

What CPU are you running in your box? We typically use 24-48 on E5-1620 v2s.

Still in the initial testing phase, but swapping to 48 NFS servers on our pure-SSD array (4x 2TB Samsung 850 Pros with an Intel P3520 SLOG) shows almost a DOUBLING of IOPS and a halving of latency over previous tests (iSCSI included). Before, iSCSI was a marginal victor; now NFS is absolutely smashing it.

Thanks for the tip! Anything else you've got up your sleeve? Any gotchas regarding running a high server count for NFS?
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Still in the initial testing phase, but swapping to 48 NFS servers on our pure-SSD array (4x 2TB Samsung 850 Pros with an Intel P3520 SLOG) shows almost a DOUBLING of IOPS and a halving of latency over previous tests (iSCSI included). Before, iSCSI was a marginal victor; now NFS is absolutely smashing it.

Thanks for the tip! Anything else you've got up your sleeve? Any gotchas regarding running a high server count for NFS?

Yeah.. the defaults are... conservative.. :) That number is basically the number of concurrent NFS threads that process I/O. In bursty, high-IOPS environments it's really easy to get into a situation where all 4 threads are blocked, waiting for ZFS to service their requests. There's some art and science to tuning the number of concurrent threads. Given that you have 12 physical cores, I would consider 48 mildly aggressive but suitable. Tuning the number too high runs the risk of NFS using an excessive amount of CPU and RAM at the cost of other functionality.
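
If memory serves, on a recent FreeBSD base you can inspect and adjust the thread bounds at runtime, roughly:

sysctl vfs.nfsd.minthreads vfs.nfsd.maxthreads
sysctl vfs.nfsd.maxthreads=48

though on FreeNAS the supported way is the "Number of servers" field in the NFS service settings.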

You might consider a faster NVMe card, depending on your budget, etc. Unless I'm looking at the wrong datasheet, Intel only shows about 26k random-write IOPS for that card. Your SLOG will only ever hold a couple of gigs at most, so bandwidth isn't much of a concern, but it will use every IOP you can give it.
Since your volume is all SSDs, and given the IOPS rating of that card, you might consider removing the SLOG completely and testing, just to see if it is holding you back at all.
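
If the log vdev name in "zpool status" is, say, nvd0 (just an example), the test is non-destructive:

zpool remove yourpool nvd0
zpool add yourpool log nvd0

Remove it, run the benchmark, then add it back.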

One other thing you might want to look at: ZFS has some throttles to protect read performance on the cache device. Your L2ARC might go faster if you play with vfs.zfs.l2arc_write_max, but the consequence is that the overall write loading may crush your read performance.
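
That one is a plain sysctl (the value is bytes written to L2ARC per feed interval; 64M here is only an illustration):

sysctl vfs.zfs.l2arc_write_max=67108864

Set it as a tunable in the GUI if you want it to survive a reboot.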

You can go down the NIC and network-stack tuning rabbit hole if you want to, but I don't think it'll have any substantial effect.

Our standard ESX datastore is a RAIDZ2 of seven 512GB Samsung 850 Pros. We see about 700-800 MB/s on and off the disks to our ESXi machines over NFS. I admit I've not specifically measured latency, but it's never been raised as an issue.

edit: (lol, the link suggester thingy shows 850 EVOs.. we use the Pro. :p)
edit2: added the NFS tuning comment.
 
Last edited:

beezel

Dabbler
Joined
Nov 7, 2016
Messages
33
Yeah.. the defaults are... conservative.. :) That number is basically the number of concurrent NFS threads that process I/O. In bursty, high-IOPS environments it's really easy to get into a situation where all 4 threads are blocked, waiting for ZFS to service their requests. There's some art and science to tuning the number of concurrent threads. Given that you have 12 physical cores, I would consider 48 mildly aggressive but suitable. Tuning the number too high runs the risk of NFS using an excessive amount of CPU and RAM at the cost of other functionality.

You might consider a faster NVMe card, depending on your budget, etc. Unless I'm looking at the wrong datasheet, Intel only shows about 26k random-write IOPS for that card. Your SLOG will only ever hold a couple of gigs at most, so bandwidth isn't much of a concern, but it will use every IOP you can give it.
Since your volume is all SSDs, and given the IOPS rating of that card, you might consider removing the SLOG completely and testing, just to see if it is holding you back at all.

One other thing you might want to look at: ZFS has some throttles to protect read performance on the cache device. Your L2ARC might go faster if you play with vfs.zfs.l2arc_write_max, but the consequence is that the overall write loading may crush your read performance.

You can go down the NIC and network-stack tuning rabbit hole if you want to, but I don't think it'll have any substantial effect.

Our standard ESX datastore is a RAIDZ2 of seven 512GB Samsung 850 Pros. We see about 700-800 MB/s on and off the disks to our ESXi machines over NFS. I admit I've not specifically measured latency, but it's never been raised as an issue.

edit: (lol, the link suggester thingy shows 850 EVOs.. we use the Pro. :p)
edit2: added the NFS tuning comment.

Unfortunately a supercap NVMe with higher IOPS is out of our budget; I was just happy to get such a high-endurance card with a supercap in it. I did test the 850s without the SLOG and they were drastically slower: about a quarter of the speed and IOPS.

This is my first NFS rodeo, and I'm curious what your solution is for path redundancy. Previously we had multiple iSCSI connections and ESXi would round-robin them, so losing a link was not really a problem.

I know on the ESXi side I can just create a vSwitch with two adapters in active/standby, but I am unsure how to replicate that on the ZFS side for NFS. We have two switches, and ideally I'd like a path through each. Once again, with iSCSI this was trivial, but I can't seem to find a straightforward way to do it with NFS. Maybe I'm missing something obvious?
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
I know on the ESXi side I can just create a vSwitch with two adapters in active/standby, but I am unsure how to replicate that on the ZFS side for NFS. We have two switches, and ideally I'd like a path through each. Once again, with iSCSI this was trivial, but I can't seem to find a straightforward way to do it with NFS. Maybe I'm missing something obvious?

Several techniques can accomplish this. If you have vSphere Distributed Switches you can configure LACP; if you have standard vSwitches you can use static EtherChannel. Keep in mind that to support either of these methods across multiple switches, they'll need to be stackable (or use whatever method your switch vendor supports to create one managed entity). This, in combination with RSTP and ESXi host switch ports configured with PortFast, would provide aggregation and path redundancy. Not as simple as MPIO with iSCSI, but not hard to do, and there are tons of guides on the web about it. Another thing to think about if you do decide to use NFS: with FreeNAS, NFS doesn't support hardware acceleration (VAAI), whereas iSCSI does (all features if using a device-based zvol extent). You can also use active/standby uplinks on ESXi and FreeNAS if your network doesn't support the above setup or isn't complex enough to require the additional config.
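
On the FreeNAS side the aggregate is created under Network > Link Aggregations; under the hood it's a standard FreeBSD lagg, roughly equivalent to (interface names are examples):

ifconfig lagg0 create
ifconfig lagg0 up laggproto lacp laggport ix0 laggport ix1

Use the GUI rather than the shell so the config persists across reboots.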
 

beezel

Dabbler
Joined
Nov 7, 2016
Messages
33
Several techniques can accomplish this. If you have vSphere Distributed Switches you can configure LACP; if you have standard vSwitches you can use static EtherChannel. Keep in mind that to support either of these methods across multiple switches, they'll need to be stackable (or use whatever method your switch vendor supports to create one managed entity). This, in combination with RSTP and ESXi host switch ports configured with PortFast, would provide aggregation and path redundancy. Not as simple as MPIO with iSCSI, but not hard to do, and there are tons of guides on the web about it. Another thing to think about if you do decide to use NFS: with FreeNAS, NFS doesn't support hardware acceleration (VAAI), whereas iSCSI does (all features if using a device-based zvol extent). You can also use active/standby uplinks on ESXi and FreeNAS if your network doesn't support the above setup or isn't complex enough to require the additional config.

Awesome info, thanks. I think our physical switches are going to be the main problem: we went cheap and got Quanta LB6Ms, which do not appear to be stackable. This wasn't an issue when we designed around iSCSI, but now it's looking to be a problem.

I would be fine with a simple active/passive failover scenario. Any suggestions on where to look or what to Google, term-wise? I understand how to do that on the ESXi side, but not on FreeNAS. Would that just be creating a virtual interface, basically?
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Unfortunately a supercap NVMe with higher IOPS is out of our budget; I was just happy to get such a high-endurance card with a supercap in it. I did test the 850s without the SLOG and they were drastically slower: about a quarter of the speed and IOPS.

This is my first NFS rodeo, and I'm curious what your solution is for path redundancy. Previously we had multiple iSCSI connections and ESXi would round-robin them, so losing a link was not really a problem.

I know on the ESXi side I can just create a vSwitch with two adapters in active/standby, but I am unsure how to replicate that on the ZFS side for NFS. We have two switches, and ideally I'd like a path through each. Once again, with iSCSI this was trivial, but I can't seem to find a straightforward way to do it with NFS. Maybe I'm missing something obvious?

Yeah. All the really cool hardware is too expensive. VAAI on NetApp is shiny, but we muddle along without it in our distributed pods.

Our switches support multilink trunks (vPC in Cisco-ese), so even though the two switches are managed independently, we create an LACP link aggregate on FreeNAS and that provides two paths into the network. On the ESX side it basically looks like what @bigphil described above: ESX distributed switch to LACP to a vPC on the switches.

If you can't do LACP, you can select Failover when you create the link aggregate, which should produce the effect you want: if one switch fails, the NIC connected to the other switch should take over.
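
In FreeBSD lagg terms that's just the failover protocol instead of lacp, something like (interface names are examples):

ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport ix0 laggport ix1

The first laggport is the active link; the second only carries traffic if the first loses link.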
 

Ekhaskel

Cadet
Joined
Mar 28, 2017
Messages
6
It was very interesting to see the results of your real tests.
I read a lot about the fantastic speeds of various systems, but I have never been able to get anything like that in my own tests. Your results are also far from the advertised numbers and close to what I get.

I would like to draw your attention to the fact that, when setting up round-robin on two paths to an iSCSI target, adding

esxcli storage nmp psp roundrobin deviceconfig set -d <disk ID> -t iops -I 1

gives a noticeable performance boost, according to my own tests.
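
You can confirm the setting took effect with

esxcli storage nmp device list -d <disk ID>

which should show the iops policy with a limit of 1 in the Path Selection Policy Device Config line.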

Could you please clarify the parameters

ix = rxcsum txcsum tso4 lro

Where exactly, and how, do you configure them? Does this need to be configured on the ESXi side as well, or only on the FreeNAS side?

In conclusion, I want to note that I have been trying for a long time, without much success, to increase the performance of iSCSI.
I specifically test not on ZFS volumes but on hardware-RAID disks, and compare the speed obtained directly from the disk with the speed through iSCSI.
Unfortunately, in all my tests iSCSI degrades the speed by about 40-60% compared to the drive directly.
 