40GbE Performance

Elliott

Dabbler
Joined
Sep 13, 2019
Messages
40
I'm testing a FreeNAS machine with 40GbE, and right out of the box I have great performance with iperf. With multiple threads I get a steady 39.5Gbps.
Unfortunately, a single SMB client maxes out a CPU core at around 10Gbps. The CPU is a Xeon Gold 5127 @ 3GHz. This is probably the best we can do, but it will scale with multiple clients.
NFSv4 throughput is quite bouncy, averaging 12Gbps and jumping up to 20Gbps at times on one client. I'm hoping I can get more out of this with some tweaking. How can I increase the NFS rsize and wsize? When I try setting them larger on the client, they revert to 128K.
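For reference, this is roughly what I've been trying on the Linux client (hostname and paths are just placeholders), and the negotiated sizes still come back as 128K:

# Ask for 1MB transfer sizes at mount time; the server can negotiate these down
mount -t nfs -o vers=4,rsize=1048576,wsize=1048576 freenas:/mnt/tank/share /mnt/share
# Check what was actually negotiated
nfsstat -m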
I set NFS to run 32 servers to match vCPU count; should I try higher? For some reason htop only shows 2 nfsd processes, but all 32 cores are bouncing on the graphs. Not sure why. What is the FreeBSD equivalent to /proc/net/rpc/nfsd to see performance statistics? Ideally it would be really cool to put a graph of this in the web GUI.
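For anyone else searching later: the closest thing I've found to /proc/net/rpc/nfsd on FreeBSD seems to be nfsstat, e.g.:

# Server-side counters (the -e output includes the newer NFSv4 server)
nfsstat -s -e
# Running summary of client and server activity every second during a test
nfsstat -w 1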

With the NFS server set to async, I can get 15Gbps read and 19Gbps write. I don't want to do this in production, though. I benchmarked the pool locally at 2500MB/s (20Gbps) read and write to the spindles, not "cheating" with the ARC. So why does NFS sync slow it down? AIUI, setting NFS to async with ZFS sync=standard means writes bypass the ZIL. I have a SLOG on an NVMe mirror; maybe I should remove it if it's actually slower than the pool...
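For what it's worth, this is how I've been sanity-checking the sync behavior during the NFS write tests (pool and dataset names here are just mine):

# Confirm what the dataset does with sync requests (standard/always/disabled)
zfs get sync tank/nfs
# Watch per-device I/O, including the NVMe log mirror, while a client writes
zpool iostat -v tank 1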
 

Elliott

Dabbler
Joined
Sep 13, 2019
Messages
40
Not yet. I had to put this project on hold for a little while due to other priorities. I was hoping some FreeNAS gurus could help answer these questions.
 

radambe

Dabbler
Joined
Nov 23, 2014
Messages
10
Does anyone out there have any suggestions, beyond the basic sysctl tunables, zpool layout, etc. that are easily found via Google, to make FreeNAS serve NFS and/or SMB over 25/40/100 GbE with performance similar to what can easily be accomplished with the same hardware running CentOS, Windows 10, or even macOS?
 
Joined
Dec 29, 2014
Messages
1,135
This is what the tunables look like in my primary FreeNAS:
[screenshot of tunables attached]

After making some major changes to the hardware, I disabled autotune, deleted all tunable settings, turned autotune back on, and rebooted. You can see where I made some manual changes as well. The dirty_data_max change was an attempt to make more use of my Optane 900P SLOG. I disabled the firewalls after I put in the 40G cards because they seemed to be throttling my iperf3 tests.

Server to server, or same server to a loopback interface, the highest I could get was brushing up against 27G throughput. I suspect it is CPU, memory speed, bus speed, or some other kind of limitation on my system; I just could never figure out what it was. That said, it does sound a bit like crocodile tears, crying about 27G across a network link. I wish I knew how to tell what the theoretical max throughput is for my system.

About the only thing left I could upgrade is the memory speed. The E5-2637 v2 @ 3.50GHz CPU is the fastest one I could find for this generation of server, and all the hyperthreading options are off since it is a bare metal box. The memory is running at 1600 MHz. I think there is DDR3 ECC memory that runs at 1866 MHz, but I don't know how much that would get me, especially for the $$$ it would consume. I think I have about maxed out this platform, but I would expect it to do the job for me for several more years at least.
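In case anyone wants to poke at the same thing, the dirty_data_max entry is just a sysctl-type tunable; from the shell it would be something along these lines (the value here is only an example, not a recommendation, and it is capped by vfs.zfs.dirty_data_max_max):

# Let ZFS accumulate more dirty data before forcing a transaction group out,
# which was my attempt to keep the Optane 900P SLOG busier with larger bursts
sysctl vfs.zfs.dirty_data_max=12884901888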
 

radambe

Dabbler
Joined
Nov 23, 2014
Messages
10
Server to server, or same server to a loopback interface, the highest I could get was brushing up against 27G throughput. I suspect it is CPU, memory speed, bus speed, or some other kind of limitation on my system. I just could never figure out what it was.

Thanks Elliott so much for posting this.

Just to clarify, are you saying that you are still seeing a 27Gbps iperf3 throughput ceiling? Or are you saying that you disabled firewalls and this 27G limit went away?

What firewalls are you referring to, and how does one go about disabling them? (Does FreeNAS have some sort of built-in firewall that's enabled by default and potentially creating a bottleneck for Ethernet connections?)

If you are in fact saying that you are still seeing a 27Gbps limit in your iperf3 test results, I would find this extremely interesting. This intrigues me because I recently built and deployed our studio's first FreeNAS system and, after lots of wasted time and frustration, found that the "highly recommended" Chelsio NICs I was using were at the very least a MAJOR cause of the very similar performance limits I was running into!

After weeks of bashing my head against this FreeNAS server, trying no fewer than 6 different operating systems on both client and server, trying multiple replacement cables, and even replacing the NICs several times, I found that the Chelsio 62100 dual-port 100GbE NICs I was using were simply downright INCAPABLE of doing anything above almost exactly 27Gbps!!! And that was a best-case scenario. Most of the time, I never saw much more than 12-13Gbps through those 62100s using iperf or iperf3. Actual file transfer performance was even worse. Like significantly shittier than cheap 10GbE.

As soon as I pulled all the Chelsio NICs, physically removed them from the building just to be safe, and replaced them with Mellanox 40GbE NICs, I immediately started getting 39.5Gbps iperf and iperf3 test results between the server and my testbench client. These are very cheap Mellanox ConnectX-3 VPIs (dual-port QSFP) that I picked up on eBay for ~$35 each, just to have something to test against the Chelsios that had given me so much grief.

My server specs are very similar to yours. I'd be very curious to know what would happen if you pulled the T580s and replaced them with something equivalent from Mellanox, like these ConnectX-3s. I went through four of the Chelsio 62100 cards and they all acted the same. PM me if you'd like - perhaps I could let you borrow a pair of each of these and see if you find the same results.

My server is built on a Supermicro X10DRi-T4+ with dual E5 v3 6-core 2.4GHz Xeons and 128GB of 2400MHz RAM that the mobo seems to run at 1867MHz for some reason. This system lives in a matching Supermicro 36-bay chassis with 24x Seagate Enterprise 6TB SATA spinners (3x 8-drive RAIDZ2 vdevs) + 10x Samsung EVO 850 1TB SATA SSDs as the L2ARC. The testbench client is a similar-spec X9-based system with 8-core v2 Xeons and 128GB DDR3 (tested with Windows 10 for Workstations, CentOS 7.x, FreeBSD 11, FreeBSD 12, Ubuntu, the list goes on).
 
Joined
Dec 29, 2014
Messages
1,135
When I did the test that gave me 27G throughput, there were no firewalls traversed. Both FreeNAS boxes are on the same IP network, connected to the same Nexus switch. I had great luck with the 10G Chelsio T520 cards, but perhaps the 40G T580 ones are not as good.
 
Joined
Dec 29, 2014
Messages
1,135
are you saying that you disabled firewalls
Sorry, missed this in the earlier reply. I used a sysctl setting to turn off the FreeNAS/FreeBSD firewall. I don't remember the setting, and I am traveling at the moment.
 

radambe

Dabbler
Joined
Nov 23, 2014
Messages
10
When I did the test that gave me 27G throughput, there were no firewalls traversed. Both FreeNAS boxes are on the same IP network, connected to the same Nexus switch. I had great luck with the 10G Chelsio T520 cards, but perhaps the 40G T580 ones are not as good.

And again, yet another commonality with my experience. I too have had fantastic luck with Chelsio T520-BT 10GbE NICs, and across several operating systems. My experience has been the exact opposite with the Chelsio 62100 100GbE NICs.
 
Joined
Dec 29, 2014
Messages
1,135
I wonder if there are any other FreeNAS users with 40G or above out there. I definitely found that setting net.inet.tcp.blackhole=2 and net.inet.udp.blackhole=1 helped, but it wasn't a huge jolt. The Intel 10G cards seem to be better supported now; before, it seemed it was Chelsio all the way or you were kind of on the edge. I may also try changing to jumbo frames, but just on the storage NIC.

My T580s are in PCIe 3.0 slots (I forget if they are x8 or x16), so I am really not sure what else I could do to boost performance. My feeling is that some part of my hardware (possibly outside of the NIC) is the limiting factor. I upgraded the CPU in my secondary FreeNAS to match the dual E5-2637 v2 @ 3.50GHz ones I have in the primary. I did get a little more out of it then, but it still appeared to have the same ceiling as the primary. I just got back from vacation, so I haven't put too much time into it. I am certainly willing to replace/upgrade some stuff, but I don't want to waste money doing it.
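For anyone who wants to try the blackhole settings, they are plain sysctls; I added them as sysctl-type tunables in the GUI so they survive a reboot, but from the shell they are just:

# Silently drop TCP segments and UDP datagrams aimed at ports with no listener
sysctl net.inet.tcp.blackhole=2
sysctl net.inet.udp.blackhole=1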

Edit: Here are the specs on both from my sig.
Primary:
Cisco C240 M3S - 24 x 2.5" SAS/SATA drive bays
Dual E5-2637 v2 @ 3.50GHz
256 GB ECC DRAM
4 x Intel i350 Gigabit NIC for management (using 2 ports in LACP channel)
Chelsio T580-CR dual port 40 Gigabit NIC for storage network (using 1 port)
LSI 9207-8I controller
System pool = single vdev of mirrored 300G 10k SAS drives
Storage pool = 2 RAIDZ2 vdevs of 8 x 1TB 7.2k SATA drives with one spare drive
Intel Optane 900P SLOG
LSI 9207-8E for future external expansion
Boots from internal SD cards in HW RAID config

Secondary:
Cisco C240 M3S - 24 x 2.5" SAS/SATA drive bays
Dual E5-2637 v2 @ 3.50GHz
128 GB ECC DRAM
4 x Intel i350 Gigabit NIC for management (using 2 ports in LACP channel)
Chelsio T580-CR dual port 40 Gigabit NIC for storage network (using 1 port)
LSI 9207-8I controller
System pool (internal) = single vdev of mirrored 300G 10k SAS drives
LSI 9207-8E connected to HP D2700 external enclosure - 25 x 2.5" SAS/SATA drive bays
Storage pool (D2700) = 4 RAIDZ2 vdevs of 6 x 300GB 10k SAS drives with one spare drive
Intel Optane 900P SLOG (shared device, separate partition per pool)
Storage pool 2 (D2600) = 2 RAIDZ2 vdevs of 6 x 2TB 7.2k SAS drives with one spare drive (internal)
Intel Optane 900P SLOG (shared device, separate partition per pool)
Boots from internal SD cards in HW RAID config
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
I do have a bunch of 62100s but have not tested them in FreeNAS, to be honest; my 100G tests have been pushed back due to the realization that my pool is not yet capable of pushing anything close to that at the desired QD/threads.

Could you share your drive type/pool setup/local performance values?
I have been focusing on mirrors, but scaling has been very bad for me (https://www.ixsystems.com/community/threads/pool-performance-scaling-at-1j-qd1.80417/), so getting a set of known-to-work configs would give me a valuable secondary data point.
I can then mimic your setup to see if I have a hardware issue, and if not, I am happy to provide you with a secondary/tertiary data point on your target setup :)

I also have some MLX 100G/FDR cards if you want me to run tests with those :)
 

radambe

Dabbler
Joined
Nov 23, 2014
Messages
10
I would be very curious to see if you run into the same performance limitations with the Chelsio 62100s that I have seen. Again, simply swapping the Chelsio 62100 out for an old Mellanox ConnectX-3 VPI completely erased the bandwidth limit. I was never able to get iperf test results above maybe 30Gbps maximum using 62100s directly connected between two Supermicro dual E5 Xeon machines. I tested at least 5 or 6 different operating systems in addition to the latest release of FreeNAS.

I will post my server config to this thread shortly. I cannot currently re-install the 62100 since this server is now in production running the old Mellanox card for its 40G connection (and an Intel X540 for 10G), but I can run some fresh iperf and iperf3 tests and post the results. If anyone has a specific command or iperf test with arguments that they'd like me to run, just go ahead and post the command.
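Otherwise, my plan is to run something like the following between the two boxes (the hostname is a placeholder): a single stream, then several parallel streams, in both directions.

iperf3 -s                            # on the FreeNAS server
iperf3 -c freenas -t 30              # single stream from the client
iperf3 -c freenas -t 30 -P 8         # 8 parallel streams
iperf3 -c freenas -t 30 -P 8 -R      # reverse direction (server transmits)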

I currently have a Supermicro 7047GR-TRF GPU workstation running CentOS 7.6 directly connected to the FreeNAS server's Mellanox 40G card. The workstation is using the same Mellanox card. They are directly connected via an active optical QSFP+ cable.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
I see you run a "36-bay chassis with 24x Seagate Enterprise 6TB SATA spinners (3x 8-drive RAIDZ2 vdevs)".

What kind of workload are you running to test this? Streaming reads? Please provide blocksize/recordsize & QD/threads so I can mimic it.
What performance level did you reach with the CX3 vs the Chelsios (vs 10GbE)?
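For reference, my local runs have been along these lines with fio (the directory is just an example; bs/iodepth/numjobs get varied per test):

# Single job, queue depth 1, sequential reads at 1M blocks for 60 seconds
fio --name=seqread --directory=/mnt/tank/fio --rw=read --bs=1M --size=10G \
    --numjobs=1 --iodepth=1 --ioengine=posixaio --runtime=60 --time_based \
    --group_reporting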
 

ronclark

Dabbler
Joined
Dec 5, 2017
Messages
40
This is what the tunables look like in my primary FreeNAS ... After making some major changes to the hardware, I disabled autotune, deleted all tunable settings, turned autotune back on, and rebooted.

I am confused; everything I have read says don't use autotune unless your system is low on system resources.

Second thing: from Googling around, all net.inet.tcp.blackhole=2 and net.inet.udp.blackhole=1 do is stop port scanning, so how does that help network transfers?

DESCRIPTION
The blackhole sysctl(8) MIB is used to control system behaviour when connection requests are received on SCTP, TCP, or UDP ports where there is no socket listening.

The blackhole behaviour is useful to slow down an attacker who is port-scanning a system in an attempt to detect vulnerable services. It might also slow down an attempted denial of service attack.

https://www.freebsd.org/cgi/man.cgi?query=blackhole
 
Joined
Dec 29, 2014
Messages
1,135
Second thing: from Googling around, all net.inet.tcp.blackhole=2 and net.inet.udp.blackhole=1 do is stop port scanning, so how does that help network transfers?
I am not entirely sure why, but the tunables I mentioned helped my iperf numbers. Only my FreeNAS boxes have 40G cards; the ESXi servers all have 10G cards. The 40G cards were a "because I can" nerd kind of thing. I got a good deal on eBay for some used Nexus switches that had more 10G ports (and a more familiar interface) than the old HP (really re-badged H3C) switches.
I am confused; everything I have read says don't use autotune unless your system is low on system resources.
Perhaps. FreeNAS meets most of my needs without me having to get too crazy on the tuning. I am sure a lot of that is helped by throwing a bunch of RAM at it. I have tuned a few parameters to try and get more use out of the Optane 900P SLOG devices, but that was about it.
Could you share your drive type/pool setup/local performance values?
My primary FreeNAS has a primary pool made up of 2 RAIDZ2 vdevs. Each vdev has 8 x ST91000640NS 7,200rpm SATA drives. I share that through NFS to my ESXi hosts, and sometimes to the secondary FreeNAS when I am backing up files. I use NFS because I am more comfortable with it as a cranky old Unix guy, plus I like being able to move things around within the file system as opposed to the big block files for iSCSI. NFS write performance from ESXi was a problem until I added the Optane SLOG. Now I can get pretty sustained 4.2Gb writes from the ESXi boxes, which meets my needs.
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
So I was just *WHAAAT* when I read 4.2G and wondered how the ... you get >4GB/s with a few spinners.
Then I realized that you must mean 4.2Gbit, which is some 300-350MB/s, which is more in line with what I'd expect :)

That of course means that running 40G is pure fun (as you mentioned) :)
I am happy to run a few iperf tests with different cards to see what I can get, but due to the low actual pool performance this is not a priority for me at this point. (I originally looked into 100G to scale up single-thread performance; on 40G cards this is potentially limited by the architecture [40G being kind of a HW LAG of 4x 10G connections], while on 100G it is 4x 25G.)
 

ronclark

Dabbler
Joined
Dec 5, 2017
Messages
40
I am not entirely sure why, but the tunables I mentioned helped my iperf numbers. Only my FreeNAS boxes have 40G cards; the ESXi servers all have 10G cards. The 40G cards were a "because I can" nerd kind of thing. I got a good deal on eBay for some used Nexus switches that had more 10G ports (and a more familiar interface) than the old HP (really re-badged H3C) switches.

I have 40G cards too, no 10G cards, just because they were a steal on eBay. I have a point-to-point setup, but I just picked up an ICX 6610 with 2x 40G ports, 16x 10G ports, and 48x 1G PoE ports, so I'll be redoing my network setup. So I am always looking for tweaks to speed things up. The switch is complete overkill, but a steal of a deal.

So I'll have my FreeNAS on one 40G port, one VM box on the other 40G port, my workstation on a 10G port (so QSFP+ to SFP+), and the rest of the network on the 1G ports. I can't wait to see how this will all come together.

I need to figure out VLANs; this switch supports mixed frame sizes across VLANs. So my thought is to maybe set up a storage VLAN with jumbo frames, I'm just not sure how I'll serve the rest of my network on the same 40G port.
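From what I've read, the FreeNAS side of a tagged storage VLAN with jumbo frames would look roughly like this from the shell (interface name, VLAN ID, and addressing are placeholders for my setup; normally you'd configure it through the GUI so it persists):

# The parent 40G interface has to carry the larger MTU as well
ifconfig mlxen0 mtu 9000
# Tagged VLAN interface dedicated to storage traffic
ifconfig vlan10 create
ifconfig vlan10 vlan 10 vlandev mlxen0 mtu 9000
ifconfig vlan10 inet 10.10.10.2/24 up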
 
Joined
Dec 29, 2014
Messages
1,135
Then I realized that you must mean 4.2Gbit, which is some 300-350MB/s, which is more in line with what I'd expect
I didn't specify small b versus big B because it seemed implied. :smile:
I am happy to run a few iperf tests with different cards to see what I can get; but due to the low actual pool performance this is not a priority for me at this point
Funny thing, people frequently assume that the network is the slowest part of the chain, but that isn't always the case. You are only as fast as your slowest component, and drives, CPU, RAM, and even the file system can start to play into that. The biggest thing for me was the SLOG. I was barely able to get 1Gb (note the small b) write speed prior to that because of ESXi NFS sync writes. The Optane NVMe card made a huge improvement, and I am pretty happy with that. Every once in a while I consider trying to wring more out of it. I am a little disappointed that I didn't see much improvement with the 40G connections, but that may be a limitation of my hardware, which is now 2 generations back.

I haven't played with jumbo frames on that, but I don't know how much difference that would make for me. I have a negative bias towards jumbo frames based on work experience and people not understanding that the entire path has to support the jumbo frames. It wouldn't be an issue in my network since I am just pushing that through a single switch, but I still have that baggage.
40G being kind of a HW LAG of 4 10G connections
Yes, I had to change some settings in my switch because its default was to use the 40G as 4x10G. One of the reasons I think there is something at play with the hardware is that I get roughly the same numbers having the FreeNAS connect to itself via the loopback interface. Once I saw that, I stopped, as it seemed like that was some kind of hardware wall I was running into. I haven't been able to figure out exactly what that is.
I need to figure out VLANs; this switch supports mixed frame sizes across VLANs. So my thought is to maybe set up a storage VLAN with jumbo frames, I'm just not sure how I'll serve the rest of my network on the same 40G port.
I would say a storage VLAN is a good thing. You can certainly drive that off a single NIC using VLANs. On my VM hosts I use a separate NIC for storage, a separate one for vMotion, and then a LAG of 1G NICs for the VM guests. All entirely overkill, but I enjoy the tinkering.
 

ronclark

Dabbler
Joined
Dec 5, 2017
Messages
40
I have a negative bias towards jumbo frames based on work experience and people not understanding that the entire path has to support the jumbo frames. It wouldn't be an issue in my network since I am just pushing that through a single switch, but I still have that baggage.

I don't like jumbo frames myself, but since I was doing point-to-point I tried it, and it did make a difference in throughput.



I would say a storage VLAN is a good thing. You can certainly drive that off a single NIC using VLANs. On my VM hosts I use a separate NIC for storage, a separate one for vMotion, and then a LAG of 1G NICs for the VM guests. All entirely overkill, but I enjoy the tinkering.

More stuff to tinker with, as long as I don't break my whole network for too long. From my reading, my new switch can do line-speed VLANs, so no slowdowns there.
I need to learn how to LAG; I have had such bad luck with that. It works, then it stops and becomes a huge pain, so I gave up, which is sad since my R710s have 4x 1G ports. With the new switch I can run both ports on my 40G cards if I want to, since all three of the cards are dual-port.
 
Joined
Dec 29, 2014
Messages
1,135
I need to learn how to LAG; I have had such bad luck with that. It works, then it stops and becomes a huge pain, so I gave up
I have done LAG from FreeNAS to HP 2800 switches, H3C S5800 switches, and various flavors of Cisco IOS and Nexus switches without an issue. I always use LACP, so perhaps that is part of it. I used to be in the forced EtherChannel/LAG camp, but I am no longer. Just remember that it is load balancing, NOT bonding. That means that any one conversation can't get any more bandwidth than that of a single physical link. That is good if you are trying to service multiple clients, but it doesn't help if you are trying to max out the speed of a backup or replication type job.
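On the FreeNAS/FreeBSD side, the LACP setup amounts to something like this from the shell (igb0/igb1 and the address are just examples; the GUI's Link Aggregation page does the same thing, and the switch ports have to be in an LACP channel too):

# Build an LACP lagg from two gigabit ports
ifconfig lagg0 create
ifconfig lagg0 laggproto lacp laggport igb0 laggport igb1
ifconfig lagg0 inet 192.168.1.10/24 up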
 