High-Speed Windows Client Performance Tuning


There's a trend I've noticed in this community over the past few years, among both home users and business users: tuning for single-client performance is becoming more important to folks in general. Video production houses are number one on that list, with HomeLab nuts like myself coming in a close second.

Connecting over Ethernet to a storage pool built from multiple NVMe drives comes with many performance considerations, variables, and nuances. This is a common theme across the industry, transcending OS and sharing-protocol boundaries, with each combination having its own fiddly optimizations.

We can learn a thing or two from the successes and failures of LinusTechTips (amongst other YouTubers, shoutout to LevelOneTechs) over the past few years. This is kind of ironic, because the only breadcrumbs I can find on this topic are vlog-style videos of them solving their own IT problems.
Here's a few:
Lots of good hardware and software information here.

Two years later, lessons learned and better hardware resulted in this video, which features TrueNAS SCALE:


Enter my testing today:
I wanted to focus on the client side of this conversation, and to shift the narrative a bit to encompass the problem as a whole. When talking about NAS, there are a lot of variables in play in the external environment. Before attaching storage from an external system, we should be confident in the performance of the layers in between.

Besides, there is plenty of information to be had about server-side network tuning already:

Some official iXsystems information can be found here:

That being said, I set up the following test platforms:
System 1:
AMD Ryzen 7900X3D
64 GB DDR5 6000
Mellanox ConnectX-4 100 Gb (Windows Reports PCI-E Gen3, 16x, in PCIEX16_2)
Win 11

System 2:
AMD Ryzen 3700X
32 GB DDR4 3200
Intel XL710BM2 40Gb (Windows Reports PCI-E Gen3, 4x)
Win 11

Let's consider the hardware configuration for a moment. In particular, PCIe bus topology MATTERS here.
In the case of System 1, the motherboard's specifications list the following:
[Image: motherboard expansion slot specifications]


But it's not quite that simple. Ryzen 7000 exposes a maximum of 24 usable PCIe 5.0 lanes from the CPU, which means the motherboard has to shuffle the deck quite a bit for all of this connectivity to work.

ASUS does not share a good block diagram for this board, but based on the generic X670 block diagram we can infer that both x16 slots go directly to the CPU. It seems, then, that I should theoretically have full PCIe Gen3 x16 bandwidth available to the card in that system.
[Image: AMD X670 chipset block diagram]



System 2 is even worse off. Despite Windows reporting PCIe Gen3, the slot is actually connected through the chipset. So while there is roughly 32 Gb/s of potential bandwidth here (Gen3 x4), it is shared with other devices hanging off the chipset, and we are going to see even less than that.
[Image: Windows-reported PCIe link details for the System 2 NIC]


[Image: System 2 motherboard/chipset block diagram]
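As an aside, you do not have to rely on block diagrams and the motherboard manual alone: Windows can report the PCIe link each NIC actually negotiated. A quick check from PowerShell; the adapter name used below is just an example, and the bare cmdlet lists every adapter:

Code:
# The default output of this cmdlet includes the negotiated PCIe link speed and
# width per adapter -- a quick sanity check against what the manual claims
Get-NetAdapterHardwareInfo

# Dump every hardware property for a single adapter (adapter name is an example)
Get-NetAdapterHardwareInfo -Name "Ethernet" | Format-List *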



Networking

Layer 1:

We have here a simple point-to-point (PtP) connection between two Windows hosts. I re-ran my testing with a few different optics to ensure consistency; despite some minor run-to-run variance, very little difference was observed between them.

Under the hood, a 40 Gigabit connection is essentially four 10 Gigabit lanes bonded together in a single cable (see IEEE 802.3ba):
https://jira.slac.stanford.edu/secure/attachment/23373/8023ba-2010.pdf
[Image: excerpt from IEEE 802.3ba showing the 4 x 10 Gb/s lane structure]



Layer 3:
Basic configuration:
[Images: static IP configuration on both hosts]
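For anyone scripting this instead of clicking through the adapter GUI, the same point-to-point addressing can be applied with PowerShell. A minimal sketch for one side of the link; the interface alias and the /24 prefix length are assumptions, and the mirror of this (10.10.11.10) goes on the other host:

Code:
# Static address for the point-to-point link on this host; run the equivalent
# on the other machine with 10.10.11.10. No gateway or DNS is required for a
# direct host-to-host connection.
New-NetIPAddress -InterfaceAlias "Ethernet" -IPAddress 10.10.11.9 -PrefixLength 24

# Confirm the address took
Get-NetIPAddress -InterfaceAlias "Ethernet" -AddressFamily IPv4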


Initial iPerf, single threaded:
Code:
C:\Users\nickf.FUSCO\Downloads\iperf-3.1.3-win64\iperf-3.1.3-win64>iperf3 -c 10.10.11.9
Connecting to host 10.10.11.9, port 5201
[  4] local 10.10.11.10 port 42726 connected to 10.10.11.9 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   816 MBytes  6.85 Gbits/sec
[  4]   1.00-2.00   sec   886 MBytes  7.44 Gbits/sec
[  4]   2.00-3.00   sec   865 MBytes  7.25 Gbits/sec
[  4]   3.00-4.00   sec   867 MBytes  7.27 Gbits/sec
[  4]   4.00-5.00   sec   876 MBytes  7.35 Gbits/sec
[  4]   5.00-6.00   sec   900 MBytes  7.55 Gbits/sec
[  4]   6.00-7.00   sec   888 MBytes  7.45 Gbits/sec
[  4]   7.00-8.00   sec   916 MBytes  7.68 Gbits/sec
[  4]   8.00-9.00   sec   873 MBytes  7.32 Gbits/sec
[  4]   9.00-10.00  sec   882 MBytes  7.40 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  8.56 GBytes  7.36 Gbits/sec                  sender
[  4]   0.00-10.00  sec  8.56 GBytes  7.36 Gbits/sec                  receiver


The single-threaded nature of this iPerf test results in what may sound like lackluster performance. In reality, 40 and 100 Gigabit were designed to scale out to a large number of clients, whereas 10 and 25 Gigabit have simpler PHY designs that may be easier to tune for single-client performance.

But when running iPerf with 4 parallel streams, I didn't see much of an improvement:
Code:
C:\Users\nickf.FUSCO\Downloads\iperf-3.1.3-win64\iperf-3.1.3-win64>iperf3 -c 10.10.11.10 -P 4
Connecting to host 10.10.11.10, port 5201
[  4] local 10.10.11.9 port 52520 connected to 10.10.11.10 port 5201
[  6] local 10.10.11.9 port 52521 connected to 10.10.11.10 port 5201
[  8] local 10.10.11.9 port 52522 connected to 10.10.11.10 port 5201
[ 10] local 10.10.11.9 port 52523 connected to 10.10.11.10 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   318 MBytes  2.67 Gbits/sec
[  6]   0.00-1.00   sec   315 MBytes  2.65 Gbits/sec
[  8]   0.00-1.00   sec   313 MBytes  2.63 Gbits/sec
[ 10]   0.00-1.00   sec   307 MBytes  2.58 Gbits/sec
[SUM]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec   354 MBytes  2.97 Gbits/sec
[  6]   1.00-2.00   sec   351 MBytes  2.95 Gbits/sec
[  8]   1.00-2.00   sec   348 MBytes  2.92 Gbits/sec
[ 10]   1.00-2.00   sec   346 MBytes  2.90 Gbits/sec
[SUM]   1.00-2.00   sec  1.37 GBytes  11.7 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec   334 MBytes  2.80 Gbits/sec
[  6]   2.00-3.00   sec   332 MBytes  2.78 Gbits/sec
[  8]   2.00-3.00   sec   329 MBytes  2.76 Gbits/sec
[ 10]   2.00-3.00   sec   327 MBytes  2.75 Gbits/sec
[SUM]   2.00-3.00   sec  1.29 GBytes  11.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec   339 MBytes  2.84 Gbits/sec
[  6]   3.00-4.00   sec   336 MBytes  2.82 Gbits/sec
[  8]   3.00-4.00   sec   334 MBytes  2.80 Gbits/sec
[ 10]   3.00-4.00   sec   332 MBytes  2.78 Gbits/sec
[SUM]   3.00-4.00   sec  1.31 GBytes  11.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec   360 MBytes  3.02 Gbits/sec
[  6]   4.00-5.00   sec   358 MBytes  3.00 Gbits/sec
[  8]   4.00-5.00   sec   355 MBytes  2.98 Gbits/sec
[ 10]   4.00-5.00   sec   353 MBytes  2.96 Gbits/sec
[SUM]   4.00-5.00   sec  1.39 GBytes  12.0 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec   316 MBytes  2.65 Gbits/sec
[  6]   5.00-6.00   sec   313 MBytes  2.63 Gbits/sec
[  8]   5.00-6.00   sec   311 MBytes  2.61 Gbits/sec
[ 10]   5.00-6.00   sec   310 MBytes  2.60 Gbits/sec
[SUM]   5.00-6.00   sec  1.22 GBytes  10.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec   311 MBytes  2.61 Gbits/sec
[  6]   6.00-7.00   sec   308 MBytes  2.59 Gbits/sec
[  8]   6.00-7.00   sec   306 MBytes  2.56 Gbits/sec
[ 10]   6.00-7.00   sec   305 MBytes  2.56 Gbits/sec
[SUM]   6.00-7.00   sec  1.20 GBytes  10.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec   326 MBytes  2.73 Gbits/sec
[  6]   7.00-8.00   sec   324 MBytes  2.72 Gbits/sec
[  8]   7.00-8.00   sec   322 MBytes  2.70 Gbits/sec
[ 10]   7.00-8.00   sec   321 MBytes  2.69 Gbits/sec
[SUM]   7.00-8.00   sec  1.26 GBytes  10.9 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec   324 MBytes  2.72 Gbits/sec
[  6]   8.00-9.00   sec   323 MBytes  2.71 Gbits/sec
[  8]   8.00-9.00   sec   320 MBytes  2.69 Gbits/sec
[ 10]   8.00-9.00   sec   319 MBytes  2.67 Gbits/sec
[SUM]   8.00-9.00   sec  1.26 GBytes  10.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec   328 MBytes  2.75 Gbits/sec
[  6]   9.00-10.00  sec   326 MBytes  2.73 Gbits/sec
[  8]   9.00-10.00  sec   324 MBytes  2.71 Gbits/sec
[ 10]   9.00-10.00  sec   322 MBytes  2.70 Gbits/sec
[SUM]   9.00-10.00  sec  1.27 GBytes  10.9 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  3.23 GBytes  2.78 Gbits/sec                  sender
[  4]   0.00-10.00  sec  3.23 GBytes  2.78 Gbits/sec                  receiver
[  6]   0.00-10.00  sec  3.21 GBytes  2.76 Gbits/sec                  sender
[  6]   0.00-10.00  sec  3.21 GBytes  2.76 Gbits/sec                  receiver
[  8]   0.00-10.00  sec  3.19 GBytes  2.74 Gbits/sec                  sender
[  8]   0.00-10.00  sec  3.19 GBytes  2.74 Gbits/sec                  receiver
[ 10]   0.00-10.00  sec  3.17 GBytes  2.72 Gbits/sec                  sender
[ 10]   0.00-10.00  sec  3.17 GBytes  2.72 Gbits/sec                  receiver
[SUM]   0.00-10.00  sec  12.8 GBytes  11.0 Gbits/sec                  sender
[SUM]   0.00-10.00  sec  12.8 GBytes  11.0 Gbits/sec                  receiver

iperf Done.




Taking a Peek at the Ether:
[Image: Wireshark capture of the iPerf traffic]


Using Wireshark to capture data on the interface during an iPerf run, we have about 220,000 packets.
Filtering for some indicators of poor TCP performance:
Code:
tcp.analysis.out_of_order || tcp.analysis.duplicate_ack || tcp.analysis.ack_lost_segment || tcp.analysis.retransmission || tcp.analysis.lost_segment || tcp.analysis.zero_window || tcp.analysis.fast_retransmission || tcp.analysis.spurious_retransmission || tcp.analysis.keep_alive


[Image: Wireshark results for the TCP-analysis filter]


Roughly 13% of all traffic is being "wasted", so additional tweaking may yield some additional performance, but it may also be that the inherent platform bottleneck on "System 2" is causing a cascading negative effect in the TCP stack.
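If you prefer to quantify this from the command line rather than eyeballing the Wireshark GUI, tshark (bundled with Wireshark) can count the flagged packets for you. A rough sketch, assuming the capture was saved as capture.pcapng and tshark.exe is on the PATH; the filter is a trimmed version of the one above:

Code:
# Count total packets vs. packets flagged by Wireshark's TCP analysis, then
# print the percentage. The capture file name is a placeholder.
$filter = 'tcp.analysis.out_of_order || tcp.analysis.duplicate_ack || tcp.analysis.retransmission || tcp.analysis.lost_segment || tcp.analysis.zero_window || tcp.analysis.spurious_retransmission'

$total = (tshark -r .\capture.pcapng | Measure-Object -Line).Lines
$bad   = (tshark -r .\capture.pcapng -Y $filter | Measure-Object -Line).Lines

"{0} of {1} packets ({2:P1}) flagged by TCP analysis" -f $bad, $total, ($bad / $total)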

Tuning:
I've found several resources here to help me move along this process:

In my testing, in this environment, I found the best mix of performance between these two systems with the following settings. The same settings were applied to both systems, and a scripted way of applying them is sketched after the list.
  • Toying around with buffers, the best speeds I was able to achieve were with 2048 Send Buffers and 1024 Receive Buffers. Increasing the buffer sizes beyond these values decreased performance for me, but YMMV.
    [Images: Send Buffers and Receive Buffers settings]
  • Setting Jumbo Frames to 9014 did seem to help, but in doing so I also measurably increased the number of issues flagged in the TCP stack.
    [Image: Jumbo Packet setting]
  • Disabling Interrupt Moderation made the biggest performance difference.
[Image: Interrupt Moderation setting]
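As mentioned above, here is a scripted way of applying the same per-adapter settings. The adapter name and the exact display names are assumptions; they vary from driver to driver, so list what your NIC actually exposes first and adjust accordingly:

Code:
# See which advanced properties this driver exposes and how they are spelled
Get-NetAdapterAdvancedProperty -Name "Ethernet" | Format-Table DisplayName, DisplayValue

# Apply the values that worked best in my testing (display names below are
# examples; some drivers call it "Transmit Buffers" rather than "Send Buffers")
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Send Buffers"         -DisplayValue 2048
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Receive Buffers"      -DisplayValue 1024
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Jumbo Packet"         -DisplayValue 9014
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Interrupt Moderation" -DisplayValue "Disabled"

# Verify jumbo frames actually pass end-to-end: 8972 bytes of payload plus
# 28 bytes of ICMP/IP headers equals a 9000-byte packet, and -f forbids fragmentation
ping 10.10.11.10 -f -l 8972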



I found in my testing that some settings did not properly revert to their defaults when I expected them to. The only way I was able to reset the NICs back to defaults reliably was to uninstall and reinstall them in Device Manager:
[Image: uninstalling the NIC in Device Manager]
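If you hit the same problem, it is worth snapshotting the advanced properties so you can confirm whether a reset actually took. A small sketch; the adapter name is again an assumption:

Code:
# Snapshot current values so they can be compared after an attempted reset
Get-NetAdapterAdvancedProperty -Name "Ethernet" |
    Select-Object DisplayName, DisplayValue |
    Out-File .\nic-settings-before.txt

# The scripted equivalent of "restore defaults" -- but verify the result, since
# as noted above the only thing that reliably worked for me was the
# uninstall/reinstall in Device Manager
Reset-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "*"

Back to the results. With the tuned settings from the list above applied, the same single-stream iPerf run now looks like this: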


Code:
C:\Users\nickf.FUSCO\Downloads\iperf-3.1.3-win64\iperf-3.1.3-win64>iperf3 -c 10.10.11.10
Connecting to host 10.10.11.10, port 5201
[  4] local 10.10.11.9 port 34248 connected to 10.10.11.10 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  1.54 GBytes  13.2 Gbits/sec
[  4]   1.00-2.00   sec  1.49 GBytes  12.8 Gbits/sec
[  4]   2.00-3.00   sec  1.30 GBytes  11.2 Gbits/sec
[  4]   3.00-4.00   sec  1.27 GBytes  10.9 Gbits/sec
[  4]   4.00-5.00   sec  1.33 GBytes  11.5 Gbits/sec
[  4]   5.00-6.00   sec  1.30 GBytes  11.2 Gbits/sec
[  4]   6.00-7.00   sec  1.27 GBytes  10.9 Gbits/sec
[  4]   7.00-8.00   sec  1.26 GBytes  10.8 Gbits/sec
[  4]   8.00-9.00   sec  1.57 GBytes  13.5 Gbits/sec
[  4]   9.00-10.00  sec  1.36 GBytes  11.7 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  13.7 GBytes  11.8 Gbits/sec                  sender
[  4]   0.00-10.00  sec  13.7 GBytes  11.8 Gbits/sec                  receiver

iperf Done.

C:\Users\nickf.FUSCO\Downloads\iperf-3.1.3-win64\iperf-3.1.3-win64>


Now when I scale out to multiple threads:
Code:
C:\Users\nickf.FUSCO\Downloads\iperf-3.1.3-win64\iperf-3.1.3-win64>iperf3 -c 10.10.11.10 -P 4
Connecting to host 10.10.11.10, port 5201
[  4] local 10.10.11.9 port 52119 connected to 10.10.11.10 port 5201
[  6] local 10.10.11.9 port 52120 connected to 10.10.11.10 port 5201
[  8] local 10.10.11.9 port 52121 connected to 10.10.11.10 port 5201
[ 10] local 10.10.11.9 port 52122 connected to 10.10.11.10 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   494 MBytes  4.15 Gbits/sec
[  6]   0.00-1.00   sec   495 MBytes  4.15 Gbits/sec
[  8]   0.00-1.00   sec   494 MBytes  4.14 Gbits/sec
[ 10]   0.00-1.00   sec   476 MBytes  3.99 Gbits/sec
[SUM]   0.00-1.00   sec  1.91 GBytes  16.4 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec   537 MBytes  4.50 Gbits/sec
[  6]   1.00-2.00   sec   540 MBytes  4.53 Gbits/sec
[  8]   1.00-2.00   sec   530 MBytes  4.44 Gbits/sec
[ 10]   1.00-2.00   sec   505 MBytes  4.24 Gbits/sec
[SUM]   1.00-2.00   sec  2.06 GBytes  17.7 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec   442 MBytes  3.71 Gbits/sec
[  6]   2.00-3.00   sec   455 MBytes  3.82 Gbits/sec
[  8]   2.00-3.00   sec   456 MBytes  3.82 Gbits/sec
[ 10]   2.00-3.00   sec   447 MBytes  3.75 Gbits/sec
[SUM]   2.00-3.00   sec  1.76 GBytes  15.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec   482 MBytes  4.04 Gbits/sec
[  6]   3.00-4.00   sec   480 MBytes  4.02 Gbits/sec
[  8]   3.00-4.00   sec   474 MBytes  3.98 Gbits/sec
[ 10]   3.00-4.00   sec   465 MBytes  3.90 Gbits/sec
[SUM]   3.00-4.00   sec  1.86 GBytes  16.0 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec   477 MBytes  4.00 Gbits/sec
[  6]   4.00-5.00   sec   477 MBytes  4.00 Gbits/sec
[  8]   4.00-5.00   sec   471 MBytes  3.95 Gbits/sec
[ 10]   4.00-5.00   sec   452 MBytes  3.79 Gbits/sec
[SUM]   4.00-5.00   sec  1.83 GBytes  15.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec   473 MBytes  3.97 Gbits/sec
[  6]   5.00-6.00   sec   469 MBytes  3.94 Gbits/sec
[  8]   5.00-6.00   sec   463 MBytes  3.89 Gbits/sec
[ 10]   5.00-6.00   sec   444 MBytes  3.73 Gbits/sec
[SUM]   5.00-6.00   sec  1.81 GBytes  15.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec   463 MBytes  3.88 Gbits/sec
[  6]   6.00-7.00   sec   491 MBytes  4.12 Gbits/sec
[  8]   6.00-7.00   sec   484 MBytes  4.06 Gbits/sec
[ 10]   6.00-7.00   sec   465 MBytes  3.90 Gbits/sec
[SUM]   6.00-7.00   sec  1.86 GBytes  16.0 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec   419 MBytes  3.52 Gbits/sec
[  6]   7.00-8.00   sec   488 MBytes  4.09 Gbits/sec
[  8]   7.00-8.00   sec   498 MBytes  4.18 Gbits/sec
[ 10]   7.00-8.00   sec   462 MBytes  3.87 Gbits/sec
[SUM]   7.00-8.00   sec  1.82 GBytes  15.7 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec   453 MBytes  3.80 Gbits/sec
[  6]   8.00-9.00   sec   478 MBytes  4.01 Gbits/sec
[  8]   8.00-9.00   sec   479 MBytes  4.02 Gbits/sec
[ 10]   8.00-9.00   sec   480 MBytes  4.02 Gbits/sec
[SUM]   8.00-9.00   sec  1.84 GBytes  15.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec   430 MBytes  3.61 Gbits/sec
[  6]   9.00-10.00  sec   450 MBytes  3.78 Gbits/sec
[  8]   9.00-10.00  sec   474 MBytes  3.98 Gbits/sec
[ 10]   9.00-10.00  sec   446 MBytes  3.74 Gbits/sec
[SUM]   9.00-10.00  sec  1.76 GBytes  15.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  4.56 GBytes  3.92 Gbits/sec                  sender
[  4]   0.00-10.00  sec  4.56 GBytes  3.92 Gbits/sec                  receiver
[  6]   0.00-10.00  sec  4.71 GBytes  4.05 Gbits/sec                  sender
[  6]   0.00-10.00  sec  4.71 GBytes  4.05 Gbits/sec                  receiver
[  8]   0.00-10.00  sec  4.71 GBytes  4.05 Gbits/sec                  sender
[  8]   0.00-10.00  sec  4.71 GBytes  4.05 Gbits/sec                  receiver
[ 10]   0.00-10.00  sec  4.53 GBytes  3.89 Gbits/sec                  sender
[ 10]   0.00-10.00  sec  4.53 GBytes  3.89 Gbits/sec                  receiver
[SUM]   0.00-10.00  sec  18.5 GBytes  15.9 Gbits/sec                  sender
[SUM]   0.00-10.00  sec  18.5 GBytes  15.9 Gbits/sec                  receiver

iperf Done.

C:\Users\nickf.FUSCO\Downloads\iperf-3.1.3-win64\iperf-3.1.3-win64>


With these settings alone, single-stream throughput improved from roughly 7.4 Gbit/s to 11.8 Gbit/s and multi-stream throughput from roughly 11 Gbit/s to 15.9 Gbit/s, and yet we are still nowhere near the maximum the network cards themselves are capable of.

Re-running Wireshark after applying my settings:
There are 215,201 packets
[Image: Wireshark capture summary after tuning]


[Image: Wireshark TCP-analysis filter results after tuning]


And 44,883 packets are still flagged as indicating potentially poor performance, which at roughly 20% of the capture is actually a higher proportion than the original 13%!

It seems I have hit the wall of this test platform and would need a beefier test environment to scale further.

  • Setting auto-tune for TCP to any value other than "Normal" resulted in performance degradation.
    Set-NetTCPSetting -AutoTuningLevelLocal Normal
  • Disabling Flow Control and QoS resulted in performance degradation.
  • Changing RSS settings may have some performance left to unlock too, but in my testing so far every variation I tried resulted in performance degradation (see the sketch below).
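For anyone who wants to experiment with RSS anyway, the relevant cmdlets are Get-NetAdapterRss and Set-NetAdapterRss. A minimal sketch; the adapter name, core range, and queue count below are placeholders, and on my hardware every variation along these lines ended up slower than the defaults:

Code:
# Show the current RSS layout: queues, base/max processors, NUMA node
Get-NetAdapterRss -Name "Ethernet"

# Example tweak: pin RSS to a specific range of cores and cap the queue count
Set-NetAdapterRss -Name "Ethernet" -BaseProcessorNumber 2 -MaxProcessorNumber 7 -NumberOfReceiveQueues 4

# Turn RSS off/on entirely to compare against the tuned layout
Disable-NetAdapterRss -Name "Ethernet"
Enable-NetAdapterRss -Name "Ethernet"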