SOLVED KVM: 200% CPU usage with High Network Traffic from/between Virtual Machines

mroptman

Dabbler
Joined
Dec 2, 2019
Messages
23
Testing to see if Scale 22.02 can replace ESXi - the glaring issue discovered so far is increased host CPU usage while VMs are under high network load. Testing was done with iperf3 and SFTP/SMB file transfers between a physical Linux desktop and VMs running on Scale. When network load is placed on the VMs, two host CPU cores spike to full utilization (a 200% spike in top) until the network activity drops off. The VMs use the VirtIO network adapter; tests were also done with the E1000 adapter and no significant change in CPU usage was observed. The big increase in CPU usage during network traffic is a real blow.
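For reference, the iperf3 runs looked roughly like this (a sketch - the IP address is a placeholder, not my actual lab addressing):

  # On the receiving end (VM or physical desktop):
  iperf3 -s

  # On the sending VM, pointed at the receiver:
  iperf3 -c 192.168.1.50 -t 30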

Host CPU usage is the same whether the traffic is VM to VM (same host, over the bridge) or VM to a gigabit physical machine. In either case the same two CPU cores spike (200% total) until the high network traffic stops.
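For anyone wanting to reproduce the observation, a sketch of how the per-core spike shows up on the Scale host (mpstat assumes the sysstat package is installed):

  # Per-core utilization on the Scale host, sampled every second
  mpstat -P ALL 1

  # Or in top, press 1 for the per-core breakdown
  top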

Testing Speeds - As expected
VM to VM (via bridge): 16.9 Gbps
VM to Host (via bridge): 20.5 Gbps
VM to physical machine (1 GbE): 935 Mbps

I very much want Scale to replace ESXi - but this extra CPU hit with KVM is a showstopper. There may be an issue with hardware offload, as when the same hardware is booted into ESXi, CPU usage is as expected (network load does not cause a CPU spike).
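If it is an offload problem, one way to compare offload state along the path is with ethtool (the interface names below are guesses for a typical bond+bridge setup, not my exact config):

  # Offload features on the physical NIC, the bond, and the bridge
  for dev in enp3s0 bond0 br0; do
    echo "== $dev =="
    ethtool -k "$dev" | grep -E 'tcp-segmentation|generic-segmentation|generic-receive'
  done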

Is anyone else experiencing similar issues?
Should there be hardware offload?
Did I make a configuration goof? :)

Any help would be much appreciated.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It's an interesting symptom and difference, but not necessarily an issue. It might be that the Linux networking stack just uses a single core until it is overloaded.
I'd be interested in the difference in transfer speeds between KVM VMs and ESXi VMs. If the transfer speeds are lower, that would be an issue and an opportunity for improvement.
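One way to test the single-core theory would be to run iperf3 with parallel streams and watch whether the load spreads across more host cores (a sketch; -P sets the number of parallel client streams):

  # 4 parallel streams instead of 1; if a single core were the bottleneck,
  # throughput and the per-core spread should both change
  iperf3 -c 192.168.1.50 -P 4 -t 30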
 

mroptman

Dabbler
Joined
Dec 2, 2019
Messages
23
Thanks for the reply - that spurred me to run the same tests on ESXi, which should have been done originally.

ESXi 6.7.0 - iperf3 speed tests:
VM to VM (via vSwitch): 18.4 Gbps
VM to Host: not tested (iperf3 not available on the ESXi host, or I could not find the binary)
VM to physical: 934 Mbps - I was not expecting this to be any different. Too cheap for home 10G...

I also noticed that ESXi charges a VM's network CPU usage against the VM itself, whereas on TrueNAS with qemu the network CPU time does not appear against the VM. During the initial tests on Scale in the original post, the extra CPU usage was not visible in top inside the VM; it only showed up in top on the Scale host.
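My understanding (worth verifying) is that this is because the virtio network backend runs as vhost kernel threads on the host, so their CPU time never appears inside the guest. A sketch of how to spot them:

  # vhost network threads are kernel threads named vhost-<qemu pid>;
  # their CPU time is charged to the host, not to the guest
  ps -eLf | grep '[v]host-'

  # Watch them live during a transfer (per-thread view, sort by CPU)
  top -H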

Noticed a small VM-to-VM edge for ESXi (18.4 Gbps vs 16.9 Gbps on Scale KVM) - a small/negligible difference of 1.5 Gbps. However, the CPU usage for the network transfer is charged against the ESXi VM, which I had not noticed before. Qemu networking appears to use free host CPU to process network transfers instead of the VM's assigned CPUs, which is kinda cool actually! A maxed-out network is not normal usage anyway. The heaviest network load would probably be a first-time snapshot replication, but replication runs against the host, which does appear to do proper hardware offload and leaves plenty of host CPU available.

Scale is looking promising as a replacement for ESXi in the humble home lab; I was just taken by surprise by how CPU usage is reported in high network traffic scenarios on Scale vs ESXi.

Summary of Differences:
  • ESXi VMs record network usage against the VM that is doing the network transfer, not the host
  • Scale/Qemu VMs record network usage against free host CPU, allowing VMs to use all assigned CPU for VM tasks (a neat discovery)
    • The host provides CPU for network processing
  • Network traffic against a bare metal Scale host does not cause any CPU spikes; hardware offloading is working
  • Qemu appears not to support hardware offloading for VM network traffic when using an LACP bond to physical with a bridge for the VMs (a quick check is sketched below)
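A sketch of the checks behind that last bullet (the guest NIC name is illustrative; it varies by distro):

  # On the Scale host: confirm the in-kernel virtio backend is loaded
  lsmod | grep vhost_net

  # Inside the guest: offloads negotiated on the virtio NIC
  ethtool -k ens3 | grep -E 'segmentation|receive-offload'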
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
mroptman said:
Summary of Differences (quoted in full above)

Great post... and thanks for the analysis.
 