New build, problematic performance

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Good point on the CPU clock. It's within the realm of possibility, since writes are buffered in RAM and then flushed all at once after each TXG, which minimizes work. But it still doesn't make sense: an Avoton is much slower per GHz, isn't clocked that much higher, and it would still eat this workload for dinner.
Since RAIDZ seems to be fine, that tends to point at something specific to multiple vdevs - but again, hundreds of people run pools like this on FreeBSD, so I'm at a bit of a loss.
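
As an aside, the bursty TXG-flush behaviour described above can be watched from the shell (a rough sketch - the pool name is a placeholder):
Code:
# Per-second pool throughput; writes should arrive in bursts roughly
# every TXG interval (vfs.zfs.txg.timeout, usually 5 seconds):
zpool iostat tank 1

# Check the current TXG timeout on FreeBSD/FreeNAS 11:
sysctl vfs.zfs.txg.timeout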
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
Okay, at least that's consistent. Do you get any messages while the benchmark is running?

Is the firmware on the HBA up to date? I'll try to think of other avenues of investigation.

Hi Eric, the HBA is at the latest firmware version. Thanks for your help with this.
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
While you've got a low-clocked CPU, if that were the root of the issue I think it would have choked your RAIDZ2 reads as well. And you've done this with compression off, so there's even less to get in the way on that front.

The fact that just changing from Z2 to mirrors seems to be enough to trash your performance is what's got me really confused. It's normal to have lower sequential performance but not that much lower.

I hate to say it, but have you tried that reinstall yet? Or even (caution, heresy ahead) a different OS with ZFS still in play?

The dd testing itself didn't seem to cause much of a problem for the CPU... I'm currently running NIC/iSCSI tests against the RAIDZ2 vdev and the CPU is showing more signs of being hammered.

Do you think I should be disabling hyper-threading?
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
I overlooked the clock speed on the CPU. What does CPU utilization look like during the test? The low clock could also affect network traffic when it comes to that.

During dd testing, it's not showing any issues - though I do have hyper-threading on. I've read that this can cause some problems, but in some early testing I tried with it both on and off and didn't see any real difference.
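
(For anyone following along, hyper-threading can also be toggled from the loader rather than a BIOS trip - a hedged sketch, assuming stock FreeBSD/FreeNAS 11:)
Code:
# Loader tunable (e.g. System -> Tunables, type "loader", or
# /boot/loader.conf.local) - disables the HT logical CPUs at boot:
machdep.hyperthreading_allowed=0

# After a reboot the logical CPU count should halve:
sysctl hw.ncpu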
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The dd testing itself didn't seem to cause much of a problem for the CPU... I'm currently running NIC/iSCSI tests against the RAIDZ2 vdev and the CPU is showing more signs of being hammered.

Do you think I should be disabling hyper-threading?
I've got HT disabled on my hex-core but it's also an X5650 so it's way behind in the IPC department.

Hail-Mary suggestion: try setting the tunable
Code:
vfs.zfs.compressed_arc_enabled=0
at boot time and see if that changes anything. But I'm still hesitant to put the blame entirely on "slow CPU", because your RAIDZ2 numbers are much better (about 10x, looking at that graph). I still want to chase something in the software, or maybe a firmware bug. The HBA is running the IT firmware, not IR, right? No chance of the controller somehow trying to do its own RAID thing?
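
To actually apply it (a sketch, assuming the usual FreeNAS 11 loader-tunable mechanism):
Code:
# Add as a loader-type tunable in the GUI (System -> Tunables), or
# append to /boot/loader.conf.local, then reboot:
vfs.zfs.compressed_arc_enabled=0

# Confirm it took effect after the reboot:
sysctl vfs.zfs.compressed_arc_enabled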
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
I've got HT disabled on my hex-core but it's also an X5650 so it's way behind in the IPC department.

Hail-Mary suggestion: try setting the tunable
Code:
vfs.zfs.compressed_arc_enabled=0
at boot time and see if that changes anything. But I'm still hesitant to put the blame entirely on "slow CPU", because your RAIDZ2 numbers are much better (about 10x, looking at that graph). I still want to chase something in the software, or maybe a firmware bug. The HBA is running the IT firmware, not IR, right? No chance of the controller somehow trying to do its own RAID thing?

The HBA is a Broadcom 9305-16i, which isn't a RAID card, so it's already in IT mode as such? Unless someone wants to correct me?
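
(If it helps, the firmware version and image type can be double-checked from the FreeNAS shell - a sketch, assuming the sas3flash utility for SAS3 HBAs is available (it ships with recent FreeNAS versions, otherwise it's on Broadcom's site):)
Code:
# Lists attached Broadcom/Avago SAS3 controllers, including the
# firmware version and whether the image is IT or IR:
sas3flash -list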
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
Have just tried a simple copy test and it's slightly slower with hyper-threading, though the CPU graph seemed a little less messy.

Also, in some real-world testing, iSCSI won't go above about 6 Gbit/s, which is a bit of a headache for me. But it's a bit of a secondary issue.

Have downloaded v9.10 so will definitely reinstall and benchmark again.



Sent from my HTC One M9 using Tapatalk
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
the CPU is showing more signs of being hammered.

Do you think I should be disabling hyper-threading?
I think it is the low clock speed, nothing to do with hyper-threading. Some things are single threaded, so a higher clock speed is important, but there are plenty of other tasks to keep the other cores busy.
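
(A quick way to see whether a single thread is pinning one core - a sketch using FreeBSD's top:)
Code:
# -S includes system processes, -H shows individual threads, -P shows
# per-CPU usage; a single-threaded bottleneck shows up as one thread
# sitting near 100% on one core while the others idle:
top -SHP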
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Have downloaded v9.10 so will definitely reinstall and benchmark again.
If you're already on 11 and planning a reinstall, give the tunable I mentioned above (uncompressed ARC) a shot. It shouldn't do anything because you're already running without compression ... but still.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Come to think of it, the uncompressed code path sees a lot less testing these days with compressed ARC... Another thing to try is to turn on compression and then run your normal workload.
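
(If you go that route, something along these lines - the dataset name is a placeholder:)
Code:
# Enable LZ4 on the dataset under test; only newly written blocks are
# compressed, existing data stays as-is:
zfs set compression=lz4 tank/testdata

# Confirm the setting:
zfs get compression tank/testdata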
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
I've been off grid for a while, still trying to get my head around things.

Compression - this has only been left off during dd testing, just to see how the drives perform overall and whether the system is OK in general. However, during real-world testing I've been turning it on after resetting the dataset.

... I did say I would re-install with 9.x, but this failed as the NIC drivers weren't available in that version, so I had no way to control the system at all. So I just put v11 back on.

CPU - now this is where things could get interesting: the CPU will Turbo Boost to 3 GHz!? https://ark.intel.com/products/123544/Intel-Xeon-Silver-4108-Processor-11M-Cache-1_80-GHz - how do I check for boost?

I did some testing using 2 x Z2 arrays (8 disks in each). I've attached a screenshot of a real-world transfer of a 104 GB file over iSCSI ("104gb file - iSCSI.png") - the file was transferred from a workstation using an M.2 drive. You'll see the CPUs are not being hit hard at all, and these numbers stay pretty consistent. Disk utilisation did seem very high though ("104gb file - iSCSI - HDDs.png"), which was a little odd.

This next test was reading it back to the M.2 over iSCSI... "104gb - transfer back to M2 - HDD info.png" and "104gb - transfer back to M2 - CPU info.png"... I even took a screenshot of the LAGG NICs (2 x 10GbE), "104gb - transfer back to M2 - NIC info.png"; you can see we did 2 x write and 1 x read tests. The read is so much slower! FYI, we get the same sort of results when using a single NIC/port, and we've tried swapping the cables in case there was a cable issue.

We then tested again using a fresh CIFS/SMB share. "104gb file - SMB - CPU info.png" - you can see that one core is around 60%, but that's really about it, and while watching it "live" it didn't really shift up much more overall.

I'd have thought, with the overall spec of the system, that a CPU issue would have seen a single core being maxed out during these tests? The whole system gets way worse if we transfer lots of 4 MB files (DNG files) over either iSCSI or SMB, but the CPU still remains the same. I've attached a screenshot of the NIC during the SMB testing, "104gb file - SMB - NIC info.png"; we only did one write and one read, so it's the last two peaks on the graphs.

So I'm still missing something: reads are still slower than writes, and the CPU is not really being stressed, especially during the reads, so I can't say for sure it's a CPU issue at all... How can I tell if Turbo Boost is working, and/or how can I enable it?
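
(For checking the clock from the FreeNAS shell - a sketch, assuming the cpufreq driver is attached to the CPU:)
Code:
# Current core frequency and the available levels; on FreeBSD the turbo
# state is usually listed as the base clock + 1 MHz (e.g. 1801):
sysctl dev.cpu.0.freq dev.cpu.0.freq_levels

# Is powerd managing the clock at all?
service powerd onestatus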
 

Attachments

  • 104gb file - iSCSI.png (207.9 KB)
  • 104gb file - iSCSI - HDDs.png (277.6 KB)
  • 104gb - transfer back to M2 - CPU info.png (282.8 KB)
  • 104gb - transfer back to M2 - HDD info.png (217.6 KB)
  • 104gb - transfer back to M2 - NIC info.png (183.7 KB)
  • 104gb file - SMB - CPU info.png (323.5 KB)
  • 104gb file - SMB - NIC info.png (210.8 KB)

rvassar

Guru
Joined
May 2, 2018
Messages
972
I'm not familiar with the newer Xeons, but this processor appears to have two UPI interfaces at 9.6 GT/s. Where I'm confused is how that translates to PCIe 3.0 lanes. The dual 10GbE needs x8 PCIe lanes. That disk controller also needs x8 PCIe lanes. But what isn't clear to me is how those UPI channels become PCIe lanes. I pulled the Supermicro manual, and page 19, figure 1-5 shows two PCIe 3.0 x16 slots, one PCIe x8 slot, and then two PCIe x8 links to the "PCH", or motherboard chipset, with the two 10GbE interfaces. The PCIe 3.0 spec yields only 8 GT/s and 985 MB/s per lane, or ~7.8 GB/s for an x8 slot.

I suspect you're hitting an internal bottleneck, either on the PCH, or the UPI interfaces. Maybe not in terms of theoretical raw speed, but once you factor in overhead...
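
To put rough numbers on that budget (a back-of-the-envelope sketch in plain sh, treating the published per-lane figure as exact):
Code:
# PCIe 3.0 is ~985 MB/s per lane after encoding overhead.
echo "x8 PCIe 3.0 slot: $((8 * 985)) MB/s"        # ~7880 MB/s, ~63 Gbit/s
echo "Dual 10GbE needs: $((2 * 10000 / 8)) MB/s"  # ~2500 MB/s
# The observed ~7 Gbit/s (~875 MB/s) is well below the raw x8 limit, so
# any bottleneck here would be sharing/overhead rather than lane count.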
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
If you have a dual-port 10GbE card and a free PCIe slot, you might try configuring it to use the card and not the on-board 10GbE. The on-board 10GbE shares its PCIe lanes with the M.2 interface, on-board SATA, USB, etc... Any activity on any of those takes bandwidth away from the on-board 10GbE ports. Like I said, I'm not entirely sure how the UPI interfaces become PCIe 3.0 lanes, but that's the experiment I would run.
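
(To see where each device actually sits in the PCIe topology and at what link width - a sketch using FreeBSD's pciconf:)
Code:
# Lists PCI devices with their capabilities; the PCI-Express capability
# line shows the negotiated link width, e.g. "link x8(x8)" for the HBA
# and whatever link the on-board 10GbE ports end up on:
pciconf -lvc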
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
Thanks rvassar, getting anything like 7.8GB/sec would be nice; we're getting around 7 Gbit/s according to the NIC graph, so less than 1 GB/s.

... This takes me back to one of my early posts about this: using the dd command, the read was miles slower than the write, and when running a striped mirror it all but fell over. I would have thought the UPI/PCIe bottlenecks wouldn't be an issue just now, given the strange write vs. read symptoms?
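
(For reference, the dd tests were along these lines - a sketch, with the dataset path and size as placeholders, and compression off on the target so /dev/zero isn't flattered:)
Code:
# Sequential write test (compression off on the target dataset):
dd if=/dev/zero of=/mnt/tank/ddtest/testfile bs=1m count=100000

# Sequential read test (ideally after a reboot, or once the file is
# well past what ARC can cache):
dd if=/mnt/tank/ddtest/testfile of=/dev/null bs=1m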
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
If you have a dual-port 10GbE card and a free PCIe slot, you might try configuring it to use the card and not the on-board 10GbE. The on-board 10GbE shares its PCIe lanes with the M.2 interface, on-board SATA, USB, etc... Any activity on any of those takes bandwidth away from the on-board 10GbE ports. Like I said, I'm not entirely sure how the UPI interfaces become PCIe 3.0 lanes, but that's the experiment I would run.

The client is almost at the point of shelling out for a faster processor, but getting one with a high clock rate (on this socket) is around £1,000 - it's crackers... We can try a different NIC. I know we're using the M.2 slot for booting FreeNAS, but nothing else should be interfering with things as nothing else is plugged in - we made sure to disable all the on-board SATA controllers/ports for this reason :(
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Thanks rvassar, getting anything like 7.8GB/sec would be nice; we're getting around 7 Gbit/s according to the NIC graph, so less than 1 GB/s.

... This takes me back to one of my early posts about this: using the dd command, the read was miles slower than the write, and when running a striped mirror it all but fell over. I would have thought the UPI/PCIe bottlenecks wouldn't be an issue just now, given the strange write vs. read symptoms?

Ok, but... In the absence of the CPU being pegged, I'm basically trying to picture a data transfer budget for each step.

What steps have to happen to make that 1 GB/s happen? Data comes in off the disk, is read into memory/CPU cache, checksummed, etc., then written out to the 10GbE port. That's 4 steps, two of which involve the two UPI interfaces. If they're the same two UPI interfaces, each proceeds at half the speed, as it round-robins between them. So the 7.8 GB/s becomes 3.9 GB/s just because they're using the same UPI. If there are more steps, it gets cut in half yet again. That's before factoring in competing traffic from any other PCH use. See how quickly it can collapse? Your last set of graphs indicates you're writing the 104 GB file to and from the M.2 device, and at one point you even seem to have some swap in use.

I suspect the on-board 10GbE is getting starved by sharing the PCIe lanes with the rest of the PCH.
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
Ok, but... In the absence of the CPU being pegged, I'm basically trying to picture a data transfer budget for each step.

What steps have to happen to make that 1 GB/s happen? Data comes in off the disk, is read into memory/CPU cache, checksummed, etc., then written out to the 10GbE port. That's 4 steps, two of which involve the two UPI interfaces. If they're the same two UPI interfaces, each proceeds at half the speed, as it round-robins between them. So the 7.8 GB/s becomes 3.9 GB/s just because they're using the same UPI. If there are more steps, it gets cut in half yet again. That's before factoring in competing traffic from any other PCH use. See how quickly it can collapse? Your last set of graphs indicates you're writing the 104 GB file to and from the M.2 device, and at one point you even seem to have some swap in use.

I suspect the on-board 10GbE is getting starved by sharing the PCIe lanes with the rest of the PCH.
I'll look at this in more detail. The graphs and the M.2 are transfers from an M.2 in a workstation to the NAS, so not internal to the NAS.

Found this online:
"The two sockets are connected with two Intel Ultra Path Interconnect (UPI) links, which deliver increased bandwidth and performance over the Intel Quick Path Interconnect (QPI) that was used in previous generations of Intel Xeon processors. The UPI runs at a speed of 10.4 gigatransfers per second (GT/s). Each link contains separate lanes for the two directions. The total full-duplex bandwidth (2 links x 2 directions) is 41.6 gigabytes per second (GB/s).

Note: With full-duplex communication between two components, both ends can transmit and receive information between each other simultaneously. With half-duplex communication, the transmission and reception must happen alternatively."

... Will look into the CPU and mboard later on, need some sleep before getting up for work.

Sent from my HTC One M9 using Tapatalk
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Each link contains separate lanes for the two directions. The total full-duplex bandwidth (2 links x 2 directions) is 41.6 gigabytes per second (GB/s).


That certainly sounds much more promising! I knew I was missing some bit of information.
 

Minxster

Dabbler
Joined
Sep 27, 2013
Messages
36
I had a long call with the customer today and we're going to try and see if the HBA is actually an issue (use the onboard controller for a smaller JBOD)... This is based on the original dd testing I did, where a striped mirror array just fell on its ar$e, and also because the CPU core(s) just didn't max out at any point... The HBA is on P16, which is the highest firmware we can get for it.

rvassar - TBH your post about UPI helped a lot, as it made me ask questions about the whole rig: why does it show signs of being slow yet nothing "maxes out", hence no obvious bottleneck. Hence the HBA testing - or, more correctly, using the onboard controller to see if it acts the same as the HBA! (???)

We're just looking for ideas to test at this point, though we can't go too crazy as the server is now in production of sorts.

Again rvassar, there is no such thing as a bad idea especially if it can trigger other ideas! :)
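
(When swapping disks between the HBA and the onboard controller, the attachment can be confirmed from the shell - a sketch using the standard CAM tools; mpr is the SAS3 HBA driver and ahci the onboard SATA:)
Code:
# Lists every disk together with the bus/controller it hangs off:
camcontrol devlist -v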
 