ARP expiration (maybe) causing loss of connectivity?

Hoffe

Cadet
Joined
Apr 19, 2020
Messages
6
Hi everyone.

Been lurking around the forums for years now, but now it's time to actually post something as well. (First post ever, yay!! :)

I have recently replaced our switches with a couple of new ones that have extra 10G ports I could attach our FreeNAS system to. Since the new switches were introduced in our network, we have seen a lot of dropouts in connectivity. We're using the FreeNAS box as an iSCSI target with a couple of ESXi servers as initiators. This had been working flawlessly before.

I have been debugging the problem for days now, using these forums as a great source for ideas and best practice. At first I thought it was related to the use of jumbo frames in a 10G network, so I reconfigured our entire network to MTU 1500. Then I suspected it was iSCSI port binding, which I had configured but found out was of no use since we have our target IPs in different subnets. But the problem still persists. The ESXi logs show a path as being down, and FreeNAS shows a log line similar to: (iqn.1998-01.com.vmware:esx01-3e9ab750): no ping reply (NOP-Out) after 5 seconds; dropping connection.

But today I boiled the problem down to having something to do with ARP! Or at least very suspiciously connected to ARP. When checking with arp -an on the FreeNAS machine, I can now predict which host will fail next, as the dropout happens exactly when the ARP entry expires! Sometimes fetching the new ARP entry takes 1-2 seconds and sometimes it takes way more, stalling the iSCSI path. I have attached some screenshots showing the problem.
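A rough sketch of how that "which host fails next" check could be scripted. It assumes the FreeBSD-style arp -an line format ("... expires in N seconds"); the function name, the ix0/ix1 interface names, and the sample addresses are made up for illustration.

```python
import re

# Assumed FreeBSD-style line, e.g.:
# ? (10.0.21.11) at 00:50:56:aa:bb:01 on ix0 expires in 42 seconds [ethernet]
ARP_LINE = re.compile(
    r"\((?P<ip>[\d.]+)\) at (?P<mac>[0-9a-f:]+) on (?P<iface>\S+) "
    r"expires in (?P<secs>\d+) seconds"
)


def entries_by_expiry(arp_output):
    """Return (seconds_left, ip, iface) tuples, soonest expiry first."""
    found = []
    for line in arp_output.splitlines():
        m = ARP_LINE.search(line)
        if m:  # permanent or incomplete entries do not match and are skipped
            found.append((int(m["secs"]), m["ip"], m["iface"]))
    return sorted(found)


# Hypothetical sample output; the MAC addresses are invented.
SAMPLE = """\
? (10.0.22.11) at 00:50:56:aa:bb:02 on ix1 expires in 913 seconds [ethernet]
? (10.0.21.11) at 00:50:56:aa:bb:01 on ix0 expires in 42 seconds [ethernet]
? (10.0.21.1) at 0c:c4:7a:00:00:01 on ix0 permanent [ethernet]
"""

if __name__ == "__main__":
    for secs, ip, iface in entries_by_expiry(SAMPLE):
        print(f"{ip} on {iface} expires in {secs}s")
```

On a real box the input would be the text printed by arp -an rather than SAMPLE; the first tuple returned is the entry closest to expiring.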

The switches are stacked Cisco SG550X, and the FreeNAS machine is connected to two separate ports on each physical switch, with 10GBASE-T and short CAT6a cables. There are no other signs of network problems or latency in any way; everything is running smooth and fast, except when ARP entries expire. The funny thing is that 99% of the time the problem occurs on only one of the physical links/adapters (10.0.21.0/24). The paths on the other physical link (10.0.22.0/24) are mostly unaffected. This makes me think it may be an STP-related issue, but I don't have the knowledge to confirm it.

I have looked through all the configuration on the switches for something related to ARP, but nothing really makes sense.

So.. has anyone experienced similar issues before? Anyone got an idea of how introducing newer/faster switches would introduce this problem? Is there any tweaking that can be done to fetch a new ARP before expiration?

In general, Cisco equipment can be a bit "slow" to let new devices access the network. For example, when I test stuff with my laptop and plug the cable into a port, it can take up to 10 seconds before a ping is answered. I'm thinking this is a variation of that. Maybe every time an ARP entry expires on the FreeNAS machine, it has to "plug in" again, and the switches are slow to respond and provide access?

Hardware info:
FreeNAS-9.10.2 (a476f16)
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
262079MB Memory
Supermicro X9DRW-3TF+ motherboard
Intel X540 Dual port 10GBase-T onboard

Network:
2x Cisco SG550X-24.
FreeNAS server connected in 10GBASE-T ports with CAT 6a cables.

Screenshots:
vmsan_iscsi_problem_04.jpg vmsan_iscsi_problem_03.jpg vmsan_iscsi_problem_02.jpg
 

Hoffe

A little more information..

I have verified that the switch ports the two FreeNAS links are hooked up to are both marked as Edge in the STP interface settings, which is equivalent to PortFast etc.

I have also verified that no storm control functionality is enabled in the switch stack. So it shouldn't be the switches delaying the ARP replies.

I started a tcpdump 30 seconds before the next ARP-related path outage I could predict, but the only thing I can really conclude from it is that 27 ARP requests are broadcast, and that it takes 26.2 seconds to actually get a reply. o_O

Does anyone have a clue what could be the culprit here?

Wireshark screendump:
vmsan_iscsi_problem_05.jpg
 

Kcaj

Contributor
Joined
Jan 2, 2020
Messages
100
I'm probably not smart enough to solve your problem, but:

The device with the vmware MAC would be where I would start looking (or the device you expect to resolve the ARP), though what subnet/interface is your default gateway?
 
Hoffe

Kcaj said: "The device with the vmware MAC would be where I would start looking (or the device you expect to resolve the ARP), though what subnet/interface is your default gateway?"
The Wireshark lines with 10.0.10.x are not relevant; they only show up because the interface goes promiscuous and catches that ARP traffic as well. 10.0.10.0/24 is my vmkernel/vmotion subnet. 10.0.21.0/24 is for the first iSCSI path and 10.0.22.0/24 is for the second iSCSI path.

No default gateways, since iSCSI doesn't need to be routed at all.

The weird thing is that FreeNAS is sending ARP requests in both the .21 and .22 subnets. The ones in .21 get awful reply times, while the ones in .22 seem fine. Older Cisco 1G switches = no problem. New 10G Cisco switches = weird problems. I've totally exhausted my troubleshooting knowledge :rolleyes:
 

Kcaj

As before, if you want me to stop asking dumb questions, just say so ;)

If you unplug the .22 interface, do you still see a problem with .21 alone?

Hoffe said: "The wireshark lines with 10.0.10.x are not relevant, it's just because the interfaces goes promiscuous and catches that ARP traffic as well."
Which interface on the FreeNAS are you capturing on? If that's the case, wouldn't that mean that the FreeNAS is on a trunk port, which is maybe not what you want?
 

Hoffe

Kcaj said: "As before if you want me to stop asking dumb questions just say ;)"
No dumb questions - I'm grateful you're taking the time to point stuff out. I have been troubleshooting this for so many hours now that I'm stuck in a loop.

Kcaj said: "If you unplug .22 interface do you still see a problem with the .21 alone?"
I'm afraid I don't have the luxury of unplugging cables, since this is a production system. But I did just do a small variation of what you suggested. I took one of the hosts that only runs VMs that can tolerate a bit of downtime, and I downed both of its switch ports, one at a time. That way I could control which physical switch the traffic was passing through. The symptoms were exactly the same, no matter which port the traffic was flowing through.

Actually, this time I tried forcing the problem by simply deleting the ARP entry on the FreeNAS machine (instead of waiting for it to time out). When I ran arp -d 10.0.21.11, I lost about 12 seconds of ping packets. When I ran arp -d 10.0.22.11, I didn't miss a single ping - the ARP request was answered instantly. W T F ?

I'm baffled! Same stacked switches, same configuration on all ports, same VLAN, and it happens both with and without iSCSI port binding (so ARP flux is out of the question). If I took my old 1G switches and plugged everything back in, I'm positive that this problem would go away. I'm beginning to think that I've hit some magic 10G vs. Cisco SG550X vs. FreeBSD kernel vs. galactic interference bug.

Kcaj said: "You are capturing on FreeNAS which interface? If thats the case wouldnt that mean that the FreeNAS is on a trunk port and maybe not what you want?"
When I run tcpdump it puts the NIC in promiscuous mode, so it captures all traffic in the VLAN. Both iSCSI subnets are in the same operational VLAN across my network.
 

Hoffe

Does anyone have a good idea on where/how to debug my network more thoroughly? As mentioned before, I have exhausted my network-debugging knowledge, and tcpdump+Wireshark really gave me nothing. :(
 

Hoffe

WELL.. Turns out I have been overcomplicating my troubleshooting the whole time and overlooked one of the most obvious problems ever. My two physical interfaces had the same god damned MAC address! I suspect they have been bonded before or something like that. I haven't ever touched the MACs.
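One plausible reading of why a shared MAC address would stall ARP, sketched as a toy model (this is an assumption about the mechanism, not something confirmed in this thread): a switch learns which port a MAC lives on from the most recent frame it saw with that source address, so when two ports present the same MAC, a unicast ARP reply can be sent out of the wrong one. The class, port names, and MAC addresses below are all made up for illustration.

```python
class LearningSwitch:
    """Toy MAC-learning switch: the most recent sender owns the table entry."""

    def __init__(self):
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, port, src_mac, dst_mac):
        """Learn the source MAC, then return the egress port (None means flood)."""
        self.mac_table[src_mac] = port
        return self.mac_table.get(dst_mac)


SHARED_MAC = "0c:c4:7a:00:00:01"  # invented address, used by both NAS ports
ESX_MAC = "00:50:56:aa:bb:01"     # invented initiator address
BROADCAST = "ff:ff:ff:ff:ff:ff"

switch = LearningSwitch()
# The .21 link broadcasts an ARP request; the switch learns SHARED_MAC on nas-port-1.
switch.receive("nas-port-1", SHARED_MAC, BROADCAST)
# The .22 link then sends any frame with the same MAC; the entry moves to nas-port-2.
switch.receive("nas-port-2", SHARED_MAC, ESX_MAC)
# The unicast ARP reply for the .21 request now follows the table to the other port.
reply_port = switch.receive("esx-port", ESX_MAC, SHARED_MAC)
print(reply_port)  # nas-port-2
```

In this model the reply only reaches the right link if that link happened to be the last one to transmit, which would at least be consistent with resolution times varying from instant to many seconds.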

In desperation I turned to the #cisco channel on FreeNode IRC, and a helpful soul asked me to check and verify the MAC address table on the switch, which led me to look up the MAC addresses of the FreeNAS NICs, and then I got embarrassed.
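For anyone wanting to check for the same thing from the FreeNAS side, here is a rough sketch that looks for a MAC address shared by several interfaces. It assumes FreeBSD-style ifconfig output (an interface header line followed by an indented ether line); the function name, the ix0/ix1 interface names, and the sample addresses are hypothetical.

```python
import re
from collections import defaultdict


def duplicate_macs(ifconfig_output):
    """Map each MAC used by more than one interface to the interfaces using it."""
    by_mac = defaultdict(list)
    iface = None
    for line in ifconfig_output.splitlines():
        header = re.match(r"^(\S+): flags=", line)
        if header:  # start of a new interface block
            iface = header.group(1)
            continue
        ether = re.match(r"^\s+ether ([0-9a-fA-F:]{17})", line)
        if ether and iface:
            by_mac[ether.group(1).lower()].append(iface)
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}


# Hypothetical sample output; addresses are invented.
SAMPLE = (
    "ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tether 0c:c4:7a:00:00:01\n"
    "\tinet 10.0.21.2 netmask 0xffffff00 broadcast 10.0.21.255\n"
    "ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tether 0c:c4:7a:00:00:01\n"
    "\tinet 10.0.22.2 netmask 0xffffff00 broadcast 10.0.22.255\n"
    "lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384\n"
    "\tinet 127.0.0.1 netmask 0xff000000\n"
)

if __name__ == "__main__":
    for mac, names in duplicate_macs(SAMPLE).items():
        print(f"{mac} is used by: {', '.join(names)}")
```

An empty result means every interface with an ether line has its own address; anything else is worth comparing against the switch's MAC address table.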

But hey.. problem solved.. and probably the easiest fix I have ever pushed out. :cool:

Stay safe!
 