Network drops 5-90 mins after boot, suspect NIC?

SpaceBass

Cadet
Joined
Jul 4, 2023
Messages
5
TL;DR - machine worked for a year, recently started dropping off the network. Logs don't seem to show anything. NIC is an HP QLogic NC523SFP.

hardware:
  • Hyve Zeus with some sort of Supermicro board
  • CPU - E3-1220
  • RAM - 16 GB (checked with Memcheck, seems fine)
  • System installed on a Samsung SSD
  • 6 × 10 TB spinning drives
  • HP NC523SFP QLogic NIC, which uses the qlxgb kernel driver

details
hey folks,
I'm new here, but I've been running TrueNAS Core for some time (although I'm by no means an expert). The box in question has been running for a year with no real issues. In the last week, it keeps dropping off the network. I've checked the logs (debug, console, messages, system, and middleware) and I don't see much, with one exception:

I have a kernel "ARP moved" message. It's a strange one: the MAC address in one of the messages is _almost_ identical to my router's WAN NIC, but the last two bits are different. The other MAC address doesn't appear anywhere in my ARP table. I doubt this is a relevant message, but you experts will know better than I :)
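For comparing the two addresses in an "ARP moved" message, a small script can pull out the fields and show exactly which octets differ. This is only a sketch: it assumes the usual FreeBSD wording ("arp: <ip> moved from <old MAC> to <new MAC> on <interface>"), and the sample line, addresses, and interface name are made up for illustration, not taken from this system's logs.

```python
import re

# Assumed wording of the FreeBSD kernel message:
#   arp: <ip> moved from <old mac> to <new mac> on <interface>
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\S+) moved from (?P<old>[0-9a-f:]{17}) "
    r"to (?P<new>[0-9a-f:]{17}) on (?P<ifname>\S+)",
    re.IGNORECASE,
)


def parse_arp_moved(line):
    """Return the fields of an 'arp: ... moved' line, or None if no match."""
    match = ARP_MOVED.search(line)
    return match.groupdict() if match else None


def differing_octets(mac_a, mac_b):
    """Return the zero-based positions of the octets that differ."""
    a = mac_a.lower().split(":")
    b = mac_b.lower().split(":")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical sample line; substitute a real line from /var/log/messages.
sample = "arp: 10.15.1.1 moved from 00:11:22:33:44:55 to 00:11:22:33:44:57 on vlan1"
info = parse_arp_moved(sample)
print(info)
print("octets that differ:", differing_octets(info["old"], info["new"]))
```

If only the last octet differs, the two addresses may simply be neighbouring ports on the same device, which is common on routers and switches that number their interfaces sequentially.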

After the device loses network, I try and run to the console as fast as I can to check the logs. Again, not much.

If I try to take the interface down with ifconfig ql0 down, I get a lot of driver errors, including some suggesting the queues are full. That's the only reason I mention the ARP message... might that be filling up the queues? ¯\_(ツ)_/¯ I've also tried reloading the kernel driver; no joy.

The NIC is connected to my switch via a DAC cable. I've also tried 4 different SFP modules with two different fiber patches... no change.

From there, my only solution is a full reboot. Sometimes I get 24 hours, but most of the time I get 3-5 minutes before the thing drops off again.

I had enough issues with the NC523SFP in pfSense to know that while it is supported in FreeBSD, it is a bit of a flaky card. I have no problem replacing it... but I also want to make sure that's what is causing my issue.

I've ordered an Intel X520 NIC but it's a week away. Also, I'd just like to learn how to troubleshoot this system better. So, what would you do, test, or check? Any advice on tunables or settings?
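Since the drop happens at unpredictable times, one way to avoid racing to the console is to leave a small watchdog running locally that timestamps the moment the gateway stops answering and captures diagnostics right then. The following is a rough sketch, not a tested tool for this system: the function names are invented for illustration, the choice of `ifconfig ql0`, `netstat -i`, and `dmesg` as diagnostics is an assumption about what is useful here, and the gateway address in the usage comment comes from the network details in this thread.

```python
import subprocess
import time
from datetime import datetime

# Commands to capture when the state changes (an assumed, adjustable list).
DIAGNOSTICS = [["ifconfig", "ql0"], ["netstat", "-i"], ["dmesg"]]


def ping_once(host, timeout=5):
    """Return True if a single ICMP echo to host succeeds within timeout."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def snapshot(commands=DIAGNOSTICS):
    """Run each diagnostic command and return the combined output as text."""
    parts = []
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        parts.append("$ " + " ".join(cmd) + "\n" + proc.stdout + proc.stderr)
    return "\n".join(parts)


def watch(host, probe=ping_once, interval=5, max_checks=None, on_change=None):
    """Probe host repeatedly and record every reachable/unreachable change."""
    events = []
    was_up = None
    checks = 0
    while max_checks is None or checks < max_checks:
        up = probe(host)
        if up != was_up:
            stamp = datetime.now().isoformat(timespec="seconds")
            events.append((stamp, up))
            print(f"{stamp} {host} is {'reachable' if up else 'UNREACHABLE'}")
            if on_change is not None:
                on_change(up)
            was_up = up
        checks += 1
        time.sleep(interval)
    return events


# Example (runs until interrupted; redirect stdout to a file on local disk):
# watch("10.15.1.1", on_change=lambda up: None if up else print(snapshot()))
```

Writing the output to local storage matters here, because any remote logging goes away at the same moment the network does.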
 

SpaceBass

Adding some network details

I'm only using one port (0) on the NIC. The NIC is configured with two VLANs:
vlan 1 - 10.15.1.5/24 with gateway of 10.15.1.1
vlan 1024 - 10.15.100.5/25 default GW 10.15.100.1

Routing ACLs do allow some traffic between those VLANs, but generally anything accessing this box will come from the same network and interface.

For instance, a Proxmox host on 10.15.100.0 will access TrueNAS on 10.15.100.5. A desktop client on 10.15.1.0 will access 10.15.1.5

Some IoT devices will access via 10.15.9.0 through an ACL allowing, for example, NFS, to 10.15.1.5.
Not sure any of that matters, but... wanted to share it.
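The two prefix lengths above are different (/24 and /25), so it may be worth confirming what each one actually covers. The sketch below uses Python's standard `ipaddress` module with the addresses exactly as posted; the `describe` helper is a made-up name for illustration. If the /25 is a typo for /24, the second result changes accordingly.

```python
import ipaddress

# Interface address and gateway for each VLAN, exactly as posted above.
VLANS = {
    "vlan 1": ("10.15.1.5/24", "10.15.1.1"),
    "vlan 1024": ("10.15.100.5/25", "10.15.100.1"),
}


def describe(cidr, gateway):
    """Summarise the subnet an interface is on and whether the gateway is on-link."""
    net = ipaddress.ip_interface(cidr).network
    return {
        "network": str(net),
        "first_host": str(net.network_address + 1),
        "last_host": str(net.broadcast_address - 1),
        "gateway_on_link": ipaddress.ip_address(gateway) in net,
    }


for name, (cidr, gateway) in VLANS.items():
    print(name, describe(cidr, gateway))
```

With a /25 on vlan 1024, usable hosts run from 10.15.100.1 to 10.15.100.126, so any client at 10.15.100.128 or above would be outside the subnet as TrueNAS sees it, and replies to it would go via the gateway rather than directly.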
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I've ordered an Intel X520 NIC but it's a week away. Also, I'd just like to learn how to troubleshoot this system better. So, what would you do, test, or check? Any advice on tunables or settings?

Check out the 10 Gig Networking Primer in the resources section, which discusses various network cards. The short version is that really only two manufacturers are known to produce cards that work correctly for all use cases; most others have weird combinations of failure modes like "LACP plus jumbo frames = broken network". The FreeBSD hardware compatibility list is NOT a recommended source of information for what is good to use with TrueNAS.

If your system is a first-gen E3-1220 Xeon, it is likely ten to twelve years old, which is also a point at which hardware can start to fail. Your combination of a questionable mainboard plus a questionable Ethernet card makes isolating the problem more difficult without parts swapping. It could be failing hardware, or driver updates resulting in new misbehaviour from old firmware, or a need for a firmware update for the Ethernet card, or..., or..., or...
 

SpaceBass

Thanks @jgreco - lots of variables and age on this box. I'll keep poking around while I wait for the new (to me) Intel NIC.
 