Chelsio T520-CR seems to have died after only 2-3 days of use

Buyasta

Cadet
Joined
Jun 9, 2019
Messages
5
I have a FreeNAS server running on a second-hand Supermicro X9 series 4U server, and I plan on buying another second-hand X9 or X10 server to run Proxmox on, for various VMs.
I'd like to be able to run these VMs directly off the FreeNAS server using iSCSI, or if the performance for that solution is lacking, run them locally on ZFS storage with snapshots regularly synced back to the FreeNAS box.

To this end, I wanted to set up 10Gb networking, so after a little research and finding a number of people saying the fs.com generic optics worked perfectly in both, I bought the following hardware:
2x Chelsio T520-CR NICs (brand new, from eBay)
1x Mikrotik CRS305-1G-4S+IN (from a local supplier)
4x Generic Compatible 10GBASE-SR SFP+ 850nm 300m DOM Transceiver Module (from fs.com)
2x LC UPC to LC UPC Uniboot Duplex 0.2dB IL OM4 Multimode 2.0mm BIF Fiber Optic Patch Cable (from fs.com)

Here are the rest of the server specs:
Chassis: CSE-846BE16-R1200B
Backplane: BPN-SAS2-846EL1
Motherboard: Supermicro X9DRI-F
CPU: Dual E5-2640
RAM: 64GB DDR3 ECC (8x8GB sticks)
Storage Controller: LSI 9210-8i
Hard Drives: 6x WD Red 8TB, 6x WD Red 6TB
NICs: on-board Intel I350, Chelsio T520-CR

To begin with, I installed one of the T520s into the FreeNAS box and had it connected as follows:
T520 > Transceiver > OM4 cable > Transceiver > CRS305
And then patched the CRS305 into the rest of my network using the RJ45 port.

This worked nicely, and I was planning on installing the second T520 into a Linux box that's already running some KVM VMs, so I could test the iSCSI vs local performance.
When I woke up this morning (after it'd been running about 2-3 days), I found that my network connection to the FreeNAS server had dropped.

After using the IPMI to re-enable the onboard 1G networking and patching it back into the network, I SSH'd in to investigate, and found a bunch of errors regarding the T520.
I've tried rebooting both the server and switch, reseating the transceivers at both ends, reseating the cable at both ends, switching the SFP+ transceiver to the other port, and moving the T520 into a different PCIe slot, none of which has had any effect.

Here's the dmesg output from when I initially SSH'd in, without having rebooted:

Code:
Firmware reports adapter error: Crash
CIM illegal transaction (0x2)
t5nex0: encountered fatal error, adapter stopped.
Fatal parity error (0x10)
t5nex0: encountered fatal error, adapter stopped.
edc0 err addr 0x50084: 0x480.
bist: 0x50028, status 2020103030f81619 2001107030fc1227 2010104830f81613 62706161bf83203 d10f00006c101a22 6c10048223020241 1170d10f0000 90066f0fb250422 ea345403112da1af.
3 EDC0 correctable ECC data errors
EDC0 uncorrectable ECC data error
t5nex0: encountered fatal error, adapter stopped.


And here's the dmesg output on subsequent reboots:

Init:

Code:
t5nex0: <Chelsio T520-CR> mem 0xfb300000-0xfb37ffff,0xfa000000-0xfaffffff,0xfbb04000-0xfbb05fff irq 64 at device 0.4 numa-domain 1 on pci11
cxl0: <port 0> numa-domain 1 on t5nex0
cxl0: Ethernet address: 00:07:43:2a:78:00
cxl0: 16 txq, 8 rxq (NIC); 8 txq, 2 rxq (TOE)
cxl1: <port 1> numa-domain 1 on t5nex0
cxl1: Ethernet address: 00:07:43:2a:78:08
cxl1: 16 txq, 8 rxq (NIC); 8 txq, 2 rxq (TOE)
t5nex0: PCIe gen3 x8, 2 ports, 22 MSI-X interrupts, 71 eq, 21 iq
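
Since a PCIe slot problem is one of the things being ruled out here, the negotiated link reported in the init output can be checked after each slot change. This is only a minimal Python sketch: the function name parse_pcie_link is made up for illustration, and the regex assumes the line is worded exactly as in the output quoted above, which other driver versions may not match.

```python
import re

# Assumption: the line format matches the "t5nex0: PCIe gen3 x8, 2 ports, ..."
# line quoted above. Other driver versions may word it differently.
LINK_LINE = re.compile(
    r"^(?P<dev>\w+): PCIe gen(?P<gen>\d+) x(?P<width>\d+), (?P<ports>\d+) ports"
)


def parse_pcie_link(line):
    """Return the device name, PCIe generation, lane width and port count
    from a single init line, or None if the line does not match."""
    m = LINK_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "device": m.group("dev"),
        "gen": int(m.group("gen")),
        "width": int(m.group("width")),
        "ports": int(m.group("ports")),
    }


if __name__ == "__main__":
    sample = "t5nex0: PCIe gen3 x8, 2 ports, 22 MSI-X interrupts, 71 eq, 21 iq"
    print(parse_pcie_link(sample))
```

If the card came up at a lower generation or narrower width in one slot than in another, that would point at the slot or riser rather than the card. In the output above it negotiated gen3 x8.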


Subsequent error:

Code:
Fatal parity error (0x10)
t5nex0: encountered fatal error, adapter stopped.
EDC0 uncorrectable ECC data error
t5nex0: encountered fatal error, adapter stopped.
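
To compare the two error dumps above at a glance, the lines can be tallied by type. This is a rough Python sketch, not a driver tool: the function name summarize_adapter_errors and the counter keys are invented for illustration, and the patterns are copied from the messages quoted in this post, so differently worded messages will not be counted.

```python
import re
from collections import Counter

# Assumption: patterns mirror the exact wording of the dmesg lines quoted above.
FATAL_STOP = re.compile(r"^\w+: encountered fatal error, adapter stopped\.")
CORRECTABLE = re.compile(r"^(\d+) EDC\d+ correctable ECC data errors")
UNCORRECTABLE = re.compile(r"^EDC\d+ uncorrectable ECC data error")
PARITY = re.compile(r"^Fatal parity error")


def summarize_adapter_errors(dmesg_text):
    """Count fatal stops, parity errors and ECC errors in dmesg text."""
    counts = Counter()
    for raw in dmesg_text.splitlines():
        line = raw.strip()
        correctable = CORRECTABLE.match(line)
        if FATAL_STOP.match(line):
            counts["fatal_stops"] += 1
        elif correctable:
            # The log line carries its own count, e.g. "3 EDC0 correctable ..."
            counts["correctable_ecc"] += int(correctable.group(1))
        elif UNCORRECTABLE.match(line):
            counts["uncorrectable_ecc"] += 1
        elif PARITY.match(line):
            counts["parity_errors"] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = """Fatal parity error (0x10)
t5nex0: encountered fatal error, adapter stopped.
EDC0 uncorrectable ECC data error
t5nex0: encountered fatal error, adapter stopped."""
    print(summarize_adapter_errors(sample))
```

On FreeBSD/FreeNAS the input could be the saved output of dmesg. Uncorrectable ECC errors showing up again straight after a reboot, as in the second dump, are consistent with a fault on the card itself.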


It looks to me like my NIC has died, but after only 2-3 days of use, this seems pretty strange.
While I do have a fair bit of experience with servers (I'm a Linux sysadmin), all the servers I've built/installed/maintained have just been using 1G copper networking - this is my first time using SFP or 10G networking.

I do have the second card that I can drop in there and see if that works, but if there's any likelihood that this wasn't just a freak occurrence, I'm not too keen on risking killing the second card as well.

Is anyone able to offer any suggestions or guidance?
 
dlavigne

Guest
Did you figure this out? E.g., did it turn out to be a faulty card?
 

Buyasta

No, I'm still not sure what the deal is.
I'm planning on putting the other card in to test it, I just haven't had time yet - hopefully tomorrow.
 

Buyasta

Sorry for taking so long to update. I was finally able to take the time to swap the second T520 in there, and it's working nicely thus far.

So yeah, I'll just wait and see what happens with this one. Hopefully it was just bad luck that the other one failed, rather than some other underlying issue. Really the only external cause of the card dying that I can think of is if the motherboard PCIe is dodgy and threw too much power at it, but my RAID card, NVMe PCIe adapter and NVMe SSD are all working perfectly, so that seems unlikely, and I'm guessing it was just a bad card.
 