Chelsio T6225-CR ECC Error reporting

SteveinSea

I'm seeing an uncorrectable ECC error in the dmesg log but not in the IPMI log. I think this is because the message is being generated by an adapter failure and not by a DIMM. Since uncorrectable ECC errors are so rare, I haven't touched any hardware yet. First, I want to figure out whether there is a hole in my monitoring that caused it to miss a DIMM problem.

This problem is on an older system. Luckily, I have two identical systems (firmware, software, hardware, etc.). Both have been running TrueNAS/FreeNAS for at least 5 years, and I upgraded both to TrueNAS SCALE 23.10.2 last week. One of them is throwing error messages that show up in dmesg. The other works fine.

On one boot attempt, it generated this:
[ 3.264782] cxgb4 0000:02:00.4: command 0x6 in mailbox 4 timed out
[ 3.264860] cxgb4 0000:02:00.4: Firmware reports adapter error: Crash
[ 3.264939] cxgb4 0000:02:00.4: encountered fatal error, adapter stopped
[ 3.265015] cxgb4 0000:02:00.4: "Firmware Default" configuration file error 6
[ 3.265095] cxgb4 0000:02:00.4: could not initialize adapter, error 6

On another boot attempt, it generated this:
[ 1.909033] cxgb4 0000:03:00.4: Coming up as MASTER: Initializing adapter
[ 2.636676] cxgb4 0000:03:00.4: Direct firmware load for cxgb4/t6-config.txt failed with error -2
[ 2.748517] cxgb4 0000:03:00.4: Successfully enabled ppod edram feature
[ 3.256520] cxgb4 0000:03:00.4: Successfully configured using Firmware Configuration File "Firmware Default", version 0x0, computed checksum 0x0
[ 3.576583] cxgb4 0000:03:00.4: max_ordird_qp 21 max_ird_adapter 407232
[ 3.624590] cxgb4 0000:03:00.4: ppod edram start 0x64ce00 end 0x7fffff size 0x1b3200
[ 3.660592] cxgb4 0000:03:00.4: Current filter mode/mask 0x632b:0x21
[ 3.742903] cxgb4 0000:03:00.4: 194 MSI-X vectors allocated, nic 32 eoqsets 32 per uld 16 mirrorqsets 32
[ 3.743023] cxgb4 0000:03:00.4: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 3.778482] cxgb4 0000:03:00.4 eth0: Chelsio T6225-CR 1G/10G/25GBASE-SFP28
[ 3.779285] cxgb4 0000:03:00.4 eth1: Chelsio T6225-CR 1G/10GBASE-SFP28
[ 3.824572] cxgb4 0000:03:00.4: Chelsio T6225-CR rev 0
[ 3.824657] cxgb4 0000:03:00.4: S/N: PT09210007, P/N: 110120960D0
[ 3.824735] cxgb4 0000:03:00.4: Firmware version: 1.27.1.0
[ 3.824815] cxgb4 0000:03:00.4: Bootstrap version: 255.255.255.255
[ 3.824897] cxgb4 0000:03:00.4: TP Microcode version: 0.1.23.2
[ 3.824978] cxgb4 0000:03:00.4: Expansion ROM version: 2.0.0.8
[ 3.825059] cxgb4 0000:03:00.4: Serial Configuration version: 0x1003000
[ 3.825141] cxgb4 0000:03:00.4: VPD version: 0x83
[ 3.825217] cxgb4 0000:03:00.4: Configuration: RNIC MSI-X, Offload capable
[ 3.844659] cxgb4 0000:03:00.4 enp3s0f4d1: renamed from eth1
[ 3.884612] cxgb4 0000:03:00.4 enp3s0f4: renamed from eth0
[ 28.004568] cxgb4 0000:03:00.4 enp3s0f4: SR module inserted
[ 28.032568] cxgb4 0000:03:00.4 enp3s0f4d1: SR module inserted
[ 44.356647] cxgb4 0000:03:00.4: Firmware reports adapter error: Crash
[ 44.357591] cxgb4 0000:03:00.4: command 0x17 in mailbox 4 timed out
[ 44.357956] cxgb4 0000:03:00.4: CIM TIMER0 interrupt (0x4)
[ 44.358957] cxgb4 0000:03:00.4: Firmware reports adapter error: Crash
[ 44.360174] cxgb4 0000:03:00.4: CIM illegal transaction (0x2)
[ 44.362705] cxgb4 0000:03:00.4: encountered fatal error, adapter stopped
[ 44.363604] cxgb4 0000:03:00.4: encountered fatal error, adapter stopped
[ 44.364837] cxgb4 0000:03:00.4: T4 fatal parity error (0x10)
[ 44.365671] cxgb4 0000:03:00.4: encountered fatal error, adapter stopped
[ 44.366774] cxgb4 0000:03:00.4: MC/MC0 uncorrectable ECC data error
[ 44.368280] cxgb4 0000:03:00.4: encountered fatal error, adapter stopped
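
Every line in the second log carries the cxgb4 prefix, so the ECC message was printed by the Chelsio driver. One way to check whether the kernel also reported anything about host memory is to sort the dmesg lines by emitter. This is a minimal sketch, assuming host-memory errors would show up with an "EDAC " or "mce:" prefix (the usual Linux convention, not something that appears in this log). The names classify and memory_events are hypothetical helpers, not part of any tool.

```python
import re

# Matches a dmesg line such as "[ 44.366774] cxgb4 0000:03:00.4: ..."
DMESG_LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")

def classify(line):
    """Split one dmesg line into timestamp, emitter, and message."""
    m = DMESG_LINE.match(line.strip())
    if not m:
        return None
    msg = m.group("msg")
    if msg.startswith("cxgb4 "):
        source = "adapter"          # printed by the Chelsio NIC driver
    elif msg.startswith("EDAC ") or msg.startswith("mce:"):
        source = "host-memory"      # assumed prefixes for host RAM errors
    else:
        source = "other"
    return {"ts": float(m.group("ts")), "source": source,
            "ecc": "ECC" in msg, "msg": msg}

def memory_events(lines):
    """Keep lines that mention ECC or come from the host-memory reporters."""
    parsed = (classify(line) for line in lines)
    return [p for p in parsed if p and (p["ecc"] or p["source"] == "host-memory")]
```

Feeding it the output of dmesg (for example via sys.stdin) and grouping the result by "source" shows whether any ECC report came from something other than the adapter driver.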

ipmitool and the IPMI reporting system don't indicate an ECC uncorrectable event for any DIMM.
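
The same check against the system event log can be scripted. This sketch assumes the pipe-separated layout that ipmitool sel elist normally prints (record id, date, time, sensor, event, direction) and that memory events use a sensor name starting with "Memory"; the exact wording can differ between BMC firmware versions. The function names are hypothetical.

```python
import subprocess

def read_sel():
    """Return the text of `ipmitool sel elist` (needs root and a local BMC)."""
    result = subprocess.run(["ipmitool", "sel", "elist"],
                            capture_output=True, text=True, check=True)
    return result.stdout

def memory_ecc_entries(sel_text):
    """Pick out SEL records whose sensor is memory and whose event mentions ECC."""
    hits = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not a regular SEL record line
        sensor, event = fields[3], fields[4]
        if sensor.lower().startswith("memory") and "ecc" in event.lower():
            hits.append({"id": fields[0], "date": fields[1],
                         "time": fields[2], "sensor": sensor, "event": event})
    return hits
```

Calling memory_ecc_entries(read_sel()) and getting an empty list matches what is described above: no ECC event recorded for any DIMM.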

The system, which has other configured NICs, seems to function fine on those NICs. SMB multichannel also still works; it just doesn't use the interfaces on the Chelsio card.

This could be (and likely is) an adapter failure. The system has a Chelsio T6225-CR because I have several of them, not because it delivers SMB at 25 Gbps. But I want to know whether the error more likely points to a failing DIMM or a failing card, or whether that can't be determined from these messages. If it is a DIMM and the uncorrectable ECC event isn't being recorded in the system event log, I have other problems.
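
A second, independent place to look for DIMM errors is the kernel's EDAC counters in sysfs. This is a rough sketch, assuming an EDAC driver for this platform is loaded and exposes the standard ce_count and ue_count files under /sys/devices/system/edac/mc; if it isn't loaded, the function simply returns an empty result. edac_counts is a hypothetical helper name.

```python
from pathlib import Path

def edac_counts(root="/sys/devices/system/edac/mc"):
    """Read corrected (ce) and uncorrected (ue) error counts per memory controller."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC not loaded or not supported on this platform
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts
```

A non-zero ue_count alongside an empty system event log would point to the monitoring gap described above; zero counts on every controller would be consistent with the error being confined to the card.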

The forum rules ask me to list a bunch of information about my hardware config that I think is largely irrelevant, but I'll include it anyway.
  • Motherboard make and model: Supermicro X9DRH-7F, firmware Revision: 3.62, IPMI Version: 2.0
  • CPU make and model: Intel Xeon E5-2643 v2 @ 3.50GHz
  • RAM quantity: 256 GB
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives: The system has 26 SAS-3 hard drives and a Supermicro SATA DOM for boot. I really hope no one wants to read their configurations. Three of the drives are just hot standby.
  • Hard disk controller: Supermicro AOC-S3008L-L8E 12 Gbps SAS-3 HBA (P16, IT mode) connected to a Supermicro SAS-3 EL1 expander backplane. The system also has an onboard SAS-2 controller that drives backup boot disks for the rear removable 2.5" SAS-3 drives.
  • Network cards: dual onboard Intel I350 1 Gb/s NICs plus an IPMI NIC. Chelsio T6225-CR with one 10 Gbps SFP+ and one 25 Gbps SFP+.
Any insight into what these error messages typically indicate would be useful. Some of the behavior looks similar to another Chelsio thread about a dead adapter.
 