Hey friends.
I recently upgraded my home lab and installed Mellanox ConnectX-3 dual-port 40 Gbps QSFP cards in all of my systems.
My TrueNAS system runs on a dedicated machine and is connected to my virtualization server through two 40 Gbps links with LACP enabled.
All virtual machines on the virtualization server run from iSCSI storage served by the TrueNAS machine over this network connection.
Some of my machines are mission critical.
Unfortunately, I am experiencing issues on the TrueNAS side. After 14 days of uptime, the NIC failed during the night.
My issue seems to be similar to this one: https://www.truenas.com/community/threads/melanox-connectx-3.73634/ but there is no solution there yet.
Are there any suggestions for mitigating this?
Thanks in advance!
Jan 11 00:13:07 truenas kernel: pid 13537 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 00:37:45 truenas kernel: pid 13827 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 01:39:13 truenas MCA: Bank 15, Status 0x9c2030000000011b
Jan 11 01:39:13 truenas MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Jan 11 01:39:13 truenas MCA: Vendor "AuthenticAMD", ID 0x800f82, APIC ID 0
Jan 11 01:39:13 truenas MCA: CPU 0 COR GCACHE LG RD error
Jan 11 01:39:13 truenas MCA: Address 0x40000000a8d0a00
Jan 11 01:39:13 truenas MCA: Misc 0xd01b0fff01000000
Jan 11 01:39:19 truenas mlx4_core0: Internal error detected:
Jan 11 01:39:19 truenas mlx4_core0: buf[00]: 00180c40
Jan 11 01:39:19 truenas mlx4_core0: buf[01]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[02]: 202a1388
Jan 11 01:39:19 truenas mlx4_core0: buf[03]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[04]: 00180c40
Jan 11 01:39:19 truenas mlx4_core0: buf[05]: 0021c500
Jan 11 01:39:19 truenas mlx4_core0: buf[06]: 00000001
Jan 11 01:39:19 truenas mlx4_core0: buf[07]: 00200630
Jan 11 01:39:19 truenas mlx4_core0: buf[08]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[09]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0a]: 000101f5
Jan 11 01:39:19 truenas mlx4_core0: buf[0b]: 00000043
Jan 11 01:39:19 truenas mlx4_core0: buf[0c]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0d]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0e]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: buf[0f]: 00000000
Jan 11 01:39:19 truenas mlx4_core0: device is going to be reset
Jan 11 01:39:20 truenas mlx4_core0: device was reset successfully
Jan 11 01:39:20 truenas kernel: mlx4_en mlx4_core0: Internal error detected, restarting device
Jan 11 01:39:20 truenas kernel[1896]: Last message 'mlx4_en mlx4_core0: ' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:20 truenas mlx4_core0: command 0x49 failed: fw status = 0x1
Jan 11 01:39:21 truenas kernel: lagg0: link state changed to DOWN
Jan 11 01:39:21 truenas kernel: mlx4_en: mlxen1: Failed activating Rx CQ
Jan 11 01:39:21 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Fai' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:21 truenas kernel: mlxen1: link state changed to DOWN
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.2005-10.org.freenas.ctl): connection error; dropping connection
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.1993-08.org.debian:01:17166b15916c): connection error; dropping connection
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.2005-10.org.freenas.ctl): connection error; dropping connection
Jan 11 01:39:23 truenas WARNING: 192.168.7.10 (iqn.1993-08.org.debian:01:17166b15916c): connection error; dropping connection
Jan 11 01:39:28 truenas mlx4_core0: Unable to determine PCI device chain minimum BW
Jan 11 01:39:28 truenas kernel: mlx4_en mlx4_core0: Activating port:1
Jan 11 01:39:28 truenas kernel: mlxen0: Ethernet address: 24:be:05:c4:4a:21
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 1: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlxen0: link state changed to DOWN
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 1: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen0: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen0: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen0: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen0: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen0: Initializing port
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen0: Ini' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en mlx4_core0: Activating port:2
Jan 11 01:39:28 truenas kernel: mlxen1: Ethernet address: 24:be:05:c4:4a:22
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 2: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlxen1: link state changed to DOWN
Jan 11 01:39:28 truenas kernel: mlx4_en: mlx4_core0: Port 2: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlx4_core0:' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen1: Using 16 TX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen1: Using 16 RX rings
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Usi' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas kernel: mlx4_en: mlxen1: Initializing port
Jan 11 01:39:28 truenas kernel[1896]: Last message 'mlx4_en: mlxen1: Ini' repeated 1 times, suppressed by syslog-ng on truenas.lan
Jan 11 01:39:28 truenas mlx4_core0: mlx4_restart_one was ended, ret=0
Jan 11 01:39:31 truenas kernel: mlx4_en: mlxen0: Link Up
Jan 11 01:39:31 truenas kernel: mlxen0: link state changed to UP
Jan 11 01:39:31 truenas kernel: mlx4_en: mlxen1: Link Up
Jan 11 01:39:31 truenas kernel: mlxen1: link state changed to UP
Jan 11 03:45:42 truenas kernel: pid 14383 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 04:32:00 truenas kernel: pid 17312 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 04:32:09 truenas kernel: pid 17977 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 04:32:13 truenas kernel: pid 17979 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 07:18:41 truenas kernel: pid 17981 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:14:53 truenas kernel: pid 17982 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:20:42 truenas kernel: pid 21198 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:23:05 truenas kernel: pid 21326 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 08:23:14 truenas kernel: pid 21290 (httpd), jid 0, uid 0: exited on signal 11
Jan 11 10:44:36 truenas kernel: mlx4_en: mlxen0: Link Down
Jan 11 10:44:36 truenas kernel: mlx4_en: mlxen1: Link Down
Jan 11 10:44:36 truenas kernel: mlxen0: link state changed to DOWN
Jan 11 10:44:36 truenas kernel: mlxen1: link state changed to DOWN
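
One thing that stands out in the log above: a corrected machine check (the MCA lines, "COR GCACHE LG RD error" on CPU 0) is logged at 01:39:13, six seconds before the mlx4_core0 internal error at 01:39:19. A single occurrence does not prove the two are related. To check whether the pattern repeats across a longer log, here is a minimal Python sketch. It is not an official TrueNAS or Mellanox tool. The function names, the 60-second window, and the placeholder year are my own assumptions (syslog timestamps carry no year, so a fixed one is used only to compute time differences).

```python
import re
import sys
from datetime import datetime

# Syslog prefix as seen in the log above: "Jan 11 01:39:13 truenas <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ (.*)$")
# Only the driver's own report line, not the later "kernel: mlx4_en ..." echo.
NIC_RE = re.compile(r"^mlx4_core\d+: Internal error detected")


def parse_time(stamp, year=2000):
    """Parse a year-less syslog timestamp; `year` is an arbitrary placeholder."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def find_events(lines):
    """Return (mca_times, nic_error_times), de-duplicated per second."""
    mca, nic = [], []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue
        when, msg = parse_time(m.group(1)), m.group(2)
        if msg.startswith("MCA:"):
            if when not in mca:
                mca.append(when)
        elif NIC_RE.match(msg):
            if when not in nic:
                nic.append(when)
    return mca, nic


def mca_before_nic(mca, nic, window_s=60):
    """For each NIC error, report the latest MCA event within window_s before it."""
    pairs = []
    for n in nic:
        prior = [m for m in mca if 0 <= (n - m).total_seconds() <= window_s]
        if prior:
            last = max(prior)
            pairs.append((last, n, (n - last).total_seconds()))
    return pairs


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an example): python mca_check.py /var/log/messages
    with open(sys.argv[1], errors="replace") as fh:
        mca_times, nic_times = find_events(fh)
    for mca_t, nic_t, gap in mca_before_nic(mca_times, nic_times):
        print(f"MCA at {mca_t:%b %d %H:%M:%S} -> NIC error {gap:.0f}s later")
    print(f"{len(mca_times)} MCA event(s), {len(nic_times)} NIC internal error(s)")
```

If most NIC internal errors turn out to have an MCA event shortly before them, that would point toward the host hardware (CPU, memory, or PCIe slot) rather than the card or driver alone. If they do not, the MCA here may just be a coincidence.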