FreeNAS 11.2U5 random unexpected reboots

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
Sorry for the delay

NIC:

2 x 10G SFP+ ports via AOC-2UR68-i2XS (Intel® 82599ES)

So far, after upgrading to 11.3U2, we have been stable with no reboots.

Currently up for 11 days.

MTU is still set to 9000.

Since the common theme here is MTU and NICs, I'm wondering if somehow changing the MTU on the NIC doesn't apply straight away to the switch port it's connected to.

Something I've noticed with iSCSI in general is that it can suffer from high queue depth if there is a perceived drop between host and client. This can be due to fragmented packets and an incorrect MTU at one of the network points: host, client, or switch.

This shows up as growing iSCSI queue depth, which in turn creates high CPU load; that scenario then triggers the watchdog to reboot the host due to load.

I'm theorizing that the original setup has MTU 1500 (in our case it was),
this is changed to 9000 on the fly, and no manual reboot is performed.
Slowly the queue depth increases, eventually forcing a reboot; even if storage is idle, iSCSI packets are still sent/received constantly.
Possibly the switch port hasn't reset to the new MTU either, and this is an additional issue.
Some switches need the port shut down and brought back up to apply a change in MTU.
Switching the host back down to 1500 MTU then brings it back in line with the switch.

The above would explain why we are seeing similar but slightly different symptoms of the same underlying issue.

Just an observation, but potentially the best way to test would be:

Set the switch port to 9000 MTU
Down/up the port
Set the host NIC to 9000 MTU
Shut down the host
Start up the host
Repeat for the client
Perform an iperf network test to make sure packets are not fragmenting across the network; a ping test sized for 9000 MTU also works (example commands below).

You'll need to run through this process for the host, client, and switch so they are all set correctly.
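To make that last step concrete, here's a rough sketch of the commands I have in mind, assuming FreeBSD/FreeNAS on the host, the Intel 82599 showing up as ix0, and 10.0.0.10 standing in for the far end (adjust names and addresses to suit). A 9000-byte MTU leaves 8972 bytes of ICMP payload once the 20-byte IP and 8-byte ICMP headers are subtracted:

# Confirm the MTU actually applied on the interface
ifconfig ix0 | grep mtu

# Ping with the Don't Fragment bit set and a payload sized for a 9000 MTU.
# If this fails while a normal small ping works, something in the path is still at 1500.
ping -D -s 8972 10.0.0.10

# Throughput sanity check (run "iperf3 -s" on the far end first)
iperf3 -c 10.0.0.10 -t 30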

On another note, SuperMicro have advised we should flash the BIOS to update the CPU microcode, in case this is part of the issue.

thoughts?

""Cheers
G
 
Joined
Apr 21, 2020
Messages
3
After velocity08's post on the 21st, I made sure to set my switches back to their default MTU of 1518 (the full Ethernet frame: 1500 bytes of payload plus the 14-byte header and 4-byte FCS) to match my NICs set to 1500. I then ran a grueling 4+ hour random and sequential I/O test over NFS and came through just fine, with great performance to boot. Last evening, during a very, very low period of I/O use, two more watchdog resets fired at:

- Sun Apr 26 16:33:28 2020
- Sun Apr 26 22:17:45 2020

These are the first two to occur on the same day. No scrubs were running at that time, and there's nothing of substance in the logs leading up to the hangs. I need to get this box's IPMI NIC attached so I can disable the watchdog and see what's going on in its console next time this hang occurs. Does anyone else have any ideas?
 

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
Hi All

I've just had a report back from SuperMicro advising that there is a known Intel CPU issue across all hardware vendors, related to the CATERR/IERR error we are seeing in the BMC health logs that triggers the host reboot.

Their fix is to upgrade to the latest version of the BIOS, which has some microcode fixes to patch the CPU.

Current BIOS version is 3.2
Server SUPERMICRO SYS-2029U-E1CRTP
Upgrade BIOS Version - 3.3
Link to our board update - https://www.supermicro.com.tw/en/products/motherboard/X11DPU

see attachment from Intel.

We will be flashing tonight and will update after a few days of use.
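For anyone wanting to double-check the same things from the FreeNAS shell, this is the sort of thing I mean; it's only a sketch, the output will differ by board, and it assumes ipmitool is available on your install:

# BIOS version/date as the loader read them from SMBIOS
kenv smbios.bios.version
kenv smbios.bios.reldate

# Dump the BMC system event log and look for CATERR/IERR entries
# (add -H/-U/-P to query a remote BMC instead of the local one)
ipmitool sel elist | grep -i -e caterr -e ierr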

At present our system has been up for over 25 days using MTU 9000; the only change we have made is updating FreeNAS to the latest version, and so far the crashes have stopped.

It may just be a coincidence, which is very possible.

""Cheers
G
 

Attachments

  • CATERR_IERR symptoms.pptx (44 KB)
Joined
Apr 21, 2020
Messages
3
My approach has been to disable the watchdog through an rc.conf addition and a cronjob, since the rc.conf was getting overridden. Since disabling the watchdog, I've had no unexpected reboots, and the system hasn't hung or panicked during the last two weeks or so. I have iDRAC plugged in, so I can quickly reboot if a hang occurs, but all's been good for now.
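In case it helps anyone else, the gist of it looks roughly like the below. This is only a sketch assuming the stock FreeBSD watchdogd service is the watchdog in play; the exact tunable names may differ on your setup:

# Tell rc not to start the watchdog daemon, and stop the running one
sysrc watchdogd_enable="NO"
service watchdogd onestop

# FreeNAS can regenerate rc.conf, so re-assert it from cron as a belt-and-braces measure,
# e.g. a "crontab -e" entry that runs every 10 minutes:
*/10 * * * * /usr/sbin/service watchdogd onestop >/dev/null 2>&1

A FreeNAS Tunable of type rc.conf is probably the cleaner way to persist the setting, but the cron job covers the case where rc.conf gets rewritten anyway.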
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Hi All

I've just had a report back from SuperMicro advising that there is a known Intel CPU issue across all hardware vendors, related to the CATERR/IERR error we are seeing in the BMC health logs that triggers the host reboot.

Their fix is to upgrade to the latest version of the BIOS, which has some microcode fixes to patch the CPU.

Current BIOS version is 3.2
Server SUPERMICRO SYS-2029U-E1CRTP
Upgrade BIOS Version - 3.3
Link to our board update - https://www.supermicro.com.tw/en/products/motherboard/X11DPU

see attachment from Intel.

We will be flashing tonight and will update after a few days of use.

At present our system has been up for over 25 days using MTU 9000; the only change we have made is updating FreeNAS to the latest version, and so far the crashes have stopped.

It may just be a coincidence, which is very possible.

""Cheers
G

Interesting find... Any more information? The motherboard linked is for the latest Socket P 'scalable' processors. Does it go back to older processors? I can't speak for others, but the system I saw it on was an R710 with E5540 CPUs.
 

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
Interesting find... Any more information? The motherboard linked is for the latest Socket P 'scalable' processors. Does it go back to older processors? I can't speak for others, but the system I saw it on was an R710 with E5540 CPUs.
Hi @SubnetMask

I'll paste the quote below, directly from SuperMicro.

The errors you are seeing are symptoms of an INTEL issue that is known, and manifested across all vendors. There is a updated BIOS 3.3 which resolves all of the CATTERR issues.
Can you please update the Systems BIOS to 3.3 – and this should be able to be done remotely.

This was supplied with the attachment from the last post, plus some instructions to update the BIOS.

I would check your board and BIOS version for updates and read the release notes, as they would provide more information for your specific board and BIOS. The statement says it's across all vendors but doesn't go into specific CPU types, just pointing to Intel as the source of the issue, so it must be some sort of microcode problem.

Given Intel's history over the past decade-plus of pushing AMD out of the market, and all the side-channel vulnerabilities created in the pursuit of faster processing speeds, it doesn't surprise me that this sort of thing could go back across many generations of CPU, including current ones, due to the way the architecture is/was built.

I see this as a much larger hole they are constantly trying to plug.

Food for thought.

I'll let you know after the weekend once we do the update and run for a few days.

""Cheers
G
 

Posfolife2

Cadet
Joined
Jan 21, 2020
Messages
2
Hey guys, I've been fighting this same problem for almost 6 months on multiple R710 servers... was the final solution to disable the watchdog timer?
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
I never found a solution and reverted back to 11.1U7, and never had a problem until that FreeNAS box was decommissioned. Still running 11.1U7 on my personal one. I'm a bit 'gun-shy' of the newer versions since that issue...
 

Vision_Thing

Cadet
Joined
Nov 3, 2023
Messages
2
Hi All,

Apologies for posting to a thread that was last active in 2020, but I thought I'd share some information scraped together over the past two weeks:

1)
 

Vision_Thing

Cadet
Joined
Nov 3, 2023
Messages
2
Whoops, and apologies. Apparently "tab" is not a good idea.

What I have picked up:
1) There seem to be some bugs in the Dell UEFI code; several forums I browsed report this. It seems the UEFI/BIOS updates have come to an end, as Dell no longer provides any support for the R710/R720.
2) We were running TrueNAS V13 U5.2 & U5.3 as a VM on VMware ESXi 6.5 (free version, which I believe is no longer available) and did not experience any problems until we installed TrueNAS on bare metal. I do not consider this to be a TrueNAS issue, but rather the Dell firmware.
3) Several posts about putting PERC controllers into IT mode, and the possibility of bricking the controller, raise concerns for us about having to bin what is at present a functional server.
4) memtest (UEFI version) indicated "Your UEFI firmware has limited multi processor support", which I believe started pointing me in the right direction.
5) We are looking at going back to running TrueNAS as a VM on VMware or Proxmox. This seems to circumvent the cause of TrueNAS randomly rebooting. Limiting CPU cores, etc. through the BIOS would be throwing resources to the wind and would drop TrueNAS performance to an unacceptable level. VMware seems happy talking to the RAID, and TrueNAS seems happy to address VMware storage... not ideal or perfect, but workable.

I hope the above info is of some use to somebody else. Converting the Dell R710/720 to a different SATA/SAS controller (outside of Dell's offerings) would require modifying the chassis, which we are not willing to do (we would rather replace the hardware).
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
1) There seem to be some bugs in the Dell UEFI code; several forums I browsed report this. It seems the UEFI/BIOS updates have come to an end, as Dell no longer provides any support for the R710/R720.
Gen 12 was only dropped fairly recently. Beyond that, there are no widespread reports of issues on Dell systems. Plus, FreeNAS 11 is pretty long in the tooth by now.
TrueNAS seems happy to address VMware storage... not ideal or perfect, but workable.
No, very, very far from workable, except for the boot pool, which is essentially disposable. The details are amply documented.
Converting the Dell R710/720 to a different SATA/SAS controller (outside of Dell's offerings) would require modifying the chassis, which we are not willing to do (we would rather replace the hardware).
No, that is incorrect. Even if you don't want to crossflash the controller on a Gen 12, for both Gen 11 and Gen 12 you can just add whatever controller you want in the standard PCIe slots. The SAS cables are all bog standard SFF-8087, though if you need to replace them it can be awkward to find the right lengths and connector orientations. This is mostly an issue if you want to use a newer controller with SFF-8643 connectors.
 