IPMI Watchdog 2 Hard Resets

Status
Not open for further replies.

schoffman

Dabbler
Joined
Jan 21, 2016
Messages
18
Hello,


I just built a Supermicro X11SSH-F Skylake server and have been getting IPMI watchdog hard resets during high server loads. The IPMI log for the system is:

1 2016/03/25 02:24:13 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion
2 2016/03/25 02:24:14 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion
3 2016/03/27 06:26:54 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion
4 2016/03/27 06:26:55 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion
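
For anyone who wants to line these entries up against cron jobs or Plex's maintenance window, here is a minimal Python sketch that pulls the hard reset timestamps out of log text in the format shown above. The names `parse_sel` and `hard_reset_times` are made up for illustration, and the parser assumes every line follows the "id date time description" layout of the four entries in this post.

```python
from datetime import datetime

# The four entries from the IPMI log above, pasted verbatim.
SEL_TEXT = """\
1 2016/03/25 02:24:13 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion
2 2016/03/25 02:24:14 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion
3 2016/03/27 06:26:54 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion
4 2016/03/27 06:26:55 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion
"""


def parse_sel(text):
    """Split each line into (entry id, timestamp, description)."""
    events = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or truncated lines
        entry_id, date, time, description = parts
        when = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M:%S")
        events.append((int(entry_id), when, description))
    return events


def hard_reset_times(events):
    """Return the timestamps of the entries that record a hard reset."""
    return [when for _, when, desc in events if "Hard Reset" in desc]


resets = hard_reset_times(parse_sel(SEL_TEXT))
for when in resets:
    print(when.isoformat(sep=" "))
```

Comparing the printed times with the start of the scheduled maintenance tasks should show whether the resets always land inside the same window.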

Doing research I found this command:
root@Test:/ # ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 137 sec
Present Countdown: 136 sec
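
To check the watchdog state from a script rather than by eye, something like the following Python sketch could parse that output. It works on the text pasted above rather than calling ipmitool itself; `parse_watchdog` and `watchdog_is_armed` are hypothetical helper names, and the check assumes a stopped timer is reported with a value that does not begin with "Started".

```python
# Output of "ipmitool mc watchdog get" as pasted above.
WATCHDOG_TEXT = """\
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 137 sec
Present Countdown: 136 sec
"""


def parse_watchdog(text):
    """Turn 'Key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def watchdog_is_armed(fields):
    """True when the BMC reports the timer as started (assumed wording)."""
    return fields.get("Watchdog Timer Is", "").startswith("Started")


info = parse_watchdog(WATCHDOG_TEXT)
print(watchdog_is_armed(info), info["Watchdog Timer Actions"])
```

With the output above, the timer is reported as running with a hard reset action, which matches the entries in the IPMI log.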

I have had 4 hard resets over the course of 1.5 weeks, and all of them occurred during the Plex maintenance task that generates preview thumbnails for video files. This creates server loads of around 3.4-3.8 on my 4-core, 8-thread Intel E3-1240 v5 CPU. Since my volume is encrypted, nothing runs after a hard reset until I unlock the volume and restart the jails. To debug it so far, I have moved from the Plex plugin to a newer version of Plex installed in a jail. I also just upgraded from the latest 9.3 stable to FreeNAS 9.10 stable, and the problem occurred again last night. I have now issued a “killall watchdogd” on the system to disable the watchdog timer and see whether the system is actually hanging. I confirmed through IPMI that the watchdog is disabled.

I did burn in my system using Win10 on USB with Passmark "Burn In Test", which can test all the hardware, and I never received an error. Since my build was not trouble free (I struggled with Supermicro's lack of fan control), I didn't document exactly how many burn-in tests I ran, but there were never any errors. The CPU and memory have at least 24 hours of testing.


Does anybody have any suggestions to try next? Are there any logs that I should check?


System:
Supermicro X11SSH-F
Intel Xeon E3-1240 v5
2 x Supermicro-recommended Samsung 16 GB unbuffered ECC RAM for a total of 32 GB
6 x WD Red 4 TB in RAID-Z2, encrypted
250 GB Samsung SSD boot drive (plan to move to a USB drive now that 9.10 is out)

Plex in a jail
HDHomeRun DVR record engine in another jail
 

schoffman

Dabbler
Joined
Jan 21, 2016
Messages
18
So, an update:

I updated the BIOS and the IPMI firmware. Both were originally built in 9/15 and now they are up to 12/15.
The BIOS went from 1.0 to 1.0a (I couldn't find a changelog).
The IPMI firmware went from 0.50 to 1.11 (again, no changelog).

I should have said in the previous post that I had disabled the watchdog in the BIOS and with the motherboard jumper. Now, with the firmware updates, the watchdog is disabled in the IPMI even though watchdogd is running on the OS.

This makes me believe that something may have been fixed in relation to the watchdog. I don't think I will be enabling it via the jumper anytime soon, as long as my system remains responsive.

As a side note, this being my first experience with Supermicro, I hope that their hardware design exceeds their software and website design. It was pretty involved to update the BIOS, and Supermicro's lack of fan control is frustrating compared to the top consumer boards.
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
Just a note: the 9.10 update from the last few days got some IPMI watchdog improvements.
 

schoffman

Dabbler
Joined
Jan 21, 2016
Messages
18
mav@ said:
Just a note: the 9.10 update from the last few days got some IPMI watchdog improvements.

Thank you for your response. I did notice that, and I did update to that version, but I still had the problem. I'm hoping the problem was in the IPMI firmware and was fixed by the update. There really isn't any reason for a watchdog when you have IPMI and an encrypted volume, because a hard reset doesn't bring the system back up: I have to manually enter the password. And if the system does ever hang, I can still hard reset it remotely with IPMI.
 