Hello,
I just built a Supermicro X11SSH-F Skylake server and have been getting IPMI watchdog hard resets during high server load. The IPMI event log for the system shows:
1 2016/03/25 02:24:13 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion
2 2016/03/25 02:24:14 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion
3 2016/03/27 06:26:54 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion
4 2016/03/27 06:26:55 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion
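To line the resets up against other activity (for example a scheduled maintenance window), the hard-reset timestamps can be pulled out of the log. This is only a minimal Python sketch, assuming the log lines are available as plain text in exactly the format shown above; the function names are made up for illustration.

```python
from datetime import datetime

# Sample lines copied from the IPMI event log above.
SEL_LINES = [
    "1 2016/03/25 02:24:13 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion",
    "2 2016/03/25 02:24:14 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion",
    "3 2016/03/27 06:26:54 Watchdog 2 #0xca Watchdog 2 Timer Interrupt - Assertion",
    "4 2016/03/27 06:26:55 Watchdog 2 #0xca Watchdog 2 Hard Reset - Assertion",
]


def parse_sel_line(line):
    """Split one log line into (entry id, timestamp, description)."""
    entry_id, date, time, rest = line.split(maxsplit=3)
    stamp = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M:%S")
    return int(entry_id), stamp, rest


def hard_reset_times(lines):
    """Return the timestamps of every 'Hard Reset' entry, in log order."""
    parsed = (parse_sel_line(line) for line in lines)
    return [stamp for _, stamp, rest in parsed if "Hard Reset" in rest]


resets = hard_reset_times(SEL_LINES)
for stamp in resets:
    print(stamp.isoformat(sep=" "))
```

The printed times can then be compared by hand with whatever was scheduled on the server at that moment.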
While researching this, I found the following command:
root@Test:/ # ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 137 sec
Present Countdown: 136 sec
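For anyone who wants to check this state from a script rather than by eye, the output above can be turned into key/value pairs. This is a rough Python sketch, assuming the output format is exactly as pasted above; `parse_watchdog` and `seconds` are hypothetical helper names, not part of ipmitool.

```python
# Sample output copied from "ipmitool mc watchdog get" above.
SAMPLE = """\
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 137 sec
Present Countdown: 136 sec
"""


def parse_watchdog(text):
    """Map each 'Name: value' line to a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip any line without a colon
            fields[key.strip()] = value.strip()
    return fields


def seconds(value):
    """Take the leading integer from a value such as '137 sec'."""
    return int(value.split()[0])


status = parse_watchdog(SAMPLE)
running = status["Watchdog Timer Is"].startswith("Started")
action = status["Watchdog Timer Actions"]
initial = seconds(status["Initial Countdown"])

print(f"running={running} action={action} initial={initial}s")
```

With the sample above, the timer is running, the configured action is a hard reset, and the countdown starts at 137 seconds.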
I have had 4 hard resets over the course of 1.5 weeks, and they all occurred during Plex maintenance, which generates preview thumbnails for video files. This creates a server load of around 3.4-3.8 on my 4-core, 8-thread Intel Xeon E3-1240 v5 CPU. Since my volume is encrypted, nothing runs after a hard reset until I unlock the volume and restart the jails. To debug it so far, I have moved from the Plex plugin to a newer version of Plex installed in a jail. I also just upgraded from the latest FreeNAS 9.3 stable to 9.10 stable, and the problem occurred again last night. I have now issued "killall watchdogd" on the system to disable the watchdog timer and see whether the system is actually hanging. I confirmed the watchdog is disabled by checking IPMI.
I did burn in my system using Windows 10 booted from USB with Passmark "Burn In Test", which can test all the hardware, and I never received an error. Since my build was not trouble free (I struggled with Supermicro's lack of fan control), I didn't document exactly how many burn-in tests I ran, but there were never any errors. The CPU and memory have at least 24 hours of testing.
Does anybody have any suggestions to try next? Are there any logs that I should check?
System:
Supermicro X11SSH-F
Intel Xeon E3-1240 v5
2 x Supermicro-recommended Samsung 16 GB unbuffered ECC RAM, 32 GB total
6 x WD Red 4 TB in RAID-Z2, encrypted
250 GB Samsung SSD boot drive (plan to move to a USB drive now that 9.10 is out)
Plex in a jail
HDHomeRun DVR record engine in another jail