Watchdog errors

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
Hello all, I've been battling watchdog errors lately. It's actually been happening since I purchased the server in December of 2015, but it has been happening more and more frequently. I've searched the forums but haven't been successful in finding a solution for this. I reached out to Supermicro and they had me update my BIOS and IPMI last weekend, but 6 days later the NAS shut down again.

Here is the log from the server.

20 2019/02/22 09:37:02 #0xca Watchdog 2 Timer Interrupt - Assertion
21 2019/02/22 09:37:03 #0xca Watchdog 2 Hard Reset - Assertion

FreeNAS version is 11.2-RELEASE-U1

Does anyone have any suggestions on what I can do?
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
I was having the same issue with our production servers and ended up simply disabling watchdog. I never want my systems to auto-reboot.
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
My watchdog setting in the BIOS is actually disabled, forgot to mention that. It still failed when I pulled the jumper for the watchdog.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
I also had to put in the rc tunable watchdogd_enable and set it to NO.
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
Thanks for the info. I haven't really worked with tunables but was able to find some documentation on this. Do I use single quotes, double quotes, or no quotes around the value NO?
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
No quotes.
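
For anyone wondering why the quoting question comes up at all, here is a minimal sketch. It assumes the tunable ends up as an ordinary rc.conf assignment, and rc.conf is read as sh, so all three spellings mean the same thing there. On FreeNAS you would still add it through the GUI as an rc.conf-type tunable and type the bare value NO, rather than editing the file by hand.

```shell
# Assumption: the tunable becomes a plain sh-style assignment in rc.conf.
# To the shell these three lines are equivalent; the last one wins, and
# the variable holds the two characters NO either way.
watchdogd_enable=NO
watchdogd_enable='NO'
watchdogd_enable="NO"
```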
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
Tunable added and FreeNAS updated to 11.2 Update 2. Thanks for the help. Hopefully this post helps others because I wasn't able to find this tunable mentioned in other posts.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Make sure you reboot or explicitly disable watchdogd after putting in this tunable. I think service watchdogd stop should do it.
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
Update: No shutdowns since 2/23/19 until today. When looking at the IPMI logs, the last entry shows the watchdog shutdown on 2/23. The logs in /var/log don't show much either, until I powered up the NAS this morning. Any suggestions on what the next steps could be? Hardware issues?
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Watchdog doesn't shut the system down afaik, only reboots it. Are you saying the system completely shut down?
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
Yes, even when I had the watchdog errors, I had to actually power the host back up. It DOES NOT reboot.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Hrmm perhaps it's something else then and that was triggering the watchdog logs as well.

Is this system plugged into some sort of battery backup unit?

Anything in the BIOS event logs?

Anything in the graphs you can correlate to this, like a spike in CPU?

Did you test the ram when you built the system?
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
System is plugged into a UPS but I haven't had a chance to configure the NMC card to take a look. The other servers plugged into the same UPS are not shutting down.

I haven't checked BIOS logs, only IPMI. Is there a BIOS log other than IPMI?

Graphs don't help because upon reboot, they start over.

The system was fully built by Supermicro and they did do a burn-in test on all the hardware.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Power is probably stable then.

Try running ipmitool sel elist to get the system event log. See if anything jumps out.
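
The event log can get long, so here is a rough sketch of a filter that keeps only the lines mentioning watchdog or thermal events. The function name sel_interesting is made up for this example, and the keyword list is a guess based on the entries posted in this thread; widen it if your log words things differently.

```shell
# Hypothetical helper: keep only SEL lines that mention watchdog or
# thermal events (case-insensitive). Reads the log on stdin, so it works
# on live output or on a copy saved to a file.
sel_interesting() {
    grep -iE 'watchdog|thermal'
}

# Usage on the NAS (not run here):
#   ipmitool sel elist | sel_interesting
```

Reading from stdin means you can also paste the log out of the IPMI web page into a file and run the same filter on that.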

The thing with the graphs is frustrating. I have a full external monitoring system but it's not complete due to the way FreeNAS works. That's an area that's always been a pain for me.

Even though they say they burned the system in, I would still do a basic due diligence memory test (try MemTest or Memtest86+). You don't have to let it run too long. Bad memory seems to out itself quickly (though in some cases not). I suspect this isn't the issue but I'd still do it to be thorough.

I'd also double check CPU temps.

Fire up a jail and inside the jail install stress (pkg install stress), then run stress --cpu NUMBER_OF_CPUS. Run dmesg | grep -i "Multiprocessor System Detected" to get the total number of CPUs/cores the system detected.
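
If you'd rather not copy the number by hand, a small sketch like this pulls it out of the dmesg line. It assumes the line looks like "FreeBSD/SMP: Multiprocessor System Detected: 16 CPUs" (that format is from memory, so check it against your own output), and the function name ncpu_from_dmesg is just made up for the example.

```shell
# Hypothetical helper: extract the CPU count from the SMP detection line.
# Assumes the line ends in "...Detected: <N> CPUs"; prints N, or nothing
# if no such line is found on stdin.
ncpu_from_dmesg() {
    sed -n 's/.*Multiprocessor System Detected: \([0-9][0-9]*\) CPUs.*/\1/p' | head -n 1
}

# Usage on the main system (not run here):
#   dmesg | ncpu_from_dmesg
# then use that number in the jail:
#   stress --cpu NUMBER_OF_CPUS
```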

Then in another window on the main system run ipmitool sensor | grep CPU | awk '{print $4}' to get the CPU temps. I only have one CPU so the output looks like this:

Code:
# ipmitool sensor | grep CPU | awk '{print $4}'
35.000
na


Run that for a while and see what your temps look like.
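
Since the graphs start over on reboot, it may help to append the readings to a file so they are still there after a shutdown. A rough sketch: log_readings is a made-up helper and the log path in the usage comment is only a placeholder for somewhere on your pool.

```shell
# log_readings FILE COUNT INTERVAL CMD [ARGS...]
# Hypothetical helper: runs CMD COUNT times, INTERVAL seconds apart, and
# appends one timestamped line per run to FILE, so the readings survive
# a crash or power-off.
log_readings() {
    file=$1; count=$2; interval=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Join multi-line output (one value per CPU) onto a single line.
        reading=$("$@" | tr '\n' ' ')
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$reading" >> "$file"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
}

# Usage on the NAS (placeholder path, not run here): one sample every
# 5 seconds for an hour.
#   log_readings /mnt/tank/cpu_temps.log 720 5 \
#       sh -c "ipmitool sensor | grep CPU | awk '{print \$4}'"
```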

If you want to see what the max temp for your CPU is, run dmesg | grep CPU: and get the CPU model. Mine is Intel(R) Xeon(R) CPU E5-2620 v3 so I Googled e5-2620 v3 maximum operating temperature or you might be able to just use CPU World to search for your CPU directly. My max temp is 72.6c.

Check if the sensors output is getting close to that max temp. I suspect not, but always good to check.
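
If you don't want to eyeball it, something like this compares the readings against the limit for you. It is only a sketch: check_temps is a made-up name, it expects one reading per line on stdin (the same shape as the output above), and it skips non-numeric values like na.

```shell
# check_temps LIMIT
# Hypothetical helper: reads one temperature per line on stdin, ignores
# non-numeric readings such as "na", prints the highest value seen, and
# returns 1 if that value is at or above LIMIT (2 if nothing was readable).
check_temps() {
    limit=$1
    awk -v limit="$limit" '
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ { if ($1 + 0 > max) max = $1 + 0; n++ }
        END {
            if (n == 0) { print "no readings"; exit 2 }
            printf "max %.1f C, limit %.1f C\n", max, limit
            exit ((max >= limit) ? 1 : 0)
        }'
}

# Usage on the NAS (not run here), with the limit you looked up:
#   ipmitool sensor | grep CPU | awk '{print $4}' | check_temps 72.6
```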
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
Just wanted to update you here... no reboots or shutdowns since my last post. I haven't checked any of the logs either as I have been quite busy. I did implement an SNMP monitoring tool. Hopefully the monitoring there can shed some light on temps and other settings.

I can't really shut down the NAS, so I need to get another spare NAS I have and move all the data over before I can do more testing. Will report back when I do this in the near future.
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
This might be helpful information. The server shut down again on 5/3. The only log entry I see in IPMI for that timestamp is this:

22 2019/05/03 20:16:25 CPU2 Temp Processor Thermal Trip - Assertion

I don't have two CPUs. Any ideas?
 

Jimmy Tran

Dabbler
Joined
Dec 27, 2015
Messages
33
Ok, the server shut down on me again. This time I was lucky enough to come across some additional Synology NASes. All my VMs have been moved off and I can begin testing. I ran the full MemTest and nothing happened. I've been running a stress test on all 16 cores and it is a steady 52-53C. Manufacturer max is 72.1C.

Any other ideas or suggestions? The interesting thing is the message is for CPU2 but I only have one CPU?
 

Tranquil IT

Cadet
Joined
May 14, 2019
Messages
3
It is probably far too late for Jimmy, but if anyone else runs into this problem with SuperMicro Mainboards the problem is likely the battery on the mainboard. Had the exact same problem with mine, worked fine for years and then randomly started shutting down with IPMI claiming the issue as Watchdog2. Tested literally every component and found them to be all good before Supermicro support suggested we change the battery on the mainboard which worked instantly. Hopefully this helps someone else out!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
Yeah, that's an important one to keep in mind: Systems will often go nuts in weird and wonderful ways when the system NVRAM battery is marginal.

Dead batteries are great, the system wakes up one day and has lost settings maybe. Fine. But that rarely happens, and when the battery gets marginal, NVRAM contents can end up corrupted. I've seen:
  • System won't even turn on (imagine a defective PSU)
  • System loses one of four memory channels
  • System sometimes starts booting but hangs mysteriously, sometimes shows the "boot failure" warning despite having successfully booted, and generally lots of pain getting the system to boot reliably.
So: bizarre failure modes that are hard to track down? Try the battery.
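
Before pulling the battery, it may be worth seeing whether the BMC reports its voltage. A sketch only: it assumes the board exposes a sensor with VBAT in its name (not every board does) and that ipmitool sensor prints pipe-separated columns in the order name, reading, unit, status. The function name vbat_status is made up. A reading that looks fine doesn't rule out a marginal battery, so treat this as a hint, not a verdict.

```shell
# Hypothetical helper: print name, reading, unit and status for any
# sensor whose name contains "vbat" (case-insensitive). Expects
# pipe-separated sensor lines on stdin; prints nothing if there is no
# such sensor.
vbat_status() {
    awk -F'|' 'tolower($1) ~ /vbat/ {
        # Trim the padding around the first four columns.
        for (i = 1; i <= 4; i++) gsub(/^ +| +$/, "", $i)
        print $1, $2, $3, $4
    }'
}

# Usage on the NAS (not run here):
#   ipmitool sensor | vbat_status
```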
 