SOLVED How to investigate random reboots ? => Bad PSU

Status
Not open for further replies.

Marcet

Contributor
Joined
May 31, 2013
Messages
193
I'm facing a problem with my main server : random reboots :(
I've been running this machine two weeks without problems. But now it reboots itself randomly from time to time.

I've searched the forum threads, but I can't be sure what's happening.

For the first two weeks, the server only did file sharing, Time Machine backups for 6 Macs, and ran an Emby server.

After that, I added the Transmission and OwnCloud plugins and tested a VirtualBox jail.
I also moved my jail root to 6x Crucial SSDs in RAIDZ2.

I have several hypotheses:

1) The jail root on the RAIDZ2 SSDs causes the problem. I've seen a thread on the forum regarding problems with SSDs (it was with VirtualBox, I think). I've stopped the VirtualBox jail, but the problem remains.

2) I've got a bunch of Watchdog events in the log :
Code:
1    2016/04/01 22:29:32    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt - Assertion
2    2016/04/01 22:29:33    Watchdog 2 #0xca    Watchdog 2    Hard Reset - Assertion
3    2016/04/05 06:25:11    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt - Assertion
4    2016/04/05 06:25:12    Watchdog 2 #0xca    Watchdog 2    Hard Reset - Assertion

Those correspond to the last 2 reboots.
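
A minimal sketch for pulling the reset timestamps out of an event log pasted in the layout above. It assumes each row starts with the event ID, then the date, then the time; the function name `hard_resets` and the file name in the usage line are made up for illustration.

```shell
#!/bin/sh
# Print the date and time of every "Hard Reset" event read from stdin.
# Assumes the layout shown above: event ID, date, time, then the description.
hard_resets() {
    awk '/Hard Reset/ { print $2, $3 }'
}

# Usage (hypothetical file holding the pasted event log):
#   hard_resets < sel_export.txt
```

Comparing these timestamps with the gaps in /var/log/messages is one way to confirm that each reset really lines up with a reboot.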
After reading the forum thread about watchdog reboots, I disabled the watchdog in the BIOS and removed the jumper on the motherboard. I also tried to disable the watchdog daemon in rc.conf, but the change didn't survive the reboot. So I stopped the daemon manually after the restart (not very convenient; I should find a better solution if this is the main cause).

3) Hardware problem. How can I determine that?

I'll appreciate your help.
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
Random reboots are typically a sign of a bad power supply. Get yourself a PSU tester (typically around $20 on Amazon) to ensure you're seeing the correct voltages.

You can also check the logs in /var/log for anything untoward.
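
A rough sketch of that log check, assuming a handful of keywords are worth looking for. The keyword list and the function name `suspicious_lines` are assumptions, not an exhaustive or official set.

```shell
#!/bin/sh
# Print log lines that mention a few common trouble keywords.
# The keyword list is an assumption; extend it to suit your system.
# Reads the files given as arguments, or stdin when there are none.
suspicious_lines() {
    grep -Ei 'panic|watchdog|fatal|MCA:' "$@"
}

# Usage:
#   suspicious_lines /var/log/messages
```

An empty result does not clear the hardware: a hard reset or power loss can leave nothing behind except a gap in the timestamps.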
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
I'm pretty sure I saw this issue related to the IPMI watchdog. I believe there was something in the actual IPMI web interface settings that needed to be adjusted or disabled.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
A $20 PSU tester is probably crappier than the crappy PSU you're trying to test... So unless the PSU is astonishingly crappy you'll not see much on the tester... :D
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
m0nkey_ said:
Random reboots are typically a sign of a bad power supply. Get yourself a PSU tester (typically around $20 on Amazon) to ensure you're seeing the correct voltages.

PSU is: Seasonic SS-600H2U 2U 80+ PSU
I have a tester, I'll check that.

m0nkey_ said:
You can also check the logs in /var/log for anything untoward.

Each time I look in /var/log/messages I find nothing, just a gap in the timestamps.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
depasseg said:
I'm pretty sure I saw this issue related to the IPMI watchdog. I believe there was something in the actual IPMI web interface settings that needed to be adjusted or disabled.

I've disabled the watchdog in the BIOS and removed the jumper on the board (by the way, we have the same motherboard).
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Bidule0hm said:
A $20 PSU tester is probably crappier than the crappy PSU you're trying to test... So unless the PSU is astonishingly crappy you'll not see much on the tester... :D

I can also check the voltages in the BIOS. That would be more precise.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Yep, but the thing is an okay DC voltage isn't that useful because you can have spikes, brown-outs, noise, ripple, ... that you can't see with this method.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Bidule0hm said:
Yep, but the thing is an okay DC voltage isn't that useful because you can have spikes, brown-outs, noise, ripple, ... that you can't see with this method.

Arf... So I'm screwed. How do I properly test a PSU, then?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
At least an oscilloscope (even an old one that only goes up to a few MHz is plenty) and a load (this can be a programmable load or your real server, although with the server you won't be able to test the PSU fully, of course). And the knowledge to use it, of course (because, for example, reading mV of ripple on top of 12 V isn't that straightforward if you don't know how to do it) :)

But before going this way there's a ton of other things that can reboot the server that are easier to check ;)
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Bidule0hm said:
At least an oscilloscope (even an old one that only goes up to a few MHz is plenty) and a load (this can be a programmable load or your real server, although with the server you won't be able to test the PSU fully, of course). And the knowledge to use it, of course (because, for example, reading mV of ripple on top of 12 V isn't that straightforward if you don't know how to do it) :)

But before going this way there's a ton of other things that can reboot the server that are easier to check ;)

I must give up on this. I have neither the oscilloscope nor the knowledge that goes with it.

Moreover, it's a 600 W Seasonic. That should be reliable, no?

Here are the values from IPMI :
Code:
12V       Normal    11.937 Volts  10.173    10.299    13.26    13.386
5VCC      Normal    4.922  Volts   4.246     4.298     5.546    5.598
3.3VCC    Normal    3.316  Volts   2.789     2.823     3.656    3.69
VBAT      Normal    3.075  Volts   2.375     2.487     3.887    3.999
Vcpu      Normal    1.8    Volts   1.242     1.26      2.088    2.106
VDIMMAB   Normal    1.182  Volts   0.948     0.975     1.425    1.443
VDIMMCD   Normal    1.182  Volts   0.948     0.975     1.425    1.443
5VSB      Normal    5.078  Volts   4.246     4.298     5.546    5.598
3.3VSB    Normal    3.282  Volts   2.789     2.823     3.656    3.69
1.5V PCH  Normal    1.482  Volts   1.32      1.347     1.671    1.698
1.2V BMC  Normal    1.2    Volts   1.02      1.047     1.371    1.398
1.05V PCH Normal    1.032  Volts   0.87      0.897     1.221    1.248
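
A minimal sketch for checking a table like the one above against its own limits. The column order is an assumption, since the post gives no headers: name, status, reading, unit, then four thresholds from lowest to highest. The function name `check_rails` is made up for illustration.

```shell
#!/bin/sh
# For each sensor row on stdin, print the name, the reading, and whether the
# reading lies strictly between the two inner thresholds.
# Assumed column order (not stated in the post): name, status, reading, unit,
# then four thresholds from lowest to highest. Sensor names may contain a
# space ("1.5V PCH"), so fields are counted from the end of the line.
check_rails() {
    awk 'NF >= 8 {
        reading = $(NF - 5) + 0
        low     = $(NF - 2) + 0   # second-lowest threshold
        high    = $(NF - 1) + 0   # second-highest threshold
        name = $1
        for (i = 2; i <= NF - 7; i++) name = name " " $i
        verdict = (reading > low && reading < high) ? "ok" : "OUT OF RANGE"
        printf "%s %s %s\n", name, $(NF - 5), verdict
    }'
}

# Usage (hypothetical file holding the pasted table):
#   check_rails < ipmi_sensors.txt
```

This only looks at the averaged readings the BMC reports, so, as noted earlier in the thread, it cannot reveal spikes, ripple, or short brown-outs.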
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Yep, I've never seen better PSUs (except TDK-Lambda maybe, but that's not the same world...), but if it's overloaded (which it likely isn't, even if you didn't give us the full hardware list), it can be the best PSU in the world and it's still overloaded.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Bidule0hm said:
Yep, I've never seen better PSUs (except TDK-Lambda maybe, but that's not the same world...), but if it's overloaded (which it likely isn't, even if you didn't give us the full hardware list), it can be the best PSU in the world and it's still overloaded.

The full hardware list is in my signature, although I'm still waiting for the SATADOM (using USB instead).
As you mentioned, I think 600 W is plenty for that config.
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
Is your FN box hooked up to a UPS? A UPS will help with under/over-voltage on the mains, and will also help protect against brownouts and noise.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
m0nkey_ said:
Is your FN box hooked up to a UPS? A UPS will help with under/over-voltage on the mains, and will also help protect against brownouts and noise.

Yes, I've got an APC SmartUPS 1500 VA that I forgot to mention in my signature (it's now corrected).
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874

Marcet

Contributor
Joined
May 31, 2013
Messages
193
depasseg said:
There's a tunable mentioned in this thread (as well as checking the IPMI event log).
https://forums.freenas.org/index.php?threads/timer-interrupt-and-hard-reset.26356/page-2#post-220706

I've already stopped the daemon and used the command "ipmitool mc watchdog off".

Another clue might be the BIOS and BMC updates I made.
Code:
BIOS 1.0c  -> 2.0
BMC  3.20  -> 3.27

Should I try to go back to original BIOS and/or BMC revision ?

If yes, where can I find older versions, as the SuperMicro website only shows the latest?
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Turning off the watchdog didn't do it.

My next possible culprit: the SSDs. So I moved all my jails back to my HDD zpool and removed the SSDs from the system.

As a precaution, I also disabled the watchdog:

/etc/rc.d/watchdogd stop
Code:
Stopping watchdogd.
Waiting for PIDS: 1972.

ipmitool mc watchdog get
Code:
Watchdog Timer Use:     SMS/OS (0x04)
Watchdog Timer Is:      Stopped
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      0 sec
Present Countdown:      0 sec
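
A small sketch for re-checking that state after each reboot, since the setting may not persist. It parses the "Watchdog Timer Is" line from the output shown above; the function name `watchdog_stopped` is made up for illustration.

```shell
#!/bin/sh
# Read `ipmitool mc watchdog get` output on stdin and print the timer state
# ("Stopped" in the output above). The exit status is 0 only when the state
# is exactly "Stopped".
watchdog_stopped() {
    state=$(awk -F': *' '/^Watchdog Timer Is/ { print $2 }')
    echo "$state"
    [ "$state" = "Stopped" ]
}

# Usage on the server itself:
#   ipmitool mc watchdog get | watchdog_stopped || echo "watchdog is active"
```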


Now, wait and see. My best uptime has been almost 3 days.
 

hugovsky

Guru
Joined
Dec 12, 2011
Messages
567
I've had 1 or 2 random reboots with my backup system. It's this system:

Motherboard: SuperMicro A1SAi-2750F mini-ITX
CPU: Intel Atom C2750 CPU, 8-core 2.4 GHz
RAM: Kingston 4x 8GB 2Rx8 1G x 72-Bit PC3L-12800 CL11 204-Pin ECC SODIMM
Drives: 6x WD Red 3 TB, RAIDZ2


It is only used for ZFS replication, nothing more. I solved it by enabling autotune, as described in a thread on the forum. I'll try to find it again.

Don't know if this will help in your case.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
hugovsky said:
I've had 1 or 2 random reboots with my backup system. It's this system:

Motherboard: SuperMicro A1SAi-2750F mini-ITX
CPU: Intel Atom C2750 CPU, 8-core 2.4 GHz
RAM: Kingston 4x 8GB 2Rx8 1G x 72-Bit PC3L-12800 CL11 204-Pin ECC SODIMM
Drives: 6x WD Red 3 TB, RAIDZ2

It is only used for ZFS replication, nothing more. I solved it by enabling autotune, as described in a thread on the forum. I'll try to find it again.

Don't know if this will help in your case.

It would be great if you could find it. Thanks.
 