SOLVED How to investigate random reboots ? => Bad PSU

Status
Not open for further replies.

Marcet

Contributor
Joined
May 31, 2013
Messages
193

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Random reboots are most frequently caused by failing/failed hardware, improperly configured hardware in the BIOS, or hardware that is incompatible.

Just for giggles, are there any BIOS updates or anything like that for your system? Maybe try resetting your BIOS to defaults and see what happens.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Just for giggles, are there any BIOS updates or anything like that for your system? Maybe try resetting your BIOS to defaults and see what happens.
I have the latest BIOS and BMC for the board.
But I was wondering if I should try reverting to an older revision of both the BIOS and BMC?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I wouldn't try reverting to older stuff (that's how you brick your board). But, you could try defaulting the BIOS and BMC and only change the stuff you *must* change like boot device. If it starts working properly then you can start setting the BIOS to better (read: more optimized) settings. :D
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
I wouldn't try reverting to older stuff (that's how you brick your board).
I won't do it ;)

But, you could try defaulting the BIOS and BMC and only change the stuff you *must* change like boot device. If it starts working properly then you can start setting the BIOS to better (read: more optimized) settings. :D
When I updated the BMC, I also reset its settings (as keeping them was not recommended).
I'll try the same with the BIOS settings if I don't pass the 3-day mark. Thanks.
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
I've had 1 or 2 random reboots with my backup system. It's this system:

Motherboard: SuperMicro A1SAi-2750F mini-ITX
CPU: Intel Atom C2750 CPU, 8-core 2.4 GHz
RAM: Kingston 4x 8GB 2Rx8 1G x 72-Bit PC3L-12800 CL11 204-Pin ECC SODIMM
Drives: 6x WD Red 3 TB, RAIDZ2


It is only used for ZFS replication, nothing more. Solved by enabling autotune, as I read in a thread on the forum. I'll try to find it again.

I would be interested in this information, too, especially which FreeNAS version you are running and which settings autotune made.

BTW: CPU-/Mainboard- and RAM-wise my hardware is identical. I didn't see any instability (running almost 24/7 for about 12 months now) and never fiddled with autotune and/or tunables so far.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You have done quite a few things to change the configuration, and it all seems to have started when you added the SSDs. I'm not sure what you have set up for jails, but hopefully you have a configuration backup from before you added the SSDs to your system; then you could just restore that configuration file and see if everything is working again. Additionally, I would re-enable all the watchdog timers you disabled. Basically, roll your system back to a state where it worked. If you upgraded the motherboard firmware, I would leave that alone; do not go backwards. If you really think a firmware update caused this problem, open a ticket with Supermicro, but it's going to be a tough sell: you have made so many changes in a short period of time that figuring out what the issue is could take a while.

Also, your configuration shows you are running the latest version of 9.10. Since there was a recent update to it, maybe try rolling back a version as well. Of course, I'm assuming you jump on the updates as soon as they come out.

Some advice for the future... Take your time making changes to a functional system. Add something and give it time to see if it causes an adverse reaction. This means if you perform a BIOS upgrade, wait a few days to ensure no problems crop up. If you are not good at keeping notes on changes you make to your system, I'd wait a week. I myself would wait a few hours while I put the system through its paces and try to break it. If it lasts, time to move on, but that is just me.

As for the power supply, if you suspect it, I'd recommend you find another one you could install to see if the problem goes away. Even though I know how to use an O'scope, most people do not have one at home, and the ones I have at work are classified, so I can't take one home, nor could I bring in my computer to connect to it. A good spare power supply is a very useful tool to have on hand. And just because the power supply you have is a Seasonic (my favorite brand) doesn't mean it hasn't failed, although at this point I don't suspect the PSU is the culprit. Run Memtest and a CPU stress test; these can typically root out a PSU issue. Ensure you have all your drives connected to pull the maximum load as well, including the SSDs if you want to see whether they are causing the PSU to overload. Additionally, ensure the cables you are connecting to your SSDs are in good condition.

Good Luck.
 

hugovsky

Guru
Joined
Dec 12, 2011
Messages
567
I would be interested in this information, too, especially which FreeNAS version you are running and which settings autotune made.

BTW: CPU-/Mainboard- and RAM-wise my hardware is identical. I didn't see any instability (running almost 24/7 for about 12 months now) and never fiddled with autotune and/or tunables so far.

This server only serves as a ZFS replication target for another server, nothing more. I can't seem to find the thread.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
You have done quite a few things to change the configuration, and it all seems to have started when you added the SSDs. I'm not sure what you have set up for jails, but hopefully you have a configuration backup from before you added the SSDs to your system; then you could just restore that configuration file and see if everything is working again.
I know. I should have been more careful and left some time between actions.
But, you know how it is... I ran 2 FreeNAS servers based on old PCs for about 5 years and never had any problem other than failing hard drives.
So I was full of enthusiasm when I finally built a proper server-grade machine ;)
Let whoever has never been excited by tech cast the first stone :D

Additionally I would re-enable all the watchdog timers you disabled. Basically roll your system back to a state where it worked.
For now, I've moved my jails from the SSDs to the main HDD-based volume and physically removed the SSDs.
If I pass the 3-day mark, I'll consider re-enabling the watchdog.
Do you mean all watchdogs? BIOS, Hardware Jumper and Daemon?

If you upgraded the motherboard firmware, I would leave that alone; do not go backwards. If you really think a firmware update caused this problem, open a ticket with Supermicro, but it's going to be a tough sell: you have made so many changes in a short period of time that figuring out what the issue is could take a while.
Ok.

Also, your configuration shows you are running the latest version of 9.10. Since there was a recent update to it, maybe try rolling back a version as well. Of course, I'm assuming you jump on the updates as soon as they come out.
As a matter of fact, I installed 9.10 when I started to suspect a watchdog problem.
I had read a thread saying that some fixes had been made to the watchdog software stack.

Some advice for the future... Take your time making changes to a functional system. Add something and give it time to see if it causes an adverse reaction. This means if you perform a BIOS upgrade, wait a few days to ensure no problems crop up. If you are not good at keeping notes on changes you make to your system, I'd wait a week. I myself would wait a few hours while I put the system through its paces and try to break it. If it lasts, time to move on, but that is just me.
I will be more careful in the future. Thanks for the advice.

As for the power supply, if you suspect it, I'd recommend you find another one you could install to see if the problem goes away. Even though I know how to use an O'scope, most people do not have one at home, and the ones I have at work are classified, so I can't take one home, nor could I bring in my computer to connect to it. A good spare power supply is a very useful tool to have on hand. And just because the power supply you have is a Seasonic (my favorite brand) doesn't mean it hasn't failed, although at this point I don't suspect the PSU is the culprit. Run Memtest and a CPU stress test; these can typically root out a PSU issue. Ensure you have all your drives connected to pull the maximum load as well, including the SSDs if you want to see whether they are causing the PSU to overload.
I have no other 600 W PSU to compare with. But I will run the stress tests in a few days to check the system's stability under load.

Additionally, ensure the cables you are connecting to your SSDs are in good condition.
No SATA cables are involved, only SAS cables and backplanes.
I can swap disk positions to check that every caddy is working well. But it should be OK, because I've changed some positions in the recent past.

Good Luck.
Thank you very much for your support and this very clear briefing. Appreciate it (y)
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Do you mean all watchdogs? BIOS, Hardware Jumper and Daemon?
Absolutely. It was working before, so I say try to restore the original configuration if you can, or at least do so over time. Let's say all is good after your 3-day test; then I'd add in all of the watchdog timers. If it breaks, remove half of the timers and try again. Try to sort it out in the least amount of time possible. It may have been the SSDs or how they were configured; let's hope so, just so you can get back to a fully operational and reliable system.
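
The halving approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: exactly one of the re-enabled items is responsible, each trial (for example a multi-day soak test) gives a reliable verdict, and both `find_culprit` and the `is_stable` callback are hypothetical names, not part of FreeNAS.

```python
def find_culprit(changes, is_stable):
    """Narrow a list of suspect changes down to one by repeated halving.

    Assumes exactly one change causes the instability and that
    is_stable(enabled) reliably reports whether the system stays up
    with only the given changes enabled.
    """
    if not changes:
        raise ValueError("need at least one suspect change")
    suspects = list(changes)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if is_stable(half):
            # System stayed up with this half enabled: look in the other half.
            suspects = suspects[len(half):]
        else:
            # System broke with this half enabled: the culprit is in it.
            suspects = half
    return suspects[0]


# Simulated run: pretend the daemon is the bad one (an assumption made
# purely for illustration, not a diagnosis of the system in this thread).
watchdogs = ["BIOS watchdog", "hardware jumper watchdog", "watchdog daemon"]
bad = "watchdog daemon"
print(find_culprit(watchdogs, lambda enabled: bad not in enabled))
```

With random reboots a single trial can pass by luck, so each `is_stable` verdict is only as good as the length of the soak test behind it.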

I know. I should have been more careful and left some time between actions.
But, you know how it is... I ran 2 FreeNAS servers based on old PCs for about 5 years and never had any problem other than failing hard drives.
So I was full of enthusiasm when I finally built a proper server-grade machine ;)
Let whoever has never been excited by tech cast the first stone :D
We all get like that, it's normal.
 

hugovsky

Guru
Joined
Dec 12, 2011
Messages
567
I've found it! My issue was this:

Code:
[root@nas-backup] /data/crash# less info.0
Dump header from device /dev/dumpdev
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 122368B (0 MB)
  Blocksize: 512
  Dumptime: Mon Mar 14 15:47:31 2016
  Hostname: nas-backup.hsnetworks
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 9.3-RELEASE-p31 #0 r288272+33bb475: Wed Feb  3 02:19:35 PST 2016
    root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/syskmem_malloc(16777216): kmem_map too small: 17581543424 total allocated  Panic String: kmem_malloc(16777216): kmem_map too small: 17581543424 total allocated
  Dump Parity: 1880663058
  Bounds: 0
  Dump Status: good


Searched and found this:

https://bugs.freenas.org/issues/13061

So, I enabled autotune and it's been a while since the last reboot. This happened only on a ZFS replication server. This server doesn't have any services enabled. It just replicates another.

EDIT: Clarified server role.
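
For anyone checking their own box for the same panic, a minimal Python sketch for pulling the fields out of a saved `info.N` text file like the one above might look like this. The function name `parse_dump_info` and the shortened sample text are illustrative assumptions; the field layout is taken only from the dump shown in this post.

```python
import re


def parse_dump_info(text):
    """Return a dict of the 'Key: value' header fields in an info.N file."""
    fields = {}
    for line in text.splitlines():
        # Header fields are indented and look like "  Key: value".
        match = re.match(r"^\s+([A-Za-z ]+?):\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    # The version string can run straight into the panic string on the same
    # line (as in the dump above), so also search the whole text for it.
    panic = re.search(r"Panic String:\s*(.*)", text)
    if panic:
        fields["Panic String"] = panic.group(1).strip()
    return fields


# Shortened sample based on the dump above.
sample = """Dump header from device /dev/dumpdev
  Architecture: amd64
  Dumptime: Mon Mar 14 15:47:31 2016
  Hostname: nas-backup.hsnetworks
  Panic String: kmem_malloc(16777216): kmem_map too small: 17581543424 total allocated
  Dump Status: good
"""
info = parse_dump_info(sample)
print(info["Dumptime"], "->", info["Panic String"])
```

Reading the real file would be a matter of passing in the contents of something like `/data/crash/info.0`, the path shown in the prompt above.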
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Guys, I'm proud to announce I've passed the 4-day mark :)

So the culprit is one of these nasty guys:
  1. Crucial MX200 SSDs
  2. Some kind of watchdog issue
  3. UPS USB connection
My guess: #1 is the winner.

My next move will be to reconnect the UPS USB cable and wait another day before using the watchdogs again (as I have to shut down the server for that).
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
IIRC, the MX200s had an odd firmware bug or two that was patched in an update.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
IIRC, the MX200s had an odd firmware bug or two that was patched in an update.

You're right, thanks. They have the MU02 firmware right now.
[Attached screenshot: 474283Capturedecran20160411a124516.png]
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Glad you figured it out.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Yesterday, I plugged my UPS USB cable back in and applied the FreeNAS update.

After the reboot, I didn't disable the watchdog daemon manually, so it was running.
This morning, I got a reset :(

Code:
5    2016/04/13 11:32:03    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt - Assertion
6    2016/04/13 11:32:04    Watchdog 2 #0xca    Watchdog 2    Hard Reset - Assertion


If it had been the UPS, the machine should have shut down rather than reset (at least I believe so).
So I believe there is something wrong with the watchdog itself.
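
As a rough aid for reading event-log exports like the two lines above, the following Python sketch picks out watchdog hard resets. It assumes the columns are separated by tabs or runs of two or more spaces and that timestamps use the `YYYY/MM/DD HH:MM:SS` form seen above; `watchdog_resets` is a hypothetical helper name, not an IPMI tool.

```python
import re
from datetime import datetime


def watchdog_resets(log_text):
    """Return timestamps of watchdog 'Hard Reset' rows in an event-log export."""
    resets = []
    for line in log_text.splitlines():
        # Expected columns: id, timestamp, sensor, sensor type, description.
        cols = re.split(r"\s{2,}|\t", line.strip())
        if len(cols) != 5:
            continue  # skip blank or unexpected rows
        _event_id, stamp, _sensor, sensor_type, description = cols
        if "Watchdog" in sensor_type and description.startswith("Hard Reset"):
            resets.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    return resets


sample = """\
5    2016/04/13 11:32:03    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt - Assertion
6    2016/04/13 11:32:04    Watchdog 2 #0xca    Watchdog 2    Hard Reset - Assertion
"""
for when in watchdog_resets(sample):
    print("watchdog hard reset at", when.isoformat(sep=" "))
```

A reboot that leaves no matching row in the event log, as opposed to one that does, is a hint that the watchdog was not what reset the machine.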

I want it to be disabled automatically at boot.
But /etc/rc.conf is recreated at boot and won't keep my watchdogd_enable="NO".
And when I look at /etc/rc.conf.local, there is this warning at the top:
# THIS FILE IS RESERVED FOR THE EXCLUSIVE USE OF FREENAS CONFIG SYSTEM.
# Please edit /etc/rc.conf instead.
How do I disable the watchdog properly?
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
Forget what I said in the previous post. I had another reboot while rsyncing some folders through ssd.
Note that the previous reboot was also during an rsync.
No event was recorded in the IPMI event log this time.

I've disabled the UPS service and unplugged the USB cable.
I left the software watchdog running and tried the same rsync again.

It worked.

But then, after 2 hours, another reboot.

I'm running a memory test right now.
 

Marcet

Contributor
Joined
May 31, 2013
Messages
193
It's not the memory. But I expected as much, since it had been tested by the assembler.
[Attached screenshot: 2016-04-13-21.03.49.png]
 