Debugging spontaneous reboots

Status
Not open for further replies.

lungfork

Dabbler
Joined
Jan 15, 2013
Messages
16
Hello all,
We have two identical FreeNAS boxes, a primary and backup. The primary mirrors all data from our ZFS data volumes to the backup via rsync.

Lately, the primary box has been rebooting spontaneously. To my knowledge, it has happened twice in the last three weeks. The backup box has not had any issues.

We're on the latest build of FreeNAS 9.3 STABLE (201506292130).

Here are the hardware specs (note that the primary and backup are identical, the one exception being the 10GbE card):
SuperMicro MBD-X8DAH+-F-O Motherboard
Intel Xeon E5620
24 GB RAM
LSI 9211-8i SAS HBA
Emulex OCE10102-NM 10Gb Ethernet card
24x 4TB HGST hard drives split into three RAIDZ1 volumes
2x WD Scorpio Black hard drives (FreeNAS boot device)

I'm looking for some basic troubleshooting advice. I've pored through the syslogs on the system and can't find any explanation for the reboots. I found a syslog-ng.core file in /var/db/system/cores with a timestamp close to the time of the last reboot. Is this the core dump file that I should be looking through? Any advice on what to look for in this file?

Any help you could provide would be appreciated.

Jordan
 

lungfork

Dabbler
Joined
Jan 15, 2013
Messages
16
So I've been googling this all morning.

It appears that the core file in /var/db/system/cores is a dead end; this seems to be a common problem for many on FreeNAS 9.3. My crash dump configuration in rc.conf is as follows:

Code:
# get crashdumps
dumpdev="AUTO"
dumpdir="/data/crash"
ix_textdump_enable="YES"


There is nothing in /data/crash. So does this mean I am not really dealing with a kernel panic? Are there any other issues that could be causing spontaneous reboots? The system has redundant power supplies, is on two different PDUs, and is protected by a UPS and a backup generator. The backup system mentioned in the first post is not rebooting spontaneously.
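
For anyone following along, here is a minimal sketch of how the configured dump directory could be read back and inspected. It assumes the settings live in an rc.conf-style file of plain sh assignments, as in the snippet above. The helper name rc_value is made up for illustration, and the /etc/rc.conf path in the usage comment is an assumption about where the file sits on a given box.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one variable from an rc.conf-style file.
# The file is sourced in a subshell so nothing leaks into the calling shell.
# Pass a path that contains a slash, otherwise "." may search PATH for the file.
rc_value() {
    file="$1"
    name="$2"
    ( . "$file" >/dev/null 2>&1 && eval "printf '%s\n' \"\${$name}\"" )
}

# Usage (path is an assumption, adjust for your system):
#   dir=$(rc_value /etc/rc.conf dumpdir)
#   ls -l "$dir"
```

An empty listing only shows that nothing was saved. It does not by itself show whether saving dumps works on this build.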

Thanks for reading this far.

Jordan
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
What about the logs from the Supermicro BIOS?

We had a problem with one of our VM hosts rebooting spontaneously, and the problem was related to one of the PSUs going bad. The bad PSU was incorrectly reporting the system load, which was triggering the "System Overload" condition, so the motherboard unceremoniously halted the system to prevent damage, and it then rebooted.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It could be a flaky 10GbE driver. Not many Emulex NICs around here.
 

lungfork

Dabbler
Joined
Jan 15, 2013
Messages
16
Nick2253 said:
What about the logs from the Supermicro BIOS?

We had a problem with one of our VM hosts rebooting spontaneously, and the problem was related to one of the PSUs going bad. The bad PSU was incorrectly reporting the system load, which was triggering the "System Overload" condition, so the motherboard unceremoniously halted the system to prevent damage, and it then rebooted.

Just enabled the IPMI BMC controller and found nothing in the logs. I presume yours had some sort of log entry showing a problem with the PSU?

Ericloewe said:
It could be a flaky 10GbE driver. Not many Emulex NICs around here.

Yeah, that's my suspicion right now, though I feel like that would produce some kind of error in syslog or a kernel crash dump. This NIC was unsupported until we upgraded this system from FreeNAS 8 to 9.

Thanks for the feedback.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
lungfork said:
Just enabled the IPMI BMC controller and found nothing in the logs. I presume yours had some sort of log entry showing a problem with the PSU?

Yeah, that's my suspicion right now, though I feel like that would produce some kind of error in syslog or a kernel crash dump. This NIC was unsupported until we upgraded this system from FreeNAS 8 to 9.

Thanks for the feedback.

What's the driver's name?
 

lungfork

Dabbler
Joined
Jan 15, 2013
Messages
16

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
lungfork said:
Just enabled the IPMI BMC controller and found nothing in the logs. I presume yours had some sort of log entry showing a problem with the PSU?

Correct. Supermicro generally does a pretty good job with their BIOS logs, so if it's something hardware (and only hardware), then you'd likely see it. If you're not seeing anything, then I'd strongly suspect it's something software/driver related.
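
As a rough sketch of how the BMC's event log could be scanned for hardware trouble from the shell: this assumes ipmitool is installed and can reach the BMC. The filter name filter_hw_events and its keyword list are made up for illustration and are not an exhaustive set of Supermicro event names.

```shell
#!/bin/sh
# Hypothetical filter: keep only event-log lines that mention power or thermal trouble.
# The keyword list is a guess; widen it if your board words its events differently.
filter_hw_events() {
    grep -i -E 'power|voltage|temperature|fan'
}

# Usage on the box itself (needs root and a working IPMI interface):
#   ipmitool sel elist | filter_hw_events
```

No output from the filter is not proof that the hardware is healthy, so it is worth reading the unfiltered log at least once.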
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If you're replacing the NIC, consider Intel 10GbE if you're okay with roughly 7 Gb/s (I think that's what the current FreeBSD driver manages, though Intel's hardware is said to be the best out there), or Chelsio if you need the full 10 Gb/s. Both Intel and Chelsio are stable and widely used.
 

lungfork

Dabbler
Joined
Jan 15, 2013
Messages
16
Ericloewe said:
If you're replacing the NIC, consider Intel 10GbE if you're okay with roughly 7 Gb/s (I think that's what the current FreeBSD driver manages, though Intel's hardware is said to be the best out there), or Chelsio if you need the full 10 Gb/s. Both Intel and Chelsio are stable and widely used.

Thanks for the suggestion, that was my next question. I'm going to pull the NIC tonight and see if it remains stable. Thanks for all your help.
 

lungfork

Dabbler
Joined
Jan 15, 2013
Messages
16
Nick2253 said:
Correct. Supermicro generally does a pretty good job with their BIOS logs, so if it's something hardware (and only hardware), then you'd likely see it. If you're not seeing anything, then I'd strongly suspect it's something software/driver related.

Thanks for pointing this out to me. It will be helpful to have IPMI up and running in the future. Thanks for your help.
 
Joined
Jul 3, 2015
Messages
926
I can second Chelsio


 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There are some FreeNAS builds that don't do proper crashdumps. You can force a crash by running 'sysctl debug.kdb.panic=1' and letting the box reboot. If a crashdump is saved in /data/crash afterwards, then crashdumps work properly. In that case, the lack of dumps from your spontaneous reboots would mean that the box likely isn't panicking but has some kind of hardware problem. Of course, at this point it's going to be trial and error unless you find a smoking gun (or a smoking component).
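
A minimal sketch of that test, assuming the dump directory is /data/crash as in the rc.conf posted earlier in the thread. The helper name has_crash_dumps is made up for illustration, and the panic command is left commented out because it deliberately crashes the box.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given directory exists and is not empty.
has_crash_dumps() {
    dir="$1"
    [ -d "$dir" ] || return 1
    # ls -A lists every entry except . and .., so any output means something was saved.
    [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Step 1 (by hand, as root, on a box you can afford to crash):
#   sysctl debug.kdb.panic=1
# Step 2, once the box is back up:
#   has_crash_dumps /data/crash && echo "crashdumps work" || echo "nothing was saved"
```

If step 2 reports that nothing was saved, the empty /data/crash after the spontaneous reboots tells you nothing either way, because this build may simply not be writing dumps.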
 