Lockup then reboot, but why?

Status
Not open for further replies.

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
OK, so I've got this new server built.
I'm using a SuperMicro CSE-847A, X8DTH-iF motherboard, Intel Xeon X5690 3.46GHz, 48GB of Nanya ECC unbuffered RAM, 12 HGST 4TB Deskstar NAS drives, 5 LSI 9211-8i cards with v20 firmware in IT mode, and 1 SuperMicro AOC-STGN-I2S fiber card that currently isn't being used.
Supermicro SATADOM 32 GB Internal Solid State Drive SSD-DM032-PHI
FreeNAS-9.3-STABLE-201512121950

My problem lately is that if I put a decent load on the server (say about 40% in the CPU graphs, basically files moving from multiple locations at the same time), the server locks up for a minute or so and then reboots.
The only thing in my server's IPMI logs is:
1,System Event,12/28/2015 12:28:32 Mon,Critical Interrupt,,Assertion: Critical Interrupt| Event = Software NMI
This event is logged every time the system boots.

That's aside from my chassis intrusion alerts, which show up because I don't have the case screwed tight at the moment.

My question is has anyone else been down this road?
Are there logs in FreeNAS I can refer to to hopefully track down what is causing this?
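
For anyone who wants to sift events like the one above out of an exported IPMI log, here is a minimal Python sketch. It assumes the comma-separated column order seen in the single line quoted in this post (id, category, timestamp, sensor type, an empty column, description); other BMC firmware may export a different layout. `parse_sel_line` is a made-up helper name, not part of any IPMI tool.

```python
import csv
from datetime import datetime

def parse_sel_line(line):
    """Split one comma-separated IPMI event log line into named fields."""
    fields = next(csv.reader([line]))
    # Assumed column order, inferred from the one line quoted in the post:
    # id, category, timestamp, sensor type, (empty), description
    event_id, category, stamp, sensor, _, description = fields
    # The timestamp looks like "12/28/2015 12:28:32 Mon"; drop the weekday.
    when = datetime.strptime(stamp.rsplit(" ", 1)[0], "%m/%d/%Y %H:%M:%S")
    return {"id": int(event_id), "category": category, "time": when,
            "sensor": sensor, "description": description}

line = ("1,System Event,12/28/2015 12:28:32 Mon,Critical Interrupt,,"
        "Assertion: Critical Interrupt| Event = Software NMI")
event = parse_sel_line(line)
print(event["time"].isoformat(), event["sensor"], "->", event["description"])
```

Comparing the parsed timestamps against the boot times in the FreeNAS logs is one way to check whether the NMI is really logged at every boot.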
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
Here is my Kernel log on startup...


Code:
CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz (3466.87-MHz K8-class CPU)
mps1: SAS Address from SATA device = 243e7d7a867ca999
mps1: SAS Address for SATA device = 243c8087ae78908b
mps1: SAS Address from SATA device = 243c8087ae78908b
mps1: SAS Address from SATA device = 243c818a8e7ca18b
mps0: SAS Address for SATA device = 242d7e87a1889a90
mps0: SAS Address from SATA device = 242d7e87a1889a90
da5 at mps1 bus 0 scbus1 target 3 lun 0
da5: <ATA HGST HDN724040AL A5E0> Fixed Direct Access SCSI-6 device
da5: Serial Number       PK2334PCKD89RB
da5: 600.000MB/s transfers
da5: Command Queueing enabled
da5: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da1 at mps0 bus 0 scbus0 target 1 lun 0
da1: <ATA HGST HDN724040AL A5E0> Fixed Direct Access SCSI-6 device
da1: Serial Number       PK1334PCJYPURS
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da3 at mps0 bus 0 scbus0 target 3 lun 0
da3: <ATA HGST HDN724040AL A5E0> Fixed Direct Access SCSI-6 device
da3: Serial Number       PK1381PCJVLD2S
da3: 600.000MB/s transfers
da3: Command Queueing enabled
da3: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da4 at mps0 bus 0 scbus0 target 8 lun 0
da4: <ATA HGST HDN724040AL A5E0> Fixed Direct Access SCSI-6 device
da4: Serial Number       PK2334PEGN9K5T
da4: 600.000MB/s transfers
da4: Command Queueing enabled
da4: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da8 at mps2 bus 0 scbus2 target 0 lun 0
da8: <ATA HGST HDN724040AL A5E0> Fixed Direct Access SCSI-6 device
da8: Serial Number       PK1334PCKYVVYX
da8: 600.000MB/s transfers
da8: Command Queueing enabled
da8: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da9 at mps2 bus 0 scbus2 target 1 lun 0
da9: <ATA HGST HDN724040AL A5E0> Fixed Direct Access SCSI-6 device
da9: Serial Number       PK2338P4H8SV8C
da9: 600.000MB/s transfers
da9: Command Queueing enabled
da9: 3815447MB (7814037168 512 byte sectors: 255H 63S/T 486401C)
da10 at mps2 bus 0 scbus2 target 2 lun 0
ada0 at ahcich0 bus 0 scbus3 target 0 lun 0
ada0: <SATA SSD S9FM02.1> ATA-10 SATA 3.x device
ada0: Serial Number AF340757032400152351
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
SMP: AP CPU #8 Launched!
SMP: AP CPU #7 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #6 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #4 Launched!
Timecounter "TSC-low" frequency 1733433632 Hz quality 1000
vboxdrv: fAsync=0 offMin=0x2f3 offMax=0x1d2b
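
With this many HBAs it can help to see which disk hangs off which controller. The following is a rough Python sketch that groups the "daN at mpsN" attach lines from a kernel log like the one above. The regular expression only covers the line shape shown in this log, and `disks_by_controller` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches attach lines such as "da5 at mps1 bus 0 scbus1 target 3 lun 0".
ATTACH = re.compile(r"^(a?da\d+) at (\w+) bus \d+ scbus\d+ target (\d+) lun \d+")

def disks_by_controller(dmesg_text):
    """Return {controller: [(disk, target), ...]} for each attach line found."""
    groups = defaultdict(list)
    for line in dmesg_text.splitlines():
        m = ATTACH.match(line.strip())
        if m:
            disk, controller, target = m.groups()
            groups[controller].append((disk, int(target)))
    return dict(groups)

# A few attach lines copied from the log above.
sample = """\
da5 at mps1 bus 0 scbus1 target 3 lun 0
da1 at mps0 bus 0 scbus0 target 1 lun 0
da3 at mps0 bus 0 scbus0 target 3 lun 0
da4 at mps0 bus 0 scbus0 target 8 lun 0
da8 at mps2 bus 0 scbus2 target 0 lun 0
ada0 at ahcich0 bus 0 scbus3 target 0 lun 0
"""
for controller, disks in sorted(disks_by_controller(sample).items()):
    print(controller, disks)
```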
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Logs are most likely going to be in the debug file. Can you post a debug (System -> Advanced -> Save Debug)?
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
Here are my sensor readings...

Sensors_zpsutiux3yd.jpg


And here is what some idiot did that I had to correct.....
This wire was plugged into the power supply sensor cable to the power supplies.
20151222_202931_zpszweit88f.jpg



There were supposed to be two jumpers there, configured like so...
Since contacting SuperMicro I have fixed this, but I suspect damage.
ji2c%20jumper_zpsh4oqrvsz.png
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
cyberjock said: "Logs are most likely going to be in the debug file. Can you post a debug (System -> Advanced -> Save Debug)?"


Will do, working on it now......
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
File attached to previous message
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Here's what I see:

1. The box isn't crashing, so more than likely this is strictly a hardware failure/problem.
2. I don't see any serious problems in your logs. I do see lots of minor problems, which would normally tell me that the administrator of the server either isn't an advanced user or isn't detail oriented about resolving issues that show up in the logs.
3. Nothing else really looks horribly out of place or shows any signs of a problem. In fact, going by the logs, it looks like someone just walked up and hit the reset button on the box. Everything is fine, then suddenly the box is booting up.

If #2 is true (nothing wrong with that if you're still learning FreeNAS) then you may have overlooked or misconfigured something in your hardware or BIOS.

Other than that, I don't have any other good advice for you. :(
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
OK, good to know, I will continue troubleshooting the problem. I'm thinking another stress test is in order as it reboots faster when it's given a decent load.
If you don't mind me asking, which minor configuration problems are you referring to, so I can look into them further?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Visseroth said: "If you don't mind me asking, which minor configuration problems are you referring to, so I can look into them further?"

That's a question with too many answers. That's like asking "why might my car not be able to get on the freeway and do 65MPH?" I can give you 1000 reasons why your car couldn't even get to the freeway, let alone do 65MPH. ;)

You'd have to look and see what your BIOS settings are, etc. It is also totally possible there's a hardware problem and you have nothing misconfigured.
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
It is true there could be a misconfiguration, though I took my time configuring the BIOS settings, looking up each setting I didn't already know to determine the best setting for the current OS.

In regards to your vehicle not doing 65, could be a plugged fuel filter, it's happened to me, lol!!!!! I have a fuel pressure gauge and the fuel pressure dropped from 65psi to 0, yea, can you say no more acceleration!!! I've since fixed the troublesome filter issue.
Just trying to get grins out of that last comment :)

But yea, as soon as I know what the problem is I'll post back. I'll just have to do more testing and see if I can re-create the problem without putting my FreeNAS OS and/or its data at risk.
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
Well I ran a stress test for a couple hours on the CPU and RAM. No errors and no reboot.
I'm running a hard drive stress test right now which should also stress the controllers and put a decent IO load on the system.
The only thing that was consistent across the reboots was high NIC utilization, so I'm working on finding a way to stress the NIC.
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
I'm thinking it must have something to do with the NIC. I just saw a pause happen. I've been transferring files from the main server to the new server via ssh and the transfer stalled. So I turned on the monitor and saw the logs, but nothing out of the ordinary. I hit enter and the menu came up, but no IP address was shown for the interface. I waited about another 10 to 15 seconds and the menu refreshed with an IP address for the interface.

What does that mean? It's like the interface disappeared and then came back.

I attached another debug. BTW, how is it you are looking in these debugs?
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
Oh, and regarding the stress tests: everything ran fine with no errors, and now that I think about it more, the only time the system ever restarted was under high IO load on the NIC.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, these:
Jan 1 06:36:02 storage1 nmbd[7598]: [2016/01/01 06:36:02.177048, 0] ../source3/nmbd/nmbd_namequery.c:109(query_name_response)
Jan 1 06:36:02 storage1 nmbd[7598]: query_name_response: Multiple (2) responses received for a query on subnet 10.10.10.248 for name TECHSOLUTIONS<1d>.
Jan 1 06:36:02 storage1 nmbd[7598]: This response was from IP 10.10.10.252, reporting an IP address of 10.10.10.252.

Those are indicative of a network problem. Either you have a network loop so you're receiving the same packets twice (that would be very bad) or you have two machines named "TECHSOLUTIONS".

A loop can cause all sorts of weird and unexplainable problems, but I wouldn't expect that to be the cause of your problems. Still, considering the logs don't show the NIC going down, I do think you should fix this and see if that resolves your issue.

It's possible your NIC is going bad and doing weird crap though. That can't really be ruled out without simply trying another NIC.
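
To get a rough idea of who is answering for a name, the nmbd lines quoted above can be tallied by responder IP. This is only a sketch in Python: the two regular expressions match the message wording shown in this post, and `responders_by_name` is a hypothetical helper. As an assumption (not something the logs state), one responder IP showing up repeatedly would fit duplicated packets, while several different IPs would fit more than one machine claiming the name.

```python
import re
from collections import defaultdict

QUERY = re.compile(r"Multiple \((\d+)\) responses received .* for name (\S+?)\.$")
RESPONDER = re.compile(
    r"This response was from IP ([\d.]+), reporting an IP address of ([\d.]+)\.")

def responders_by_name(log_lines):
    """Map each queried NetBIOS name to the set of IPs that answered for it."""
    seen = defaultdict(set)
    current = None
    for line in log_lines:
        line = line.strip()
        q = QUERY.search(line)
        if q:
            current = q.group(2)  # remember which name the next lines refer to
            continue
        r = RESPONDER.search(line)
        if r and current is not None:
            seen[current].add(r.group(1))
    return dict(seen)

log = [
    "Jan 1 06:36:02 storage1 nmbd[7598]: query_name_response: Multiple (2) "
    "responses received for a query on subnet 10.10.10.248 for name TECHSOLUTIONS<1d>.",
    "Jan 1 06:36:02 storage1 nmbd[7598]: This response was from IP 10.10.10.252, "
    "reporting an IP address of 10.10.10.252.",
]
for name, ips in responders_by_name(log).items():
    print(name, sorted(ips))
```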
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
Understandably so. I am currently running a stress test on the NICs, running them at full speed for at least a couple of hours apiece from the server to a laptop, using the instructions found here:
http://tuxtweaks.com/2014/11/linux-network-speed-test/

So far no errors or reboots but I'm still testing.
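
For reference, the general idea of a sustained-transfer test can be sketched in Python. This is not the method from the linked page, just a stand-in: it pushes a fixed amount of data through a TCP socket and reports the rate. As written it runs over loopback so it is self-contained, which exercises the code but not a physical NIC; for a real test the receiver would have to run on the other machine. The function names, chunk size, and byte count are arbitrary choices.

```python
import socket
import threading
import time

def receive_all(server_sock, result):
    """Accept one connection and count bytes until the sender closes it."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["bytes"] = total

def measure_throughput(host="127.0.0.1", total_bytes=32 * 1024 * 1024):
    """Send total_bytes over TCP and return (bytes received, MB/s)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()

    payload = b"\0" * 65536
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total_bytes:
            client.sendall(payload)
            sent += len(payload)
    receiver.join()  # wait until the receiver has drained everything
    elapsed = time.perf_counter() - start
    server.close()
    return result["bytes"], result["bytes"] / elapsed / 1e6

received, mb_per_s = measure_throughput()
print(f"received {received} bytes at {mb_per_s:.0f} MB/s")
```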

I also noticed that the NMI error that I was receiving in the IPMI log didn't come up when booting to a live CD. I do however receive the NMI error when FreeNAS loads.
The software NMI error still shows "Software NMI @ BUS:0 /Dev:1 /Func:0 - Asserted".

And I would swap out the NIC, but it's an onboard NIC on the server board. I've been trying to build a 10Gb backbone for my network to go between servers, which would more than likely solve this problem, though building such a network on a budget is a real PITA. Finding an affordable 10Gb switch is easier said than done, but it's still in the works.

I'll continue testing and see if maybe I can narrow down which NIC it is. Worst case scenario I'll disable the NICs and install a PCI Express NIC.
I'll also work on fixing that name error.
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
I've been testing the NIC for a while and still no errors. I believe something in the OS is causing the error on the NIC and the reboot. I have been in contact with a SuperMicro tech about this and an NTP issue (a firmware issue), and he has pointed out that the NMI error points to the NIC, but it only comes up when FreeNAS is booting.

After this last test tonight I'll be starting FreeNAS back up on the server. I think I'm going to disable one of the NICs and then hit the server with some file transfers and see how it reacts. I'll try this on each NIC. I'll post back my results.
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
OK, definitely a software or driver problem. On a Live CD I was able to transmit from the server to another machine at 80MB/s nonstop for hours from each NIC without issues. I did the same with receive: 80MB/s for hours, no errors. As soon as I loaded FreeNAS back up and started sending it files, it didn't last more than 10 minutes before it locked up and rebooted.
I even changed the name on the server to avoid the name conflict.
I also changed from NIC 1 to NIC 2 with the same results.
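
To put the 80MB/s figure in the same units as a gigabit link, here is the arithmetic as a small Python sketch. It assumes "MB" means 10**6 bytes and ignores protocol overhead, so the percentage is only approximate.

```python
# Convert the sustained transfer rate from the Live CD test into megabits
# per second and compare it with the nominal 1000 Mbit/s of a gigabit link.
# Assumes "MB" means 10**6 bytes; protocol overhead is ignored.
rate_mbyte_per_s = 80
link_mbit_per_s = 1000

rate_mbit_per_s = rate_mbyte_per_s * 8
fraction_of_link = rate_mbit_per_s / link_mbit_per_s
print(f"{rate_mbit_per_s} Mbit/s, {fraction_of_link:.0%} of nominal line rate")
```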
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
I just changed ACPI from version 3 to version 2 and have been hammering the server for the past 8 hours without so much as a hiccup. The server was transmitting at about 700Mbps while receiving at about 200Mbps on the onboard gigabit NIC. However, I am still getting the "System Event,01/15/2016 17:26:42 Fri,Critical Interrupt,,Assertion: Critical Interrupt| Event = Software NMI" event.
According to this site...
http://publib.boulder.ibm.com/infoc...c/com.ibm.sysx.7944.doc/806f03131701xxxx.html

It says the resolution is to check the device driver (which one?), reinstall the device driver, update device drivers, and update firmware (already up to date; I've checked a few times just to be sure).

So I'm very much at a loss. I'm not sure if it's a hardware or software error, and I'm not sure where to go from here. I've tried disabling things in the CMOS setup without managing to get rid of the error. I even had to reset the CMOS because the system hard locked after PCIE Native support was set to Disabled.

As of this moment the SuperMicro tech is loading a similar system with FreeNAS to see if he can duplicate the error and I asked him what it would take just to send the board in to have it tested.
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
Further information............

1 01/15/2016 17:26:42 OEM Critical Interrupt Software NMI @ BUS:0 /Dev:1 /Func:0 - Asserted

00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
02:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
04:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
06:04.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
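
The bus/device/function in the NMI event can be matched against the listing above by converting it to the `bus:dev.func` form that lspci prints. A small Python sketch follows; it assumes the IPMI log prints the three numbers in decimal (for 0/1/0 that makes no difference), and both helper names are made up for illustration.

```python
import re

def nmi_to_bdf(event_text):
    """Turn 'BUS:0 /Dev:1 /Func:0' into the lspci-style address '00:01.0'."""
    m = re.search(r"BUS:(\d+)\s*/Dev:(\d+)\s*/Func:(\d+)", event_text)
    if m is None:
        raise ValueError("no BUS/Dev/Func triple found")
    # Assumption: the IPMI log prints these numbers in decimal.
    bus, dev, func = (int(x) for x in m.groups())
    return f"{bus:02x}:{dev:02x}.{func:x}"

def index_lspci(listing):
    """Map each 'bus:dev.func' address to the rest of its lspci line."""
    devices = {}
    for line in listing.strip().splitlines():
        address, description = line.split(" ", 1)
        devices[address] = description
    return devices

event = "OEM Critical Interrupt Software NMI @ BUS:0 /Dev:1 /Func:0 - Asserted"
# Three lines copied from the listing above.
listing = """
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
"""
address = nmi_to_bdf(event)
print(address, "->", index_lspci(listing).get(address, "not in listing"))
```

The address resolves to PCI Express Root Port 1, which is a bridge rather than an endpoint. The flat listing does not show which devices sit behind that bridge; the tree view from `lspci -t` would be the place to check that.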
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
Confirmed: the SuperMicro AOC-STGN-I2S 10Gb fiber card was causing the reboots. I installed it, hammered the server over the 1Gb onboard NIC, and REBOOT.

Mind you, the 10Gb card was not connected to any switches, so it had NO traffic.

So now, as far as I know, the only thing left to solve is the NMI error.
 