Kernel panics - not sure if HW issue

Status
Not open for further replies.

Mr Snow

Dabbler
Joined
May 22, 2016
Messages
29
Hey folks,

First up I'd like to say that this forum has been a fantastic resource for me putting together my build. I haven't felt the need to post anything thus far because the information available with a search usually answers my question. Not in this case however :)

TLDR: I think my motherboard has an issue, but I'm not sure. Supermicro X11SSM-F.

Build:
  • MB: Supermicro X11SSM-F
  • CPU: Intel Xeon E3-1230 V5
    • Stock fan.
    • I (only today) ran a "breakin" test on this. When the CPU was running at 100%, the temp was hovering around 80°C with spikes above that.
  • RAM: Crucial 32GB Kit (2 x 16GB) DDR4-2400 ECC UDIMM (CT7982365 aka CT2K16G4WFD824A)
    • The chip labels say they are Micron MTA18ASF2G72AZ-2G3A1ZG. They are on the tested list (without the ZG on the end). IPMI HW info validates this.
    • MemTest86 run for over 24 hours with no issues reported (Passmark v6.3.0 due to UEFI and DDR4 support)
    • Motherboard reports this as 2133MHz
    • IPMI says Max speed 2400MHz, Operating Speed 2133MHz.
  • PSU: Corsair RM550x (I tried to get the SeaSonic, but it is not very available in Australia)
  • HDD: 7 x 3TB WD RED
    • HDD burn in performed properly (short, conveyance, long, badblocks -ws, long with no reported errors).
  • Boot: 2 x Cruiser Ultra Fit 16GB (USB3 sticks)
    • I'm aware of the heat concerns around these sticks. I don't think it is related to my problem however
  • Case: Fractal Design Node 804
  • IPMI firmware flashed to 1.13 as part of diag
  • BIOS version 1.0b
    • BIOS set to full UEFI
    • HDD spin up delay enabled
    • Otherwise mostly defaults (switched some other device settings from Legacy to UEFI).
I think that covers everything.

The first odd thing I noticed was when I was trying to use 3 Noctua fans I bought (Model: NF-F12 PWM). They seemed to run fine with an idle speed of 300rpm reported by IPMI (I had adjusted the thresholds per this thread). But intermittently (anything from a 5 second gap to several hours gap), IPMI would report one of the fans dropping to 0rpm. This would trigger a full speed spin up of all fans, and they would immediately settle back to 300rpm. However when I physically watched this happen, the fan in question did not stop spinning. I tried swapping FAN headers, etc but the problem remained. I've now removed all of the Noctua fans because even with 1 fan in the system, it still kept happening. I reported this to Supermicro support but they just advised me to use one of their fans.

More recently, I've progressed to getting the odd kernel panic. These seem to occur when I'm playing around with jails. It got to a point where just installing a new jail would cause a kernel panic.

A relevant section from the most recent crash dump:

Code:
<7>ifa_del_loopback_route: deletion failed: 48
Freed UMA keg (udp_inpcb) was not empty (240 items).  Lost 24 pages of memory.
Freed UMA keg (udpcb) was not empty (2171 items).  Lost 13 pages of memory.
Freed UMA keg (tcptw) was not empty (1035 items).  Lost 23 pages of memory.
Freed UMA keg (tcp_inpcb) was not empty (349 items).  Lost 35 pages of memory.
Freed UMA keg (sackhole) was not empty (375 items).  Lost 3 pages of memory.
Freed UMA keg (tcpcb) was not empty (89 items).  Lost 30 pages of memory.
hhook_vnet_uninit: hhook_head type=1, id=1 cleanup required
hhook_vnet_uninit: hhook_head type=1, id=0 cleanup required

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x378
fault code     = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff8098e0fd
stack pointer    = 0x28:0xfffffe08317c5720
frame pointer    = 0x28:0xfffffe08317c57b0
code segment     = base 0x0, limit 0xfffff, type 0x1b
       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process     = 12 (swi4: clock)


My research suggests faulty memory, but the MemTest86 result argues against that. (I'll rerun the memtest if required). Some other research suggested the "Lost XX pages of memory" was an issue with VIMAGE and virtual NICs, but that was back in the FreeBSD 8.x days. This set of posts led me to think that UEFI might be my problem as well.
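For a sense of scale, here is a rough Python sketch that totals the "Lost N pages" lines from the dump above. The 4 KiB page size is an assumption (the usual amd64 base page size), and the helper name lost_pages is made up for illustration. It only adds up what the kernel already printed and says nothing about the cause.

```python
import re

# Lines copied verbatim from the crash dump quoted above.
DUMP = """\
Freed UMA keg (udp_inpcb) was not empty (240 items).  Lost 24 pages of memory.
Freed UMA keg (udpcb) was not empty (2171 items).  Lost 13 pages of memory.
Freed UMA keg (tcptw) was not empty (1035 items).  Lost 23 pages of memory.
Freed UMA keg (tcp_inpcb) was not empty (349 items).  Lost 35 pages of memory.
Freed UMA keg (sackhole) was not empty (375 items).  Lost 3 pages of memory.
Freed UMA keg (tcpcb) was not empty (89 items).  Lost 30 pages of memory.
"""

PAGE_SIZE = 4096  # assumption: 4 KiB base pages

KEG_RE = re.compile(
    r"Freed UMA keg \((\w+)\) was not empty \((\d+) items\)\.\s+Lost (\d+) pages"
)


def lost_pages(text):
    """Return {keg_name: pages_lost} for every 'Freed UMA keg' line (hypothetical helper)."""
    return {m.group(1): int(m.group(3)) for m in KEG_RE.finditer(text)}


pages = lost_pages(DUMP)
total = sum(pages.values())
print(total, "pages =", total * PAGE_SIZE // 1024, "KiB")
```

Under that page-size assumption the six kegs add up to 128 pages, or about 512 KiB, all from network-stack zones (UDP/TCP control blocks).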

I wiped the jail dataset in case the jail template was corrupt. Things went well until I was moving some files around inside a jail and I had another kernel panic. It was getting late (last night) so I decided to give up for the night. One of the reboots had an IPMI watchdog event associated with it. I have watchdog disabled in the BIOS.

This morning I had another look. I was playing with some jails again (copying a bundle in to plex via shell) when I got another kernel panic. I decided at this point to reset the BIOS to default just in case UEFI support was flaky in FreeBSD 10.3. Whilst I was in the BIOS screen (after I had loaded defaults), the system rebooted itself. IPMI reported another watchdog event, and strangely mentioned a Chassis intrusion event. I do not have anything hooked up to that header (JL1). I've got the system running at the moment to see if I can replicate the kernel panic in a consistent manner (haven't managed this yet), and the chassis intrusion alert is still active.

I suspect a hardware issue here, and most likely the motherboard (the fan speed reporting error and the chassis intrusion alert are my main trigger thoughts).

If anyone has any thoughts or suggestions on where to go from here, it would be greatly appreciated. I'm kind of stuck with what to do next. I can't really afford to buy new MB, CPU and/or RAM to test which component is actually problematic. And I'm not keen on completing my migration project (moving from QNAP 419PII to this system) until I have a stable FreeNAS box :)

Regards,

CJ

PS I've followed up with Supermicro support mentioning the reboots, chassis intrusion and watchdog trips. Awaiting an email reply.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Fan issues, the erroneous intrusion alarm and rebooting while still in the BIOS certainly sound like h/w. Just a quick note that I have the same m/board and also have rebooting issues as soon as I try to use jails in FreeNAS. There's a thread on it ("jails and stability") and I haven't resolved it as other priorities took over. However, as long as I stay away from jails things work perfectly, so it's not quite the same scenario as yours.
 

Mr Snow

Dabbler
Joined
May 22, 2016
Messages
29
I posted in that thread. Thanks for letting me know about it... :)

As an interesting note, I forgot to mention in my OP that I had changed the jumper on JWD1 to hardware disable the system watchdog. I didn't at the time relate that to the chassis intrusion. Just now I swapped the JWD1 jumper back to its default setting and the chassis intrusion alert has gone away.

I'm actually starting to wonder if there is nothing wrong with the MB at all. I'll keep this thread updated with my progress :)

Regards,

CJ
 

Mr Snow

Dabbler
Joined
May 22, 2016
Messages
29
It looks like my issue was primarily the jails and stability thing (thread here).

Supermicro have given me an updated IPMI firmware to try in relation to the watchdog alerts. I'm hoping that it also might fix the FAN speed detection issue. If not, I'll just write it off as a compatibility issue between the MB and the fans, and then go shopping for different PWM fans that are quiet enough for the lounge room :)

Regards,

CJ
 

JDCynical

Contributor
Joined
Aug 18, 2014
Messages
141
MemTest86 run for over 24 hours with no issues reported (Passmark v6.3.0 due to UEFI and DDR4 support)
Just a side note on this, if there was/is a memory problem, memtest may not show it when running ECC memory because, hey, it's ECC, if there is a memory problem, it tries to fix it. ;)

Check your IPMI logs and see if there are any memory errors showing up as well, just to be sure.

IIRC, jgreco ran into this a bit ago with some system issues on an i3?

EDIT: Correction, it was @Ericloewe (https://forums.freenas.org/index.php?members/ericloewe.37321/), per the man himself :)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
IIRC, @jgreco ran into this a bit ago with some system issues on an i3?
No, that was me.

@jgreco doesn't waste time with i3s ("You call that a server CPU? This is a server CPU.").

Though memtest86 (non-plus), in the latest versions, is supposedly capable of meaningful interactions with the memory controller in order to grab ECC status.
I plan to check this out soon, since I have a DIMM that throws ECC errors. I'll be sure to share my findings then.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
I'm interested - what issues were they? I've also given up on Jails for now as they seem riddled with problems on my hardware (X11SSM-F, i3-6100, 1x16GB ECC, Seasonic 550W) and setup (latest 9.10-STABLE, simple RAIDZ2 setup (6x3TB) with CIFS and NFS shares). Best suggestion I got was faulty RAM but if so it's not one that memtest86 can detect.

Also I have yet to find someone who runs the same hardware and can tell me it works for them. You are pretty close @Ericloewe, albeit one model up on the CPU. Have you successfully run Jails without messing with the CIFS-service for Mac clients, and indeed without random kernel panics and reboots?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'm interested - what issues were they?
No issues. It's just that memtest86+ is blissfully unaware of ECC. So, if ECC corrections take place, it won't notice. That's why you should check the IPMI log, which will contain such events.
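A minimal Python sketch of that check, assuming ipmitool is installed and can reach the local BMC. `ipmitool sel elist` is a real subcommand, but the keyword list and the sample log lines in the usage note are guesses at how a BMC words ECC events, so treat them as placeholders and adjust to what your log actually shows.

```python
import subprocess

# Assumption: ECC events mention one of these words; wording varies by BMC.
KEYWORDS = ("memory", "ecc")


def memory_events(sel_lines):
    """Keep only the event-log lines that mention memory or ECC."""
    return [
        line for line in sel_lines
        if any(word in line.lower() for word in KEYWORDS)
    ]


def read_sel():
    """Return the local BMC's System Event Log as a list of text lines."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

On the box itself, `memory_events(read_sel())` returns whatever matching entries exist. An empty list is consistent with no logged ECC events, but it could also mean the keywords did not match the BMC's wording, so it is worth eyeballing the full log once.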

Have you successfully run Jails without messing with the CIFS-service for Mac clients, and indeed without random kernel panics and reboots?
That's not quite my use case. In any case, the hardware works fine and your issues are not suggestive of hardware incompatibility.
 

Mr Snow

Dabbler
Joined
May 22, 2016
Messages
29
I've seen nothing in the IPMI logs around memory issues. I've currently got 4 jails running and no random reboots. There was one exception I mentioned in the other thread, but I'm confident that was triggered by bad process on my part (I'll elucidate in that thread).

The only outstanding issue I have currently is the FAN speed issue. I'm planning to buy a different PWM fan that doesn't run quite so slow at the low end and see how I go.

Regards,

CJ
 

Mr Snow

Dabbler
Joined
May 22, 2016
Messages
29
Just to close off this thread... It looks like it was an issue with the Noctua fans that I got, quite probably specific to that model on the X11SSM-F board, because all 4 that I ended up purchasing exhibited the same issue. I've since bought Fractal Design Venturi 120mm PWM fans that are working great for me.

Regards,

CJ
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
What was the issue with the Noctuas?

I have 120, 90 and 80mm Noctuas in my x10sri-F box and they're working great.

So quiet :)

I had to adjust the fan thresholds though.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What was the issue with the Noctuas?

I have 120, 90 and 80mm Noctuas in my x10sri-F box and they're working great.

So quiet :)

I had to adjust the fan thresholds though.
I imagine some of the really low-speed models might end up being too slow for the BMC to make meaningful measurements. Many models are rated for 300 RPM ±20% minimum, which means something like 240 RPM. The BMC's resolution seems to be 100 RPM (at least in the configuration used by Supermicro), so that barely leaves you two least significant digits in your measurements. It's barely useful as a measurement.
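To put numbers on that, a small Python sketch. It assumes the BMC rounds readings down to the nearest 100 RPM step. The rounding direction is a guess, and reported_rpm is a made-up helper, not anything the BMC exposes.

```python
def reported_rpm(actual_rpm, step=100):
    """Round a tach reading down to the BMC's assumed reporting step."""
    return (actual_rpm // step) * step


RATED_MIN_RPM = 300      # rated minimum speed from the post above
TOLERANCE_PERCENT = 20   # the +/-20% tolerance from the post above

# Integer arithmetic keeps the result exact: 300 * 80 // 100 = 240.
worst_case = RATED_MIN_RPM * (100 - TOLERANCE_PERCENT) // 100

for rpm in (RATED_MIN_RPM, worst_case, 199):
    print(rpm, "RPM actual ->", reported_rpm(rpm), "RPM reported")
```

Under those assumptions a fan idling at its worst-case 240 RPM reads as 200, and a reading only has to sag one more step to cross a low threshold set just above 100.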
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Right, I had to go with 0,0,200 to prevent false positives in the logs.

All I care about though is 0 RPM or if it's spinning
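For anyone wanting to try the same thing, a hedged Python sketch that only builds the ipmitool command line. It assumes "0,0,200" means lower non-recoverable, lower critical and lower non-critical in that order, and that the sensor is named FAN1. Both are assumptions, so check `ipmitool sensor` for the real sensor names on your board.

```python
def lower_threshold_cmd(sensor, lnr, lcr, lnc):
    """Build an `ipmitool sensor thresh` command that sets the three lower thresholds.

    Order is lower non-recoverable, lower critical, lower non-critical.
    """
    return [
        "ipmitool", "sensor", "thresh", sensor,
        "lower", str(lnr), str(lcr), str(lnc),
    ]


# "FAN1" is a placeholder sensor name.
cmd = lower_threshold_cmd("FAN1", 0, 0, 200)
print(" ".join(cmd))
```

The sketch prints the command rather than running it, so it can be reviewed before it touches the BMC.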
 

Mr Snow

Dabbler
Joined
May 22, 2016
Messages
29
I imagine some of the really low-speed models might end up being too slow for the BMC to make meaningful measurements. Many models are rated for 300 RPM ±20% minimum, which means something like 240 RPM. The BMC's resolution seems to be 100 RPM (at least in the configuration used by Supermicro), so that barely leaves you two least significant digits in your measurements. It's barely useful as a measurement.

This I think. The BMC was reporting 0 RPM (and triggering alarms of course) even though the fan was still spinning. I'm still tempted to try one of the Noctua Industrial 2000rpm ones. I've had success with my Fractal Design fans though, so I should stick with those :)

Regards,

CJ
 

sremick

Patron
Joined
Sep 24, 2014
Messages
323
I keep reading about this issue, and I started reading about it before I even built my desktop workstation (which uses a SuperMicro motherboard and Noctua fans). Somehow I've avoided the issue...

...and probably just jinxed myself.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
This I think. The BMC was reporting 0 RPM (and triggering alarms of course) even though the fan was still spinning. I'm still tempted to try one of the Noctua Industrial 2000rpm ones. I've had success with my Fractal Design fans though, so I should stick with those :)

Regards,

CJ
I think the issue is that in Optimal mode the BMC sets the fans to 20%, and the 120mm Noctua NF-F12 stalls below 25% (according to the BMC).
 

Mr Snow

Dabbler
Joined
May 22, 2016
Messages
29
I think the issue is that in Optimal mode the BMC sets the fans to 20%, and the 120mm Noctua NF-F12 stalls below 25% (according to the BMC).

An accurate assessment. I think avoiding the Noctua NF-F12, at least in these circumstances, is probably appropriate.

I think an investment in one of the faster Noctua fans would be a good thing for me to do. For testing of course.

Regards,

CJ
 