Mr Snow
Dabbler
- Joined: May 22, 2016
- Messages: 29
Hey folks,
First up I'd like to say that this forum has been a fantastic resource for me putting together my build. I haven't felt the need to post anything thus far because the information available with a search usually answers my question. Not in this case however :)
TLDR: I think my motherboard has an issue, but I'm not sure. Supermicro X11SSM-F.
Build:
- MB: Supermicro X11SSM-F
- CPU: Intel Xeon E3-1230 V5
- Stock fan.
- I (only today) ran a "breakin" test on this. When the CPU was running at 100%, the temp was hovering around 80°C with spikes above that.
- RAM: Crucial 32GB Kit (2 x 16GB) DDR4-2400 ECC UDIMM (CT7982365 aka CT2K16G4WFD824A)
- The on-chip label says they are Micron MTA18ASF2G72AZ-2G3A1ZG. They are on the tested list (without the ZG on the end), and the IPMI HW info confirms this.
- MemTest86 ran for over 24 hours with no issues reported (PassMark v6.3.0, chosen for its UEFI and DDR4 support)
- Motherboard reports this as 2133MHz
- IPMI says Max speed 2400MHz, Operating Speed 2133MHz.
- PSU: Corsair RM550x (I tried to get the SeaSonic, but it is not very available in Australia)
- HDD: 7 x 3TB WD RED
- HDD burn-in performed properly (short, conveyance, long, badblocks -ws, long, with no reported errors).
- Boot: 2 x Cruiser Ultra Fit 16GB (USB3 sticks)
- I'm aware of the heat concerns around these sticks. I don't think it is related to my problem, however.
- Case: Fractal Design Node 804
- IPMI firmware flashed to 1.13 as part of diag
- BIOS version 1.0b
- BIOS set to full UEFI
- HDD spin up delay enabled
- Otherwise mostly defaults (switched some other device settings from Legacy to UEFI).
The first odd thing I noticed was when I was trying to use 3 Noctua fans I bought (model: NF-F12 PWM). They seemed to run fine, with an idle speed of 300rpm reported by IPMI (I had adjusted the thresholds per this thread). But intermittently (anywhere from 5 seconds to several hours apart), IPMI would report one of the fans dropping to 0rpm. This would trigger a full-speed spin-up of all fans, and they would immediately settle back to 300rpm. However, when I physically watched this happen, the fan in question did not stop spinning. I tried swapping fan headers, etc., but the problem remained. I've now removed all of the Noctua fans because it kept happening even with just 1 fan in the system. I reported this to Supermicro support, but they just advised me to use one of their fans.
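For anyone wanting to check for the same pattern, here is a rough Python sketch of how logged fan readings could be scanned for one-off 0rpm samples that sit between normal readings. It is only an illustration: the `samples` list is made-up data, `find_zero_glitches` and `max_gap_s` are names invented for this example, and how the readings get collected from the BMC is not shown.

```python
from typing import List, Tuple

def find_zero_glitches(samples: List[Tuple[float, int]],
                       max_gap_s: float = 10.0) -> List[float]:
    """Return timestamps of 0 rpm readings bracketed by nonzero readings.

    samples is a list of (timestamp_seconds, rpm) tuples in time order.
    A zero counts as a glitch only if the readings just before and just
    after it are nonzero and no more than max_gap_s seconds apart.
    """
    glitches = []
    for i in range(1, len(samples) - 1):
        t_prev, rpm_prev = samples[i - 1]
        t, rpm = samples[i]
        t_next, rpm_next = samples[i + 1]
        if rpm == 0 and rpm_prev > 0 and rpm_next > 0 and (t_next - t_prev) <= max_gap_s:
            glitches.append(t)
    return glitches

# Hypothetical readings: idle at 300 rpm, one zero sample, a spin-up, back to idle.
samples = [(0.0, 300), (2.0, 300), (4.0, 0), (6.0, 1500), (8.0, 300)]
print(find_zero_glitches(samples))
```

A single zero wedged between normal readings looks more like a reporting problem than a fan that really stopped, which fits what I saw when watching the fan, but it does not say whether the fault is in the fan's tach signal, the header, or the BMC.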
More recently, I've progressed to getting the odd kernel panic. These seem to occur when I'm playing around with jails. It got to a point where just installing a new jail would cause a kernel panic.
A relevant section from the most recent crash dump:
Code:
<7>ifa_del_loopback_route: deletion failed: 48
Freed UMA keg (udp_inpcb) was not empty (240 items). Lost 24 pages of memory.
Freed UMA keg (udpcb) was not empty (2171 items). Lost 13 pages of memory.
Freed UMA keg (tcptw) was not empty (1035 items). Lost 23 pages of memory.
Freed UMA keg (tcp_inpcb) was not empty (349 items). Lost 35 pages of memory.
Freed UMA keg (sackhole) was not empty (375 items). Lost 3 pages of memory.
Freed UMA keg (tcpcb) was not empty (89 items). Lost 30 pages of memory.
hhook_vnet_uninit: hhook_head type=1, id=1 cleanup required
hhook_vnet_uninit: hhook_head type=1, id=0 cleanup required
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x378
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8098e0fd
stack pointer = 0x28:0xfffffe08317c5720
frame pointer = 0x28:0xfffffe08317c57b0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock)
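To make the "Freed UMA keg" lines easier to compare between crashes, a small Python sketch like the one below could tally them. The `DUMP` string is copied from the excerpt above; the function name `summarize_uma_leaks` and the regular expression are my own and only assume the message wording shown in this dump.

```python
import re

# The "Freed UMA keg" lines from the crash dump excerpt in this post.
DUMP = """
Freed UMA keg (udp_inpcb) was not empty (240 items). Lost 24 pages of memory.
Freed UMA keg (udpcb) was not empty (2171 items). Lost 13 pages of memory.
Freed UMA keg (tcptw) was not empty (1035 items). Lost 23 pages of memory.
Freed UMA keg (tcp_inpcb) was not empty (349 items). Lost 35 pages of memory.
Freed UMA keg (sackhole) was not empty (375 items). Lost 3 pages of memory.
Freed UMA keg (tcpcb) was not empty (89 items). Lost 30 pages of memory.
"""

# Assumed message format, taken only from the lines above.
KEG_RE = re.compile(
    r"Freed UMA keg \((\w+)\) was not empty \((\d+) items\)\. "
    r"Lost (\d+) pages of memory\."
)

def summarize_uma_leaks(text):
    """Return ({keg_name: (items, pages)}, total_pages) for the given text."""
    leaks = {
        name: (int(items), int(pages))
        for name, items, pages in KEG_RE.findall(text)
    }
    total_pages = sum(pages for _, pages in leaks.values())
    return leaks, total_pages

leaks, total_pages = summarize_uma_leaks(DUMP)
for name, (items, pages) in leaks.items():
    print(f"{name}: {items} items, {pages} pages")
print("total pages lost:", total_pages)
```

Every keg named here belongs to the TCP/UDP side of the network stack, and the messages sit right next to the `hhook_vnet_uninit` lines, so the tally at least shows that the leak reports cluster around the teardown of a jail's network stack. It does not by itself say whether hardware or software is at fault.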
My research suggests faulty memory, but the MemTest86 result argues against that (I'll rerun the memtest if required). Some other research suggested the "Lost XX pages of memory" messages were an issue with VIMAGE and virtual NICs, but that was back in the FreeBSD 8.x days. This set of posts led me to think that UEFI might be my problem as well.
I wiped the jail dataset in case the jail template was corrupt. Things went well until I was moving some files around inside a jail and I had another kernel panic. It was getting late (last night) so I decided to give up for the night. One of the reboots had an IPMI watchdog event associated with it, even though I have watchdog disabled in the BIOS.
This morning I had another look. I was playing with some jails again (copying a bundle into Plex via shell) when I got another kernel panic. I decided at this point to reset the BIOS to defaults, just in case UEFI support was flaky in FreeBSD 10.3. Whilst I was in the BIOS screen (after I had loaded defaults), the system rebooted itself. IPMI reported another watchdog event and, strangely, a chassis intrusion event. I do not have anything hooked up to that header (JL1). I've got the system running at the moment to see if I can reproduce the kernel panic consistently (haven't managed this yet), and the chassis intrusion alert is still active.
I suspect a hardware issue here, most likely the motherboard (the fan speed reporting error and the chassis intrusion alert are the main things pointing me that way).
If anyone has any thoughts or suggestions on where to go from here, it would be greatly appreciated. I'm kind of stuck with what to do next. I can't really afford to buy new MB, CPU and/or RAM to test which component is actually problematic. And I'm not keen on completing my migration project (moving from QNAP 419PII to this system) until I have a stable FreeNAS box :)
Regards,
CJ
PS I've followed up with Supermicro support mentioning the reboots, chassis intrusion and watchdog trips. Awaiting an email reply.