NFSD Crash, Watchdog issue, Nightly Reboot, X10-SRL-F

Status
Not open for further replies.

chani

Cadet
Joined
Dec 20, 2018
Messages
4
Hi there,

I've followed a lot of posts on this board regarding the watchdog issue. Sadly, I wasn't able to find a solution, either on this board or through my own attempts. First of all, the hardware:

- X10-SRL-F
- 128 GB ECC RAM
- 8x HGST Ultrastar 7.2k RPM, 6 TB each (we are in the process of replacing 8x WD Red with these HGST Ultrastar disks)
- Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz
- Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
- 2x LSI HBA 9300-8i
- Currently no dedicated ZIL device (we experimented with SSDs built for write-intensive tasks, and they helped a lot with our workload, but they are currently not attached)

We have two of these machines. One day, while running version 11.1, the system started to reboot every night. At first we found some disk errors and suspected defective disks, although all SMART values were fine and the self-tests passed. We replaced the disks in question, but found disk errors in dmesg again a few days later. Then we replaced the LSI HBAs with new ones, but the issues persisted. Then we noticed that the firmware of the LSI HBAs was from 2014, so we did a firmware and controller update. Since then the disk issues are gone, but the system is still rebooting multiple times every night.

So we replaced

- the CPU (issue persists)
- the M/B (issue still there)
- the RAM (issue still there - oh and actually, we increased RAM from 32 GB to 128 GB just to make sure)
- the PSUs (issue still there)
- the enclosures as well as the case (3U)
- all Multilane cables
- the 10G card

Finally (we've been fighting this issue for 3 weeks now) we assembled a second server with the exact same hardware and specs. The issue persists: the system reboots every night while under load. So we tried a FreeNAS update. The issue persists.

So we checked further. The crash log shows:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x3b8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80a8a60d
stack pointer = 0x28:0xfffffe2020ddf7a0
frame pointer = 0x28:0xfffffe2020ddf810
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2723 (nfsd: service)
version.txt:
FreeBSD 11.2-STABLE #0 r325575+fc3d65faae6(freenas/11.2-stable): Wed Dec 5 15:08:42 EST 2018
root@nemesis.tn.ixsystems.com:/freenas-11.2-releng/freenas/_BE/objs/freenas-11.2-releng/freenas/_BE/os/sys/FreeNAS.amd64
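
When comparing several dumps, it can help to pull the key fields out of a trap message like the one above. The following is only a minimal Python sketch; the helper name `parse_trap` is hypothetical and not part of any FreeBSD or FreeNAS tool, and the sample text is an excerpt of the log quoted in this post.

```python
import re

# Hypothetical helper: collect "key = value" lines from a FreeBSD trap message.
def parse_trap(text):
    fields = {}
    for line in text.splitlines():
        # Lines of interest look like "fault virtual address = 0x3b8"
        m = re.match(r"^\s*([A-Za-z ]+?)\s*=\s*(.+?)\s*$", line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

# Excerpt of the trap message quoted in this post
sample = """Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x3b8
fault code = supervisor read data, page not present
current process = 2723 (nfsd: service)"""

info = parse_trap(sample)
print(info["fault virtual address"])
print(info["current process"])
```

A fault address as small as 0x3b8 is often a sign of a NULL pointer dereference (a field offset applied to a NULL struct pointer), but that is an inference, not something the dump itself states.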

We're unsure whether this is the reason for the reboots and, if so, how to fix it. We also deactivated HT in the BIOS to make sure it's not HT which is causing this issue. We may replace RAM and CPU a third time, though I guess that won't help anyway. Before 11.1 the system was running without any trouble. We're also very confused about why the system reboots, since we disabled the watchdog:

The X10-SRL-F comes with a watchdog and IPMI. The IPMI event log shows that every time the system rebooted, "Watchdog 2" issued a hard reset. So we pulled the jumper (as described in the Supermicro manual) and disabled the watchdog in the BIOS. Just to make sure, we added ichwd_load="no" to loader.conf. No luck; the watchdog still reboots the system. So we tried setting it to yes and using ipmitool to set the timeout to 0, which should disable it, again without any luck.

It seems there's a second watchdog which cannot be deactivated and which somehow connects to the IPMI (how else would the IPMI event log show those resets?). We would love to disable ichwd (we believe this one is the cause of our problems), but ichwd is not a module; it is compiled into the kernel.

So we are actually dealing with two issues:

- The Kernel fault above
- The Reboots every night

Since these issues started with 11.1 and we have already replaced EVERYTHING, I presume this is a software issue (I'm not sure, though). It might be a Supermicro issue with current FreeBSD/FreeNAS systems. The crash log is attached.

We're running out of ideas now. Is there any FreeNAS developer who could help us fix the issue? We could hand out SSH/IPMI access to a system with this hardware for debugging purposes. Is there any information we could provide that would help? We're currently running tests with an older board (Supermicro X9), which has a different Intel chipset. I'll keep you updated.

Update:
- We also updated the BIOS of the M/B
- We also tried another 10G NIC
- With 11.1 the system was running fine for a few months. The reboots just started one day.

Kind regards
Jean
 

Attachments

  • textdump.tar.0.gz
    60.5 KB · Views: 245

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
`ichwd_load="no"` does nothing, since the ichwd driver is statically linked into the kernel. You would need something like `hint.ichwd.0.disabled=1`, assuming you really see an `ichwd0` device in your dmesg output.
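
As a rough illustration of that check, here is a minimal Python sketch that looks for `ichwd` device lines in captured dmesg text and prints the matching hint for each unit found. The helper names `ichwd_units` and `disable_hints` are hypothetical, and the sample dmesg excerpt is made up for illustration; it is not taken from the poster's system.

```python
import re

# Hypothetical helper: find ichwd unit numbers in captured dmesg text.
def ichwd_units(dmesg_text):
    matches = re.finditer(r"^ichwd(\d+):", dmesg_text, re.MULTILINE)
    return sorted({int(m.group(1)) for m in matches})

# Hypothetical helper: build one loader hint line per detected unit.
def disable_hints(units):
    return [f"hint.ichwd.{u}.disabled=1" for u in units]

# Made-up dmesg excerpt, for illustration only
sample_dmesg = """ipmi0: <IPMI System Interface> on isa0
ichwd0: <watchdog timer> on isa0
"""

units = ichwd_units(sample_dmesg)
hints = disable_hints(units)
for line in hints:
    print(line)
```

In practice you would feed it the real output of `dmesg`. If no `ichwd` line shows up there, the hint has nothing to act on and the resets are probably coming from somewhere else.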
 