FreeNAS 11.2U5 random unexpected reboots

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
I have a FreeNAS box in use that has been randomly experiencing unexpected reboots. This box is hosting iSCSI targets for a VMware server, so it has VMs running on it all the time. The server is a Dell R710, and the connectivity between FreeNAS and the VMware host is Dell Broadcom 10Gb SFP+ fiber cards.

When I initially set this box up and had it running for several weeks for 'burn in', I didn't have a single hiccup. It ran rock solid the entire time. A few days after deploying it for production use, I noticed from an e-mailed report that it seemed to have rebooted. The really odd thing is that none of the VMs seemed to have actually rebooted - they all showed more uptime than would be possible if the storage had disappeared and forced a hard reboot of all the guests. I checked the hardware logs via the iDRAC, and there are no hardware issues logged, and I haven't seen anything that stands out in FreeNAS.

Does anyone have any suggestions on where to look to try to determine why it seems to keep rebooting unexpectedly?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
You could start by looking at dmesg
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Unless I'm missing something, dmesg only goes back to the beginning of the current boot. I looked through the output, but there was nothing of note. I did some searching and found that there should be a dmesg, dmesg.0, dmesg.1.gz etc, but all I found in the /var/log folder for dmesg was dmesg.today and dmesg.yesterday, and the two are pretty much identical. Is there some additional logging that could/should be turned on, or somewhere else I should look?

I don't suppose the WebUI has something somewhere to download a log bundle like Plex does? Personally, I think it would be really helpful if you could go somewhere in the WebUI and download a complete log bundle to sort through when you need to look for something.
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
I'm running FreeNAS-11.2-U6 on an HP Microserver Gen8.

Everything was working fine until I made some changes:
* Added an HP 2 port 10 Gbps NIC (QLogic Corp. cLOM8214). 1 port goes to the switch for NFS/CIFS shares, the other goes to an xcp-ng server for iSCSI. Both are using jumbo frames of 9000
* Existing Broadcom is now used for jails (VLAN tagged)

I started getting random reboots with this change. Unfortunately xcp-ng does not behave well when the NAS reboots. The VMs become unstable and I am forced to shut them down and restart them every time the NAS reboots.

The files /var/log/messages.* seem to preserve the logs from previous boots, and I see the following (ql1 is my iSCSI interface).

Code:
Oct 22 01:25:04 nas ql1: qla_dump_buf8: qla_hw_send: wrong pkt 0x42 dump start
Oct 22 01:25:04 nas ql1: 0x00000000: 00 e0 67 0f 1e 2c 28 80 23 41 b5 7c 08 00 45 00
Oct 22 01:25:04 nas ql1: 0x00000010: 23 28 00 00 40 00 40 06 07 73 c0 a8 58 08 c0 a8
Oct 22 01:25:04 nas ql1: 0x00000020: 37 04 08 01 03 27 4f ae 79 bc 56 de 72 5b 80 10
Oct 22 01:25:04 nas ql1: 0x00000030: 71 c7 33 78 00 00 01 01 08 0a 59 2a 39 a5 30 1e
Oct 22 01:25:04 nas ql1: 0x00000040: 6d 95
Oct 22 01:25:04 nas ql1: qla_dump_buf8: qla_hw_send: wrong pkt dump end
Oct 22 03:45:00 nas ZFS: vdev state changed, pool_guid=5875210349648024569 vdev_guid=8171063140191737537
Oct 22 19:00:07 nas qla_dmamap_callback: bus_dmamap_load failed (27)
Oct 22 19:00:07 nas ql1: qla_get_mbuf: bus_dmamap_load failed
Oct 22 19:00:07 nas ql1: qla_replenish_jumbo_rx: qla_get_mbuf [1,(507),(1472)] failed
Oct 22 19:15:37 nas qla_dmamap_callback: bus_dmamap_load failed (27)
Oct 22 19:15:37 nas ql1: qla_get_mbuf: bus_dmamap_load failed
Oct 22 19:15:37 nas ql1: qla_replenish_jumbo_rx: qla_get_mbuf [1,(1775),(882)] failed
Oct 22 20:11:54 nas qla_dmamap_callback: bus_dmamap_load failed (27)
Oct 22 20:11:54 nas ql1: qla_get_mbuf: bus_dmamap_load failed
Oct 22 20:11:54 nas ql1: qla_replenish_jumbo_rx: qla_get_mbuf [1,(1521),(573)] failed
Oct 22 21:12:11 nas syslog-ng[2664]: syslog-ng starting up; version='3.20.1'
Oct 22 21:12:11 nas Copyright (c) 1992-2018 The FreeBSD Project.
Oct 22 21:12:11 nas Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Oct 22 21:12:11 nas     The Regents of the University of California. All rights reserved.
Oct 22 21:12:11 nas FreeBSD is a registered trademark of The FreeBSD Foundation.
Oct 22 21:12:11 nas FreeBSD 11.2-STABLE #0 r325575+5920981193f(HEAD): Mon Sep 16 23:00:13 UTC 2019
Oct 22 21:12:11 nas root@nemesis:/freenas-releng/freenas/_BE/objs/freenas-releng/freenas/_BE/os/sys/FreeNAS.amd64 amd64
Oct 22 21:12:11 nas FreeBSD clang version 6.0.0 (tags/RELEASE_600/final 326565) (based on LLVM 6.0.0)
Oct 22 21:12:11 nas VT(vga): resolution 640x480

Fair to assume jumbo frames caused this reboot? OP are you using the same NIC as I am?
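
For what it's worth, the 66 bytes in that 'wrong pkt' dump can be decoded to see what kind of packet the driver was complaining about. Below is a minimal Python sketch that does this, assuming the dump is a plain Ethernet II frame carrying IPv4 and TCP (the function and field names are made up for illustration, they are not from the driver). It only describes the packet; it doesn't prove that packet caused the reboot.

```python
import socket
import struct

# Bytes copied from the "wrong pkt 0x42" dump in the log above (0x42 = 66 bytes).
DUMP_HEX = """
00 e0 67 0f 1e 2c 28 80 23 41 b5 7c 08 00 45 00
23 28 00 00 40 00 40 06 07 73 c0 a8 58 08 c0 a8
37 04 08 01 03 27 4f ae 79 bc 56 de 72 5b 80 10
71 c7 33 78 00 00 01 01 08 0a 59 2a 39 a5 30 1e
6d 95
"""


def decode_frame(hex_text):
    """Decode Ethernet + IPv4 + TCP headers (hypothetical helper, assumes that layout)."""
    raw = bytes.fromhex("".join(hex_text.split()))

    # Ethernet II header: destination MAC, source MAC, EtherType.
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", raw[:14])

    # Fixed 20-byte part of the IPv4 header.
    (ver_ihl, _tos, total_len, _ident, flags_frag,
     ttl, proto, _cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[14:34])
    ip_header_len = (ver_ihl & 0x0F) * 4

    # The first four bytes of the TCP header are the two port numbers.
    tcp_start = 14 + ip_header_len
    src_port, dst_port = struct.unpack("!HH", raw[tcp_start:tcp_start + 4])

    return {
        "frame_bytes": len(raw),
        "src_mac": src_mac.hex(":"),
        "dst_mac": dst_mac.hex(":"),
        "ethertype": hex(ethertype),
        "ip_total_length": total_len,
        "dont_fragment": bool(flags_frag & 0x4000),
        "ttl": ttl,
        "protocol": proto,
        "src_ip": socket.inet_ntoa(src),
        "dst_ip": socket.inet_ntoa(dst),
        "src_port": src_port,
        "dst_port": dst_port,
    }


for key, value in decode_frame(DUMP_HEX).items():
    print(f"{key}: {value}")
```

Decoded this way, the dump is just the headers of a TCP segment (protocol 6) whose IPv4 total length is 9000 bytes with the don't-fragment bit set, sent from port 2049 (the registered NFS port) on 192.168.88.8. So the packet the driver flagged was a full-size jumbo frame.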
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
I did a quick search for 'LOM8214', but didn't come up with anything conclusive, so I'm not really sure what chip that NIC uses. The NICs in use on the affected system are QLogic/Broadcom 57810S 2 port PCIe NICs. But here's the kicker - I'm using one of the same on my FreeNAS, running on an R620, along with a 10G/GbE 2+2P 57800 rNDC (was running three of the four connections until recently, now running one connection on each NIC) and I haven't had a single hiccup, running 11.1U7. I'm PRETTY sure the 57810S and 57800 rNDC run the exact same controller, but either way, I do have one of the same exact NICs in my unaffected system.

I am running jumbo frames at 9000, and I never had even one error message on my setup until I changed things around and was able to provide multiple links into the FreeNAS box from my VMware servers. Previously, each VMware box had one discrete link to FreeNAS. Now my setup has two links to FreeNAS, one into each of the FN410S switches, on separate VLANs and network scopes, and I've been getting 'no ping reply (NOP-Out) after 5 seconds; dropping connection' output. This doesn't seem to be having any impact on functionality, as nothing crashes or vanishes. It did not start until I set up the multiple links.

Realistically, my setup that has been solid except for the seemingly benign 'NOP-Out' messages, is pretty close to functionally identical to the one that's rebooting except mine is still running 11.1U7 and the one having issues is running 11.2U5. I wonder if this is something somehow related to 11.2? Maybe that in combination with these particular NICs?

The odd part about the rebooting system is that, aside from the SQL service on a SQL server stopping, the servers running on it don't seem to be rebooting after FreeNAS goes on its brief walkabout. It's ALMOST as if it's not actually rebooting.

Are FreeNAS config backups 'version agnostic'? I know with VMware, if you take a backup of an ESXi machine you need to restore it to the SAME build or things could go sideways in a hurry. If I decided I wanted to try it, could I take a backup from 11.2U5, reload the machine with 11.1U7 and restore that config to it, leaving it fully functional as if nothing ever happened?
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
SubnetMask said:
"I did a quick search for 'LOM8214', but didn't come up with anything conclusive, so I'm not really sure what chip that NIC uses"

It's using the Marvell 8200 series chipset.

SubnetMask said:
"I am running jumbo frames at 9000, and I never had even one error message on my setup until I changed things around and was able to provide multiple links into the FreeNAS box from my VMWare servers - previously, each VMWare box had one discrete link to FreeNAS, now, my setup has two links to FreeNAS, one into each of one FN410S switches, on separate VLANs and network scopes, and I've been getting 'no ping reply (NOP-Out) after 5 seconds; dropping connection' output, but this doesn't seem to be having any impact on functionality, as nothing crashes or vanishes. This did not start until I set up the multiple links."

I have that NOP-Out message too. But only once on each reboot. Hopefully it's this MTU. I've set it back to 1500 now and will monitor how it goes.
Hope it'll fix the random reboots I have.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
agent_kith said:
"It's using the marvel 8200 series chipset."

agent_kith said:
"I have that NOP-Out message too. But only once on each reboot. Hopefully it's this MTU. I've set it back to 1500 now and will monitor how it goes. Hope it'll fix the random reboots I have."

To be honest, I'm not sure it's MTU related - I've had my MTU set to 9000 since day one, but these 'NOP-Out' messages NEVER showed up when each VMware machine only had one connection to the FreeNAS box. Only once I changed things so that the VMware machines each had two links back to FreeNAS did the NOP-Out messages start appearing for me.
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
You may be right. My guess was based on the logs above:
Code:
Oct 22 20:11:54 nas ql1: qla_replenish_jumbo_rx: qla_get_mbuf [1,(1521),(573)] failed
Oct 22 21:12:11 nas syslog-ng[2664]: syslog-ng starting up; version='3.20.1'
Now that I look at it again, those two lines are about an hour apart. :D So it seems the system just rebooted without any warning.

Will see how it goes with MTU set to 1500, if it still reboots then I'll change from iSCSI to NFS and see if that'll fix it. Let me know if you can find any way to troubleshoot this.
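
One way to check this across a whole log file instead of eyeballing it is a small script that finds each boot and reports the last line logged before it, plus the time gap. This is only a sketch under a few assumptions: it treats the 'syslog-ng starting up' line as the boot marker (that is the first line after the reboot in the log above), it assumes the usual 'Mon DD HH:MM:SS' syslog timestamp, and since those timestamps carry no year it takes one as a parameter (2019 here, going by the kernel build date in the earlier log). The function names are made up.

```python
from datetime import datetime

# Assumption: this line is the first thing logged after each boot.
BOOT_MARKER = "syslog-ng starting up"


def parse_stamp(line, year):
    """Parse the leading 'Oct 22 20:11:54' style timestamp of a syslog line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_boot_gaps(lines, year=2019):
    """Return (last_line_before_boot, boot_line, gap) for every boot marker found."""
    gaps = []
    previous = None  # (timestamp, text) of the most recent parsable line
    for line in lines:
        line = line.rstrip("\n")
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if BOOT_MARKER in line and previous is not None:
            gaps.append((previous[1], line, stamp - previous[0]))
        previous = (stamp, line)
    return gaps


sample = [
    "Oct 22 20:11:54 nas ql1: qla_replenish_jumbo_rx: qla_get_mbuf [1,(1521),(573)] failed",
    "Oct 22 21:12:11 nas syslog-ng[2664]: syslog-ng starting up; version='3.20.1'",
]

for last_line, boot_line, gap in find_boot_gaps(sample):
    print(f"boot at: {boot_line[:15]}")
    print(f"last line before it: {last_line}")
    print(f"silent for: {gap}")
```

To run it against real logs, pass the lines of a saved copy of /var/log/messages (for example `find_boot_gaps(open(path))`). A long silent gap before a boot, like the hour here, suggests the last logged error was not the immediate trigger.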
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
It's been 5.5 days with the iSCSI NIC's MTU set to 1500, no reboots. The other two NICs' MTU is still set to 9000.

So far so good, and there were moments when the iSCSI connection was hit hard - when I tried to start multiple VMs at once. I'm still guessing iSCSI under certain conditions will reboot the NAS, but somehow lowering the MTU prevents that condition from happening.

Hopefully this can help people identify what is wrong, or give me some hints on how to troubleshoot. I'm not in a hurry to bump the MTU back up, as I honestly can't perceive any difference.
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
All good until I updated to U7. Since then FreeNAS has rebooted twice.

Don't think BSD is playing nice with my 10G NIC. Will replace this next year. For now I hope to roll back to U6 and have everything running stable for another 2+ months...
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Just a quick update on this. I got tired of the 'unscheduled system reboots', and of those reboots corrupting the ADDC that is running on it, so I rebuilt the machine with 11.1U7 using all the same settings, including 'mtu 9000' on the 10Gb NICs. So far it's been up for 16 days. Before rolling back to 11.1U7, it almost never made it more than 7 or 8 days between 'unscheduled system reboots', and at a few points it was less than 24 hours between reboots, so fingers crossed... If this continues and the system stays solid, then it would seem that SOMETHING about 11.2 combined with either the Broadcom NICs or the MTU (or both), or perhaps the 'no ping reply (NOP-Out) after 5 seconds; dropping connection' messages being generated, was causing some sort of kernel panic and reboot.
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
SubnetMask said:
"Just a quick update on this - I got tired of the 'unscheduled system reboots', as well as the unscheduled reboots corrupting the ADDC that is running on it, and ended up rebuilding the machine with 11.1U7, with all the same settings, including 'mtu 9000' on the 10Gb NICs, and so far, it's been up for 16 days - prior to rolling it back to 11.1U7, it almost never made it more than 7 or 8 days between 'unscheduled system reboots' - there were a few points that it was less than 24 hours between reboots, so fingers crossed... If this continues and the system is solid, then it would seem that SOMETHING about 11.2 and either the Broadcom NICs or the MTU (or both), or perhaps the 'no ping reply (NOP-Out) after 5 seconds; dropping connection' messages that are being generated was causing some sort of kernel panic and reboot."

In my case, the NAS will reboot if there is a high network load on the interface with jumbo frames on (9000). For example, trying to boot up multiple VMs on an iSCSI share, or copying multi-GB files to an SMB share, will trigger a reboot.

With MTU at 1500 the NAS is stable, even under loads. MTU set to 9000, it'll reboot.
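
As a side note for anyone comparing 1500 and 9000: before blaming or clearing jumbo frames, it's worth confirming the whole path really passes them, by pinging with the don't-fragment option and the largest payload that fits in one packet. The arithmetic is just the MTU minus the IPv4 and ICMP headers. A tiny sketch, assuming plain IPv4 with no IP options (the function name is made up):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption: no options in use)
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return payload


for mtu in (1500, 9000):
    print(f"MTU {mtu}: largest unfragmented ping payload is {max_ping_payload(mtu)} bytes")
```

If a don't-fragment ping of that size fails between the NAS and the hypervisor while a smaller one works, something in between (NIC, switch port, or the other host) is not actually set up for 9000.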

I swapped out my NIC and replaced it with a Chelsio over the weekend. I don't see any NOP-Out messages, and I stress tested the network with no reboots so far. At times pinging the remote IP took a while - 500ms!! But eventually everything runs OK.

It's only been 2 days, so it's probably too early to call. At this stage things look promising... It's also easy for me to know when the NAS has rebooted, as my VMs are stored on an iSCSI volume and the VMs will all fail when the NAS reboots.

Be good to find out what's the main cause of this problem, just to satisfy my curiosity...
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
I really wish there was some way to pinpoint the cause instead of just speculating. I was never able to find anything useful in anything I looked at. It seems almost certain that it's something with 11.2, but exactly what is pure speculation. As I said earlier, my own FreeNAS box is running 11.1 and has been rock solid. The 11.2 machine that was rebooting until I reloaded it with 11.1 was running essentially the same config as mine: jumbo frames, the same Broadcom 10G NICs, and iSCSI volumes provisioned to VMware. The FreeNAS box that was rebooting has nearly no load on it. Its load average is 0.08, 0.09, 0.08, while mine carries a lot more at 5.98, 6.46, 7.61. It's got five VMs running, none of which are all that busy, and the reboots came at random times - sometimes middle of the day, sometimes middle of the night, weekday, weekend - so I don't think load plays into the reboots I was seeing.
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
Suspect your case is different from mine then. Hardware is pretty unpredictable I guess. And so far only 2 of us have this issue. Probably some underlying FreeBSD problem and not FreeNAS itself per se. Could even be the switch (I'm using Unifi).

I just couldn't find the right logs to pinpoint what the problem is.

For me it's HP NIC + high network load + jumbo frames = high probability of reboot. Note it's network load, not CPU. The HP NIC works fine at 1500, but is problematic at 9000. The Chelsio works fine at 9000 (so far).

Hopefully more can find their way here from Google. Need more data points.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
So just another update on this, probably the last one. It's now been 55 days since I 'rolled back' to 11.1U7, and it hasn't even hiccuped since then. It's been up solid all 55 days, which is the longest it has stayed up so far. So I think that rules out an actual hardware issue, and it points more to some sort of 11.2 issue, maybe with the Broadcom NICs, maybe jumbo frames, maybe some combination of the two, maybe some other seemingly unrelated thing. I really wish there was some way we could have determined what was ACTUALLY causing it, but at least in my case, rolling back to 11.1U7 solved the issue.
 

mattgorecki

Cadet
Joined
Feb 14, 2020
Messages
1
I'm experiencing the same reboots, although sometimes they are complete freezes. Initially, this server had a Myricom 10G card. It would reboot every few weeks until the card stopped being recognized altogether. I swapped in a Mellanox ConnectX3 10G card back in December and it ran fine until this past Tuesday, when it hard locked. I found nothing obvious in the log files and updated to 11.3. It hard locked again last night, but I also had several emails reporting about a dozen unscheduled reboots, one every hour or so.
 

agent_kith

Dabbler
Joined
Jan 2, 2014
Messages
15
Are you using iSCSI? Using jumbo frames? My system never hard froze before. Is your issue heat related?

So far so good with the Chelsio. It has been running for 19 days (since the upgrade to 11.3).

As SubnetMask said, it's probably hardware related. I just can't see a pattern yet. The common thing is we are all using 10G NICs.

If you can, please supply the chipset of the card, and the BSD network driver.
 

dsilva

Cadet
Joined
Mar 26, 2020
Messages
2
Same problem here.

Reboots on high load.
iSCSI, MTU 9000, HP 10G NIC.

Just upgraded to 11.3.
 

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
Hi All

I have been seeing the same on a brand new Supermicro with 10GbE, iSCSI, 9000 MTU.

FreeNAS 11.3 U1.

I just upgraded to U2 (if that's the latest one) and the reboots have stopped so far.

Initially I thought it was a hardware issue with the RAM or motherboard, but I'm not sure now.

It has been up for 2 days now. It used to reboot every few hours with no load on it, just idle to VMware.

An identical server (we purchased it at the same time), apart from the drives, is on base 11.3 with no updates and hasn't skipped a beat. It also has iSCSI with jumbo frames set to 9000 MTU.

This is looking like a similar issue.

I will cross check the NICs shortly and report back.

I think they are Intel but they could have a Broadcom chip. Will post back.

ta
 
Joined
Apr 21, 2020
Messages
3
I'm encountering the same issue on FreeNAS-11.3-U2 on a Dell R710 with a Mellanox ConnectX-3 10GbE card at 1500 MTU. I've now seen two expirations of watchdog2 resulting in a system reset spaced days apart, near minimal load times for the system, running only NFS.

Nothing exists in /data/crash, nothing was logged prior to the lockup, and the system has otherwise been stable for years before moving its role to FreeNAS and adding the ConnectX-3. All RAM was swapped between the incidents, so RAM is an unlikely cause.

Presuming we're all experiencing the same issue, this rules out:

- 11.3 U1 vs. 11.3 U2
- 9k MTU
- NIC Model
- iSCSI
- Bad memory

Has anyone with more frequent crashes and easier access to their system tried enabling the debug kernel, disabling watchdogs via ipmitool, and seeing if there's a panic message visible on serial console when this occurs?
 