Unauthorized system reboot - Kernel Panic - 11.1-U4


Brentnall

Dabbler
FreeNAS 11.1-U4

Thought I would post this here in the hope that someone might make greater sense of it than me.

I just received an email from one of my FreeNAS boxes advising that an "Unauthorized system reboot" had taken place.

This is a fairly long-running system (years) with no previous incidents.

Looking at the messages log below, it clearly seems to be an issue with one of the three LSI 9211-8i HBA cards in the box.

Updating the firmware to the latest version is of course the first port of call, and I will be doing that as soon as possible.

Any ideas if that'll likely prevent it happening again, or could this be hardware that is on its way out and needs replacing?
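
For reference, checking the current firmware on each card from the FreeNAS shell should look something like this (a sketch using the LSI sas2flash utility; the adapter index is just an example):

Code:
sas2flash -listall      # one line per adapter, including its firmware version
sas2flash -c 1 -list    # full details for a single adapter (here adapter 1)
dmesg | grep -i mps     # firmware/driver pairs as reported at boot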

Jun 23 21:50:51 roflnas syslog-ng[1918]: syslog-ng starting up; version='3.7.3'
Jun 23 21:50:51 roflnas mps2: IOC Fault 0x40002651, Resetting
Jun 23 21:50:51 roflnas mps2: Reinitializing controller,
Jun 23 21:50:51 roflnas mps2: Firmware: 20.00.04.00, Driver: 21.02.00.00-fbsd
Jun 23 21:50:51 roflnas mps2: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Jun 23 21:50:51 roflnas swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577258, size: 24576
Jun 23 21:50:51 roflnas swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577258, size: 24576
Jun 23 21:50:51 roflnas swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577258, size: 24576
Jun 23 21:50:51 roflnas mps2: Calling Reinit from mps_wait_command, timeout=60, elapsed=61
Jun 23 21:50:51 roflnas mps2: Reinitializing controller,
Jun 23 21:50:51 roflnas mps2: Firmware: 20.00.04.00, Driver: 21.02.00.00-fbsd
Jun 23 21:50:51 roflnas mps2: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Jun 23 21:50:51 roflnas swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577258, size: 24576
Jun 23 21:50:51 roflnas swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577258, size: 24576
Jun 23 21:50:51 roflnas swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577258, size: 24576
Jun 23 21:50:51 roflnas mps2: Calling Reinit from mps_wait_command, timeout=60, elapsed=61
Jun 23 21:50:51 roflnas mps2: Reinitializing controller,
Jun 23 21:50:51 roflnas mps2: Firmware: 20.00.04.00, Driver: 21.02.00.00-fbsd
Jun 23 21:50:51 roflnas mps2: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Jun 23 21:50:51 roflnas swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577258, size: 24576
Jun 23 21:50:51 roflnas panic: solaris assert: illumos_nvlist_pack(nvl, &packed, sizep, 0, 0x0002) == 0 (0xe == 0x0), file: /freenas-11-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/common/nvpair/opensolaris_fnvpair.c, line: 84
Jun 23 21:50:51 roflnas cpuid = 2
Jun 23 21:50:51 roflnas KDB: stack backtrace:
Jun 23 21:50:51 roflnas db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0566ec65e0
Jun 23 21:50:51 roflnas vpanic() at vpanic+0x186/frame 0xfffffe0566ec6660
Jun 23 21:50:51 roflnas panic() at panic+0x43/frame 0xfffffe0566ec66c0
Jun 23 21:50:51 roflnas assfail3() at assfail3+0x2c/frame 0xfffffe0566ec66e0
Jun 23 21:50:51 roflnas fnvlist_pack() at fnvlist_pack+0x50/frame 0xfffffe0566ec6700
Jun 23 21:50:51 roflnas zfs_ioc_pool_stats() at zfs_ioc_pool_stats+0x52/frame 0xfffffe0566ec6740
Jun 23 21:50:51 roflnas zfsdev_ioctl() at zfsdev_ioctl+0x6d5/frame 0xfffffe0566ec67e0
Jun 23 21:50:51 roflnas devfs_ioctl_f() at devfs_ioctl_f+0x128/frame 0xfffffe0566ec6840
Jun 23 21:50:51 roflnas kern_ioctl() at kern_ioctl+0x281/frame 0xfffffe0566ec68b0
Jun 23 21:50:51 roflnas sys_ioctl() at sys_ioctl+0x15c/frame 0xfffffe0566ec6980
Jun 23 21:50:51 roflnas amd64_syscall() at amd64_syscall+0xa4a/frame 0xfffffe0566ec6ab0
Jun 23 21:50:51 roflnas Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0566ec6ab0
Jun 23 21:50:51 roflnas --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80196722a, rsp = 0x7fffdd7e6278, rbp = 0x7fffdd7e62f0 ---
Jun 23 21:50:51 roflnas KDB: enter: panic


Hardware-wise, this is a VM running on ESXi 6.7 with the three LSI cards passed through using VT-d.

There are three RAIDZ2 pools in the box, each comprising 8 drives (one vdev per pool), one pool per HBA. All WD Red drives.

Full specs are in my signature. 20 GB of RAM is assigned to the FreeNAS VM and fully reserved. No dedupe or any other crazy config.

I have a copy of the /var/log directory taken within 15 minutes of the reboot, in case it's helpful.

Many thanks for any insight you can share!
 

Spearfoot

He of the long foot
Moderator
Brentnall said:
FreeNAS 11.1-U4
I just received an email from one of my FreeNAS boxes advising that an "Unauthorized system reboot" had taken place. This is a fairly long-running system (years) with no previous incidents. Looking at the messages log, it clearly seems to be an issue with one of the LSI 9211-8i HBA cards. Updating the firmware is of course the first port of call, as it's a major revision behind the driver version. Any ideas if that'll likely prevent it happening again, or could this be hardware that is on its way out and needs replacing? [...]
Hardware-wise this is a VM running on ESXi with the three LSI cards passed through using VT-d. Specs in sig. 20 GB RAM assigned to the FreeNAS VM, no dedupe or any other crazy config.
Odd that your system would reboot after years of service. In any case, there's quite a bit of material here on running FreeNAS as a VM under ESXi.

What version of ESXi are you using?

When you mention updating your firmware, I assume you mean updating the LSI HBA firmware to P20.00.07.00 IT mode?
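
If so, for reference, the flash itself from the FreeNAS shell usually looks something like the sketch below. The exact .bin/.rom file names come from the P20.00.07.00 package downloaded from Broadcom, and with three cards you'd repeat the flash per controller; back up your config first.

Code:
sas2flash -o -f 2118it.bin -b mptsas2.rom        # flash IT firmware (plus boot ROM) on the first adapter
sas2flash -c 1 -o -f 2118it.bin -b mptsas2.rom   # use -c N to target each of the other cards
sas2flash -listall                               # verify all cards now report 20.00.07.00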

One bit of advice I will offer is to make sure you reserve all of the FreeNAS guest VM's memory.

You'll have to post quite a bit more detailed information about your hardware and pool setup before we can offer more targeted advice.

Good luck!
 

Brentnall

Dabbler
Many thanks for the reply, Spearfoot!
Apologies for the missing info; I'll update the original post with those details.

ESXi version is 6.7
The memory for the FreeNAS VM is indeed fully reserved and always has been.

There are three RAIDZ2 pools in the box, each comprising 8 drives (one vdev per pool), one pool per HBA.
8 x 8TB RAIDZ2 | 8 x 4TB RAIDZ2 | 8 x 4TB RAIDZ2
All WD Red drives.

Unfortunately, beyond tagging the faulting controller as mps2, the lines from the messages log don't seem to provide any hints as to which disks or pool it was having problems with; that would be a good start.
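
Mapping mps2 back to a pool should be doable from the shell, though. Something like this (a sketch) would tie the gptid labels from zpool status to daX devices, and those devices to their controller:

Code:
glabel status | grep gptid    # gptid label -> daX device for every pool member
camcontrol devlist -v         # daX devices grouped under their scbusN / mps unit

Whichever pool has all of its members sitting under mps2 would be the one behind the fault.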

With regard to the LSI HBA firmware, honestly I hadn't checked what is available yet and just assumed that something newer was out since these were last updated. I also assumed that the driver version numbers were roughly mirroring the firmware versions but that doesn't seem to be the case anymore.

What other hardware info are you looking for that isn't already in my signature?
 

Spearfoot

He of the long foot
Moderator
Brentnall said:
ESXi version is 6.7. The memory for the FreeNAS VM is indeed fully reserved and always has been. [...] What other hardware info are you looking for that isn't already in my signature?
Well, everything mentioned in the forum rules:

https://forums.freenas.org/index.php?threads/updated-forum-rules-4-11-17.45124/

How often do you scrub the three pools? The output from zpool status enclosed in code blocks will show quite a bit of info about your pools.

How often do you run the short and long SMART tests on your drives? Are all of your drives healthy?

You said that you've been running this system for years, but version 11.1-U4 was only released a few months ago... did the unexpected reboot take place immediately after upgrading to this version? Or had it instead been running for days or weeks before rebooting? Perhaps your system simply isn't stable under the latest version of FreeNAS and you'll need to revert? Doesn't seem likely given what we know about your hardware, but "stranger things have been known to happen".

Have you made any hardware changes recently?

Details like these often provide a clue. It's hard to provide too much information! :)
 

Chris Moore

Hall of Famer
20 GB of RAM is assigned to the FreeNAS VM and fully reserved. No dedupe or any other crazy config.
Not enough memory assigned to FreeNAS.
 

Brentnall

Dabbler
Not enough memory assigned to FreeNAS.

It's a fair point to raise with this quantity of storage.

That said, I don't run VMware datastores or anything that hits FreeNAS hard on this box. No dedupe, Plugins, Jails or VMs running within FreeNAS.
95% of the workload is media streaming or backup storage.
Scrubs and backups are all well spaced out and scheduled to start in the early hours of the morning, which doesn't match up with the timeframe of the panic.
Basically, the system was almost idle when the panic occurred, save for a few low-bitrate media streams being read from the pools in small chunks.

I'll go ahead and push the RAM assignment up at the same time as upgrading the firmware on the cards, and then monitor for any recurrence.

How often do you scrub the three pools? The output from zpool status enclosed in code blocks will show quite a bit of info about your pools.

How often do you run the short and long SMART tests on your drives? Are all of your drives healthy?

The pools are scrubbed monthly and I've never had any problems or errors on that front. zpool status output is below.

The drives are all currently healthy. SMART tests are configured as a short test every 2 days and a long test once a month.

Since the system came back up I have manually started long SMART tests on all drives, and I intend to scrub all three pools manually too, assuming the SMART tests all come back clean.
I'll post if anything comes up from these tests, but I don't expect anything to.
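
For completeness, this is roughly what I'm running by hand (the device node is an example; the pool name is one of mine):

Code:
smartctl -t long /dev/da1        # start a long self-test on one drive (repeated per drive)
smartctl -l selftest /dev/da1    # read the self-test log once it finishes
zpool scrub MOVIES1_4TBx8RZ2     # then scrub each pool in turn
zpool status MOVIES1_4TBx8RZ2    # check scrub progress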


You said that you've been running this system for years, but version 11.1-U4 was only released a few months ago... did the unexpected reboot take place immediately after upgrading to this version? Or had it instead been running for days or weeks before rebooting? Perhaps your system simply isn't stable under the latest version of FreeNAS and you'll need to revert? Doesn't seem likely given what we know about your hardware, but "stranger things have been known to happen".

Have you made any hardware changes recently?

Hardware-wise, the only change within the last year was the NIC cards. These are configured an abstraction layer below FreeNAS, however, at the ESXi level, and are not passed through to the FreeNAS VM using VT-d.
To be specific I had a failing Intel PRO/1000 PT Quad Port card that suddenly started suffering from link drops and unstable connectivity towards the end of May. Swapping this out for a new one of the same model has fully resolved those issues.
At the same time I also added a Mellanox ConnectX-3 MCX312A-XCBT 10GbE card.
  • Intel PT Quad card is a PCI-E 1.0 x4 device, sitting in a PCI-E 2.0 x4 slot.
  • 10GbE card is PCI-E 3.0 x8, sitting in a PCI-E 3.0 x16 slot.
  • The LSI 9211 cards are PCI-E 2.0 x8; two are sitting in PCI-E 3.0 x8 slots and the other in a PCI-E 3.0 x4 slot (the card will negotiate PCI-E 2.0 x4 there, roughly 2 GB/s, still ample for eight spinning disks).
  • The final slot on the board is a PCI-E 2.0 x2 which is empty/not in use.
I cannot find any evidence that this Supermicro X10SRI-F board will drop the speed of any of these ports when others are in use.
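
For what it's worth, on bare-metal FreeBSD the negotiated link for each card can be read from its PCIe capability with pciconf; under VT-d passthrough the guest may only see a virtualized link, so I'd treat this as a sanity check rather than proof:

Code:
pciconf -lc mps2    # the PCI-Express capability line shows e.g. "link x4(x8)", i.e. negotiated(max) width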


Over on the software side of things, you're right: things have been changing here fairly regularly. I usually upgrade FreeNAS versions within a month of release. U4 was applied on the 25th May.
ESXi I usually only keep updated on major version releases. ESXi 6.7 was released in the middle of April and was applied to this system very soon thereafter.


So all in all, I would agree there have actually been some significant changes over the last 2 months... certainly compared to the year prior.
That said, the last time the system was touched was on the 25th May, to upgrade FreeNAS to U4, so it was running fine for a month without issues.


As you can see, given the workload it's not a super-critical server, and one 10-minute period of downtime as a result of a panic is not the end of the world.
I just wanted to post here in case anyone recognised the errors as something specific, such as one of the HBA cards slowly dying.
If that's not the case, then I figure that without a way of reproducing the panic it's unlikely we'll be able to pin this down with much precision.


Code:
  pool: MOVIES1_4TBx8RZ2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 09:03:10 with 0 errors on Sun Jun 10 10:03:14 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		MOVIES1_4TBx8RZ2								ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/40222b80-4d62-11e5-9174-000c2923118c  ONLINE	   0	 0	 0
			gptid/410a4c14-4d62-11e5-9174-000c2923118c  ONLINE	   0	 0	 0
			gptid/41e38737-4d62-11e5-9174-000c2923118c  ONLINE	   0	 0	 0
			gptid/42c4d1a0-4d62-11e5-9174-000c2923118c  ONLINE	   0	 0	 0
			gptid/439fd623-4d62-11e5-9174-000c2923118c  ONLINE	   0	 0	 0
			gptid/c8c703c2-4f45-11e5-a98c-000c2923118c  ONLINE	   0	 0	 0
			gptid/4556a44c-4d62-11e5-9174-000c2923118c  ONLINE	   0	 0	 0
			gptid/46368bfa-4d62-11e5-9174-000c2923118c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: REPO4TBx8RZ2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 07:02:23 with 0 errors on Sat Jun  9 08:02:26 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		REPO4TBx8RZ2									ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/4fe52da9-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0
			gptid/51343461-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0
			gptid/528e366e-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0
			gptid/53e4d71f-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0
			gptid/553a964e-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0
			gptid/569613d5-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0
			gptid/57ec8256-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0
			gptid/5944941f-60ce-11e7-a014-000c2923118c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: TV_8TBx8RZ2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 08:47:01 with 0 errors on Tue Jun  5 09:47:03 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		TV_8TBx8RZ2									 ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/6363ca5d-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0
			gptid/6406d024-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0
			gptid/64a7f82b-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0
			gptid/6550309a-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0
			gptid/65f8b9f9-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0
			gptid/66a0051e-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0
			gptid/67474df3-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0
			gptid/67f226d0-5b69-11e7-b619-000c2923118c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:18 with 0 errors on Fri Jun 22 03:45:18 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da0p2	 ONLINE	   0	 0	 0

errors: No known data errors
 

toadman

Guru
If you are running SMB or SNMP, there is a memory leak in 11.1-U4. SMB was fixed in 11.1-U5, but SNMP is still broken in that version.
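
If it is the SMB leak, a crude way to spot it is to watch smbd's resident size over a few days; something like this (a sketch):

Code:
ps aux | grep '[s]mbd'    # note the RSS column; re-check daily and watch for steady growth
top -o res                # or sort live processes by resident memory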
 

Brentnall

Dabbler
Thanks toadman, I was not aware of the SMB memory leak and hadn't read the U5 changelog in detail yet.

I stick to SMTP and an external syslog server for alerting, so I do not have SNMP enabled.
SMB, however, is definitely in use on the box!

Have been meaning to upgrade it to U5 anyway, will do so as soon as possible.
 