Kernel Panic for Unknown Reasons

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
Our Setup
Hardware

  • 45 Drives Turbo 60 XL
  • 60 SATA drives (six 10-disk RAIDZ2 vdevs)
  • 8 cache drives
  • 2 SSDs for config
  • 1 flash drive with backup config and cache
Software
  • FreeNAS 9.10.2 U1

BACKGROUND
We have a box with 60 hard drives running FreeNAS 9.10.2 U1 that has been working fine for a long time. Over the last few months, we noticed the Active Directory connection would fall out of sync, but reconnecting it fixed the issue. I have a backup of the config, taken on Saturday when the server was still working and accessible.


BEHAVIOR WE'RE EXPERIENCING
We were seeing source code in the status log in the GUI. We rebooted the box and everything appeared fine; shares were accessible, etc. We had somebody connect in to check our firmware, drivers, and the install in general, and the server stopped responding after they queried the firmware of the HBA cards. When we rebooted, it would go to a kernel panic.


STEPS TAKEN
  1. Rebooted again; the same symptoms persisted
  2. Unplugged one of the SATA boot drives (to leave it intact)
  3. Installed a new copy of FreeNAS using the 9.10.2 U1 ISO
  4. Configured the IP address and logged into the GUI
  5. Imported the volume
  6. Applied the backup config from Saturday
  7. System kernel panics

KERNEL PANIC MESSAGE
Panic: solaris assert: offset + size <= sm->sm_start + sm->sm_size (0x64060634802000 <= 0x230000000000), file: /freenas-9.10-releng/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line 119
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe201cde9560
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe201cde9610
vpanic() at vpanic+0x126/frame 0xfffffe201cde9650
panic() at panic+0x43/frame 0xfffffe201cde96b0
assfail3() at assfail3+0x43/frame 0xfffffe201cde96d0
space_map_load() at space_map_load+0x372/frame 0xfffffe201cde9750
metaslab_load() at metaslab_load+0x2e/frame 0xfffffe201cde9770
metaslab_alloc() at metaslab_alloc+0x857/frame 0xfffffe201cde98c0
zio_dva_allocate() at zio_dva_allocate+0x85/frame 0xfffffe201cde9990
zio_execute() at zio_execute+0x111/frame 0xfffffe201cde99f0
taskqueue_run_locked() at taskqueue_run_locked+0xe5/frame 0xfffffe201cde9a40
taskqueue_thread_loop() at taskqueue_thread_loop+0xa8/frame 0xfffffe201cde9a70
fork_exit() at fork_exit+0x9a/frame 0xfffffe201cde9ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe201cde9ab0
--- trap 0, rip = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 0 tid 101016 ]
Stopped at kdb_enter+0x3e: movq $0, kdb_why
db>
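
For reference, my (non-expert) read of that assert: space_map.c line 119 is a bounds check that fires when a space map entry claims space past the end of its metaslab, and the backtrace shows it being hit from metaslab_alloc(), i.e. the allocation/write path, which would explain why even a plain import panics. If that's right, a read-only import from the console should avoid the allocation path entirely. A rough sketch, untested on this box (tank is our pool, mounted under an altroot):

Code:
# read-only import avoids new block allocations, which is the code path that hit the assert
zpool import -o readonly=on -R /mnt -f tank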
 

dlavigne

Guest
Was the fresh install to the same boot device or a different one? If the same, does a fresh install to a new boot device resolve the issue?
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
It was to the same boot device. I just installed a new SSD and am attempting to install to that now
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
Tried it on the new SSD. Everything installed fine; I then logged into the GUI and did an import volume. Our two volumes showed up, "backup" and "tank". I selected tank and clicked import, and after several moments the server rebooted itself.
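
For what it's worth, my understanding is that running zpool import from the console with no pool name only scans the disk labels and lists the pools it can see, without actually importing anything, so it should be a low-risk way to confirm both pools are visible and what state they report before touching the GUI again:

Code:
# list importable pools and their reported state without importing them
zpool import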
 

Hoeser

Dabbler
Joined
Sep 23, 2016
Messages
23
I strongly suspect a dying HBA, based on the fact that querying the HBAs resulted in a reboot of the system.

Personally, I'd shut down the 60-drive system and test the HBAs separately (one at a time) in another chassis to rule out the source system.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

I strongly suspect a dying HBA, based on the fact that querying the HBAs resulted in a reboot of the system.

Personally, I'd shut down the 60-drive system and test the HBAs separately (one at a time) in another chassis to rule out the source system.
Not that I disagree, but you should ask what the hardware is. The "45 Drives" Storinator XL60 Turbo we have where I work came with two of the HighPoint Rocket 750 cards like this:
https://www.amazon.com/High-Point-Rocket-750-PCI-Express/dp/B00C7JNPSQ
That used to be the default configuration, but I see they are not even offering that any more.
https://www.45drives.com/products/s...XL60-03&code=XL&software=Default&type=storage
You should ask what hardware is being used first, instead of assuming...
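
If the box still boots to a shell, the quickest way to answer that from the FreeNAS console is probably to read the vendor/device strings off the PCI list rather than guess from the spec sheet; something like:

Code:
# list PCI devices with vendor/device strings; the storage controllers
# (LSI/Avago vs. HighPoint) show up here by name
pciconf -lv
# list the attached disks and which bus/driver instance they sit behind
camcontrol devlist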
 

Hoeser

Dabbler
Joined
Sep 23, 2016
Messages
23
Not that I disagree, but you should ask what the hardware is. The "45 Drives" Storinator XL60 Turbo we have where I work came with two of the HighPoint Rocket 750 cards like this:
https://www.amazon.com/High-Point-Rocket-750-PCI-Express/dp/B00C7JNPSQ
That used to be the default configuration, but I see they are not even offering that any more.
https://www.45drives.com/products/s...XL60-03&code=XL&software=Default&type=storage
You should ask what hardware is being used first, instead of assuming...

Pump the brakes, Chris. I googled it, and according to Storinator that particular model is claimed to come with 4 x LSI 9305 cards... which is why I didn't ask. Also, the OP stated "HBA cards" - plural - so I figured my research was correct enough to make my recommendation.
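
If they really are the 9305s, the firmware version the mpr(4) driver found should already be sitting in the boot messages, so it can be read back without poking the cards again (which is apparently what hung the box); roughly:

Code:
# the mpr(4) driver logs the controller and its firmware version at attach time
dmesg | grep -i mpr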
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Pump the brakes, Chris. I googled it, and according to Storinator that particular model is claimed to come with 4 x LSI 9305 cards... which is why I didn't ask. Also, the OP stated "HBA cards" - plural - so I figured my research was correct enough to make my recommendation.
Please don't take it as a criticism of you. I am trying to share personal experience to help everyone involved.

Just a little more than ONE year ago, the standard configuration of that system was a pair of the Rocket 750 cards, with 32 drives connected to one and 28 drives connected to the other. I have two of those systems in my server room right now. Hate them, by the way. One problem with those Rocket 750 cards is that some people might mistakenly call them HBAs, which they aren't. Looking at the vendor page now, which I also did, you can't even order that configuration any more.
It is always best to ask what the hardware is because it is better to know than to guess.

I agree that the drive controller is likely the problem; the question in my mind is which drive controller it is.
 

Rowlf

Cadet
Joined
Aug 9, 2019
Messages
2
Hi,

Anything new on this topic? I'm running into the same error with a much simpler machine ;(

Code:
root@freenas[~]# zdb -e -c storage_01

Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 113 of 116 ...Assertion failed: entry_offset < sm->sm_start + sm->sm_size (0x800e200525e000 < 0xe4000000000), file /freenas-11-nightlies/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line 155.
zsh: abort (core dumped)  zdb -e -c storage_01
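
That looks like the same family of space map bounds assertion as the panic earlier in this thread, just hit from userland zdb instead of the kernel. If your zdb build supports the -A flags, re-running with assertions demoted should let the traversal continue past it (treat the output with suspicion, obviously); roughly:

Code:
# -AAA asks zdb not to abort when an assertion fails, so the leak/checksum pass can keep going
zdb -e -AAA -c storage_01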
 

dlavigne

Guest
Hi,

Anything new on this topic? I'm running into the same error with a much simpler machine ;(

Code:
root@freenas[~]# zdb -e -c storage_01

Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 113 of 116 ...Assertion failed: entry_offset < sm->sm_start + sm->sm_size (0x800e200525e000 < 0xe4000000000), file /freenas-11-nightlies/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line 155.
zsh: abort (core dumped)  zdb -e -c storage_01

Hard to tell without an analysis of the system. Have you tried creating a report at bugs.ixsystems.com that includes your debug?
 

Rowlf

Cadet
Joined
Aug 9, 2019
Messages
2
Hard to tell without an analysis of the system. Have you tried creating a report at bugs.ixsystems.com that includes your debug?
I was able to import the damaged pool in read-only mode and transfer the data. It seems this was caused by incompatible memory in the first place, which corrupted one pool. After changing the memory and re-creating the pools, everything works fine. Thanks for asking!
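
In case it helps anyone who lands here later, the recovery was essentially a read-only import followed by a file-level copy (you can't create new snapshots on a read-only pool, so zfs send only works from snapshots that already existed). The steps look roughly like this, with the destination path being just an example:

Code:
# read-only import of the damaged pool under an altroot, then copy the data off
zpool import -o readonly=on -R /mnt -f storage_01
rsync -a /mnt/storage_01/ /mnt/backup/rescue/   # destination is hypothetical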
 