Kernel Panic for Unknown Reasons

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
Our Setup
Hardware

  • 45 Drives Turbo 60 XL
  • 60 SATA drives (six 10-disk RAIDZ2 vdevs)
  • 8 cache drives
  • 2 SSDs for config
  • 1 flash drive with backup config and cache
Software
  • FreeNAS 9.10.2 U1

BACKGROUND
We have a box with 60 hard drives running FreeNAS 9.10.2 U1 that has been working fine for a long time. Over the last few months, we noticed the Active Directory connection would fall out of sync, but reconnecting it fixed the issue. I have a backup of the config, taken on Saturday when the server was still working and accessible.


BEHAVIOR WE'RE EXPERIENCING
We were seeing source code in the status log in the GUI. We rebooted the box and everything appeared fine; shares were accessible, etc. We had somebody connect in to check our firmware, drivers, and the install in general, and the server stopped responding after they queried the firmware of the HBA cards. When we rebooted, it would go to a kernel panic.


STEPS TAKEN
  1. Rebooted again; the same symptoms persisted
  2. Unplugged one of the SATA boot drives (to leave it intact)
  3. Installed a new copy of FreeNAS using the 9.10.2 U1 ISO
  4. Configured the IP address and logged into the GUI
  5. Imported the volume
  6. Applied the backup config from Saturday
  7. System kernel panics

KERNEL PANIC MESSAGE
Panic: solaris assert: offset + size <= sm->sm_start + sm->sm_size (0x64060634802000 <= 0x230000000000), file: /freenas-9.10-releng/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line 119
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe201cde9560
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe201cde9610
vpanic() at vpanic+0x126/frame 0xfffffe201cde9650
panic() at panic+0x43/frame 0xfffffe201cde96b0
assfail3() at assfail3+0x43/frame 0xfffffe201cde96d0
space_map_load() at space_map_load+0x372/frame 0xfffffe201cde9750
metaslab_load() at metaslab_load+0x2e/frame 0xfffffe201cde9770
metaslab_alloc() at metaslab_alloc+0x857/frame 0xfffffe201cde98c0
zio_dva_allocate() at zio_dva_allocate+0x85/frame 0xfffffe201cde9990
zio_execute() at zio_execute+0x111/frame 0xfffffe201cde99f0
taskqueue_run_locked() at taskqueue_run_locked+0xe5/frame 0xfffffe201cde9a40
taskqueue_thread_loop() at taskqueue_thread_loop+0xa8/frame 0xfffffe201cde9a70
fork_exit() at fork_exit+0x9a/frame 0xfffffe201cde9ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe201cde9ab0
--- trap 0, rip = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 0 tid 101016 ]
Stopped at kdb_enter+0x3e: movq $0, kdb_why
db>
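
For reference, my (non-expert) read of that assert: space_map.c line 119 is a bounds check that fires when a space map entry claims space past the end of its metaslab, and the backtrace shows it being hit from metaslab_alloc(), i.e. the allocation/write path, which would explain why even a plain import panics. If that's right, a read-only import from the console should avoid the allocation path entirely. A rough sketch, untested on this box (tank is our pool, mounted under an altroot):

Code:
# read-only import avoids new block allocations, which is the code path that hit the assert
zpool import -o readonly=on -R /mnt -f tank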
 

dlavigne

Guest
Was the fresh install to the same boot device or a different one? If the same, does a fresh install to a new boot device resolve the issue?
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
It was to the same boot device. I just installed a new SSD and am attempting to install to that now
 

HelloWill

Dabbler
Joined
May 3, 2016
Messages
20
Tried it on the new SSD. Everything installed fine; I then logged into the GUI and did an import volume. Our two volumes showed up, "backup" and "tank". I selected tank and clicked import, and after several moments the server rebooted itself.
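
For what it's worth, my understanding is that running zpool import from the console with no pool name only scans the disk labels and lists the pools it can see, without actually importing anything, so it should be a low-risk way to confirm both pools are visible and what state they report before touching the GUI again:

Code:
# list importable pools and their reported state without importing them
zpool import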
 

Hoeser

Dabbler
Joined
Sep 23, 2016
Messages
23
I strongly suspect a dying HBA, based on the fact that querying the HBAs resulted in a reboot of the system.

Personally, I'd shut down the 60-drive system and test the HBAs separately (one at a time) in another chassis to rule out the source system.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

I strongly suspect a dying HBA, based on the fact that querying the HBAs resulted in a reboot of the system.

Personally, I'd shut down the 60-drive system and test the HBAs separately (one at a time) in another chassis to rule out the source system.
Not that I disagree, but you should ask what the hardware is. The "45 Drives" Storinator XL60 Turbo we have where I work came with two of the HighPoint Rocket 750 cards like this:
https://www.amazon.com/High-Point-Rocket-750-PCI-Express/dp/B00C7JNPSQ
That used to be the default configuration, but I see they are not even offering that any more.
https://www.45drives.com/products/s...XL60-03&code=XL&software=Default&type=storage
You should ask what hardware is being used first, instead of assuming...
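
If the box still boots to a shell, the quickest way to answer that from the FreeNAS console is probably to read the vendor/device strings off the PCI list rather than guess from the spec sheet; something like:

Code:
# list PCI devices with vendor/device strings; the storage controllers
# (LSI/Avago vs. HighPoint) show up here by name
pciconf -lv
# list the attached disks and which bus/driver instance they sit behind
camcontrol devlist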
 

Hoeser

Dabbler
Joined
Sep 23, 2016
Messages
23
Not that I disagree, but you should ask what the hardware is. The "45 Drives" Storinator XL60 Turbo we have where I work came with two of the HighPoint Rocket 750 cards like this:
https://www.amazon.com/High-Point-Rocket-750-PCI-Express/dp/B00C7JNPSQ
That used to be the default configuration, but I see they are not even offering that any more.
https://www.45drives.com/products/s...XL60-03&code=XL&software=Default&type=storage
You should ask what hardware is being used first, instead of assuming...

Pump the brakes, Chris. I googled it, and according to Storinator that particular model is claimed to come with 4 x LSI 9305 cards... which is why I didn't ask. Also, the OP stated "HBA cards" - plural - so I figured my research was correct enough to make my recommendation.
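
If they really are the 9305s, the firmware version the mpr(4) driver found should already be sitting in the boot messages, so it can be read back without poking the cards again (which is apparently what hung the box); roughly:

Code:
# the mpr(4) driver logs the controller and its firmware version at attach time
dmesg | grep -i mpr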
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Pump the brakes, Chris. I googled it, and according to Storinator that particular model is claimed to come with 4 x LSI 9305 cards... which is why I didn't ask. Also, the OP stated "HBA cards" - plural - so I figured my research was correct enough to make my recommendation.
Please don't take it as a criticism of you. I am trying to share personal experience to help everyone involved.

Just a little more than ONE year ago, the standard configuration of that system was a pair of the Rocket 750 cards, with 32 drives connected to one and 28 drives connected to the other. I have two of those systems in my server room right now. Hate them, by the way. One problem with those Rocket 750 cards is that some people might mistakenly call them HBAs, which they aren't. Looking at the vendor page now, which I also did, you can't even order that configuration any more.
It is always best to ask what the hardware is because it is better to know than to guess.

I agree that the drive controller is likely the problem; the question in my mind is which drive controller it is.
 

Rowlf

Cadet
Joined
Aug 9, 2019
Messages
2
Hi,

Anything new on this topic? I'm running into the same error with a much simpler machine ;(

Code:
root@freenas[~]# zdb -e -c storage_01

Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 113 of 116 ...Assertion failed: entry_offset < sm->sm_start + sm->sm_size (0x800e200525e000 < 0xe4000000000), file /freenas-11-nightlies/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line 155.
zsh: abort (core dumped)  zdb -e -c storage_01
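
That looks like the same family of space map bounds assertion as the panic earlier in this thread, just hit from userland zdb instead of the kernel. If your zdb build supports the -A flags, re-running with assertions demoted should let the traversal continue past it (treat the output with suspicion, obviously); roughly:

Code:
# -AAA asks zdb not to abort when an assertion fails, so the leak/checksum pass can keep going
zdb -e -AAA -c storage_01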
 

dlavigne

Guest
Hi,

Anything new on this topic? I'm running into the same error with a much simpler machine ;(

Code:
root@freenas[~]# zdb -e -c storage_01

Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 113 of 116 ...Assertion failed: entry_offset < sm->sm_start + sm->sm_size (0x800e200525e000 < 0xe4000000000), file /freenas-11-nightlies/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line 155.
zsh: abort (core dumped)  zdb -e -c storage_01

Hard to tell without an analysis of the system. Have you tried creating a report at bugs.ixsystems.com that includes your debug?
 

Rowlf

Cadet
Joined
Aug 9, 2019
Messages
2
Hard to tell without an analysis of the system. Have you tried creating a report at bugs.ixsystems.com that includes your debug?
I was able to import the damaged pool in read-only mode and transfer the data. It seems this was caused by incompatible memory in the first place, which corrupted one pool. After changing the memory and re-creating the pools, everything works fine. Thanks for asking!
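
In case it helps anyone who lands here later, the recovery was essentially a read-only import followed by a file-level copy (you can't create new snapshots on a read-only pool, so zfs send only works from snapshots that already existed). The steps look roughly like this, with the destination path being just an example:

Code:
# read-only import of the damaged pool under an altroot, then copy the data off
zpool import -o readonly=on -R /mnt -f storage_01
rsync -a /mnt/storage_01/ /mnt/backup/rescue/   # destination is hypothetical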
 