Kernel Panic during ZFS Pool Scrub

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
Hi,

First, my configuration:
VMware ESXi virtualized VM
Running on 2 vCPUs with 16 GB of dedicated RAM
60 GB of allocated disk space, located on an SSD
Two 12 TB HDDs in a mirrored ZFS pool, created within TrueNAS

My problem:
The above configuration was running great with a FreeNAS installation. Then I discovered that the VM was stuck in a reboot loop, triggered by a kernel panic. After some research, I reinstalled TrueNAS 13.0-U3. The reboot loop was gone. I imported the ZFS pool residing on the two HDDs. The shell said I should run a scrub because the last 5 seconds of data had been lost, but that the data was consistent up to that point.
But after the pool was online again, the reboot loop came back. I disconnected the pool drives, and the reboots disappeared again.

Any ideas how I can dig into the problem? I think that as soon as the scrub is done, the reboot loop should also disappear, but how can I finish the scrub when the kernel panics again and again?
 

Attachments

  • Screenshot 2022-11-12 203513.png (266.8 KB)
Joined
Jan 7, 2015
Messages
1,155
Try reseating your RAM first: physically remove the modules, blow out the modules and the slots with air, and firmly reseat the RAM. If the problem persists, I'd do a full day of memtest.
 

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
Hi,
I ran a RAM memtest for over 24 hours. Nothing was found; everything works fine. There are three other VMs running on the same host without any issues.

Today I tried to reinstall TrueNAS again. I was able to import the pool and perform a resilver. It mentioned that there was one minor error. It ran until I started a scrub task; then it crashed immediately. Since then, I'm stuck in the continuous kernel-panic reboot situation.

Any idea what I can do about that? Scrubbing is a standard process within ZFS, I assume.
 
Joined
Jan 7, 2015
Messages
1,155
Does this happen at any other time ever?

I'd recommend testing all the disks in the host. Make sure those are absolutely good too: smartctl -t long /dev/daX
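Something along these lines, for example; the da0/da1 names are just assumptions, check camcontrol devlist (or the Disks page in the UI) for the actual device names on your system:

camcontrol devlist              # list the disks FreeBSD actually sees
smartctl -t long /dev/da0       # start a long self-test on each pool disk
smartctl -t long /dev/da1
# wait a few hours for the tests to finish, then read back the results:
smartctl -a /dev/da0
smartctl -a /dev/da1

If the disks only show up as VMware virtual disks, SMART probably won't report anything useful from inside the VM, which would itself be a hint about how they are attached.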

You're in a virtual environment, so there could be some other issue too; I'm no pro at it. But almost certainly a hardware issue somewhere. Test what you can.
 

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
Hi,

I ran some tests on the vmdk files and on the HDDs they are stored on. Nothing was found.

I imported the pool again and paused the scrub process right afterwards --> no problem, no kernel panic. After starting it again, kernel panic. But now I managed to browse through the log files. The fault is triggered by an integer divide in kernel mode while running the scrub.

Dec 12 01:03:26 truenas 1 2022-12-12T01:03:26.936825-08:00 truenas.local smartd 1134 - - Configuration file /usr/local/etc/smartd.conf parsed but has no entries
Dec 12 01:03:27 truenas 1 2022-12-12T01:03:27.128188-08:00 truenas.local daemon 1118 - - 2022-12-12 01:03:27,121:wsdd WARNING(pid 1119): no interface given, using all interfaces
Dec 12 01:03:32 truenas 1 2022-12-12T01:03:32.323637-08:00 truenas.local daemon 1281 - - 2022-12-12 01:03:32,322:wsdd WARNING(pid 1282): no interface given, using all interfaces
Dec 12 01:27:53 truenas syslog-ng[990]: syslog-ng starting up; version='3.35.1'
Dec 12 01:27:53 truenas Fatal trap 18: integer divide fault while in kernel mode
Dec 12 01:27:53 truenas cpuid = 0; apic id = 00
Dec 12 01:27:53 truenas instruction pointer = 0x20:0xffffffff828f2795
Dec 12 01:27:53 truenas stack pointer = 0x28:0xfffffe0111399050
Dec 12 01:27:53 truenas frame pointer = 0x28:0xfffffe0111399060
Dec 12 01:27:53 truenas code segment = base 0x0, limit 0xfffff, type 0x1b
Dec 12 01:27:53 truenas = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 12 01:27:53 truenas processor eflags = interrupt enabled, resume, IOPL = 0
Dec 12 01:27:53 truenas current process = 6 (txg_thread_enter)
Dec 12 01:27:53 truenas trap number = 18
Dec 12 01:27:53 truenas panic: integer divide fault
Dec 12 01:27:53 truenas cpuid = 0
Dec 12 01:27:53 truenas time = 1670837208
Dec 12 01:27:53 truenas KDB: stack backtrace:
Dec 12 01:27:53 truenas db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0111398e70
Dec 12 01:27:53 truenas vpanic() at vpanic+0x17f/frame 0xfffffe0111398ec0
Dec 12 01:27:53 truenas panic() at panic+0x43/frame 0xfffffe0111398f20
Dec 12 01:27:53 truenas trap_fatal() at trap_fatal+0x385/frame 0xfffffe0111398f80
Dec 12 01:27:53 truenas calltrap() at calltrap+0x8/frame 0xfffffe0111398f80
Dec 12 01:27:53 truenas --- trap 0x12, rip = 0xffffffff828f2795, rsp = 0xfffffe0111399050, rbp = 0xfffffe0111399060 ---
Dec 12 01:27:53 truenas ext_size_add() at ext_size_add+0x35/frame 0xfffffe0111399060
Dec 12 01:27:53 truenas range_tree_add_impl() at range_tree_add_impl+0x12f5/frame 0xfffffe0111399150
Dec 12 01:27:53 truenas scan_io_queue_insert_impl() at scan_io_queue_insert_impl+0xa7/frame 0xfffffe0111399190
Dec 12 01:27:53 truenas dsl_scan_scrub_cb() at dsl_scan_scrub_cb+0x6c4/frame 0xfffffe0111399210
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x4b0/frame 0xfffffe01113992c0
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x663/frame 0xfffffe0111399370
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe0111399420
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe01113994d0
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe0111399580
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe0111399630
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe01113996e0
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x332/frame 0xfffffe0111399790
Dec 12 01:27:53 truenas dsl_scan_visit_rootbp() at dsl_scan_visit_rootbp+0x12f/frame 0xfffffe01113997e0
Dec 12 01:27:53 truenas dsl_scan_visitds() at dsl_scan_visitds+0xc0/frame 0xfffffe0111399990
Dec 12 01:27:53 truenas dsl_scan_visit() at dsl_scan_visit+0x1f6/frame 0xfffffe0111399b80
Dec 12 01:27:53 truenas dsl_scan_sync() at dsl_scan_sync+0xc08/frame 0xfffffe0111399bf0
Dec 12 01:27:53 truenas spa_sync() at spa_sync+0xaf9/frame 0xfffffe0111399e20
Dec 12 01:27:53 truenas txg_sync_thread() at txg_sync_thread+0x30e/frame 0xfffffe0111399ef0
Dec 12 01:27:53 truenas fork_exit() at fork_exit+0x7e/frame 0xfffffe0111399f30
Dec 12 01:27:53 truenas fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0111399f30
Dec 12 01:27:53 truenas --- trap 0, rip = 0xffffffff80aa32cf, rsp = 0, rbp = 0xffffffff832fffa0 ---
Dec 12 01:27:53 truenas mi_startup() at mi_startup+0xdf/frame 0xffffffff832fffa0
Dec 12 01:27:53 truenas swapper() at swapper+0x69/frame 0xffffffff832ffff0
Dec 12 01:27:53 truenas btext() at btext+0x22
Dec 12 01:27:53 truenas KDB: enter: panic
Dec 12 01:27:53 truenas ---<<BOOT>>---
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
But almost certainly a hardware issue somewhere.

I wouldn't bet on that. Well, actually I would, but probably not the way you mean it.

Corruptions in hypervisor environments are unfortunately somewhat common, because hypervisors have a chilling tendency to reorder data which causes ZFS to lose its marbles. This is why we have a very specific, very clear set of guidelines for virtualizing TrueNAS.

I'm going to make a guess that at least one of the following things is true:

1) These disks are not directly attached to the TrueNAS VM using PCIe passthru and an LSI HBA,

2) These are vmdk disks that are situated on a VMFS datastore that lacks controller level redundancy (i.e. JBOD or RAID0),

3) The controller that runs these disks is a RAID controller with write caching

The pool is probably toast. Your best bet is to mount it read-only, and copy all the data that you can off of it. There is a good chance that it may crash during this process. That is what backups are for.

Then you will want to review the guide I posted nearly ten years ago on how to properly virtualize TrueNAS.
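A rough sketch of what the rescue step looks like; "tank" and the paths below are placeholders for your actual pool name and destination:

zpool import -o readonly=on tank                       # read-only import, nothing gets written back to the damaged pool
zpool status tank                                      # sanity check before copying
rsync -avP /mnt/tank/ user@backuphost:/backup/tank/    # copy off whatever is readable, e.g. over SSH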

 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Also, thread moved to the virtualization forum, since this really has very little to do with TrueNAS itself, and looks to be just a victim of improper virtualization.
 

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
I wouldn't bet on that. Well, actually I would, but probably not the way you mean it.

Corruptions in hypervisor environments are unfortunately somewhat common, because hypervisors have a chilling tendency to reorder data which causes ZFS to lose its marbles. This is why we have a very specific, very clear set of guidelines for virtualizing TrueNAS.

I'm going to make a guess that at least one of the following things is true:

1) These disks are not directly attached to the TrueNAS VM using PCIe passthru and an LSI HBA,

2) These are vmdk disks that are situated on a VMFS datastore that lacks controller level redundancy (i.e. JBOD or RAID0),

3) The controller that runs these disks is a RAID controller with write caching

The pool is probably toast. Your best bet is to mount it read-only, and copy all the data that you can off of it. There is a good chance that it may crash during this process. That is what backups are for.

Then you will want to review the guide I posted nearly ten years ago on how to properly virtualize TrueNAS.

Hi Jgreco,

You're right, the disks are not directly attached, and they are mirrored within TrueNAS; no RAID at the hardware level. I knew about the possibility of PCIe passthrough, but I had only used it in a different context; using it for the storage simply didn't come to mind. I'm running an HPE MicroServer Gen10. There is also the option to pass the Marvell SATA controller through.

I managed to SFTP into the storage. The most important files had already been backed up elsewhere beforehand, but I have some carefully collected (and obsolete) SD card images, too big to back up in a proper way.

I was a bit dissatisfied with my configuration anyway, as I never saw the S.M.A.R.T. status of the drives --> failure was always an option. Because of that, I checked the drives elsewhere and replaced one as soon as anything it reported looked curious.

I assume that with your approach I would also see this status within TrueNAS.
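I guess that once the controller is passed through, I could confirm the VM really sees the raw disks with something like this (the device names are just my assumption):

pciconf -lv                 # the passed-through SATA controller should show up here
camcontrol devlist          # the physical disks should appear, e.g. as ada0/ada1
smartctl -a /dev/ada0       # and SMART data should finally be readable directly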

P.S.: I might not have gotten the point of your USB stick handling right...
1. Install TrueNAS as the OS onto the first USB stick, with no ESXi, and configure it as intended.
2. Take a second USB stick, install ESXi on that, and do the ESXi setup.
3. Pass through the SATA controller.
4. Configure the TrueNAS VM and import the configuration from the first USB stick.
--> Boot the VM from the USB stick, or just import the configuration?
--> When booting the VM from USB, how is the performance? I usually boot the VMs from an NVMe SSD...

Regards
Andreas
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
carefully collected (and obsolete) SD card images, too big to back up in a proper way.
What size are we talking about? Anything up to 20 TB is certainly not a huge challenge these days with a couple of USB HDDs (a couple, so you have multiple generations and redundancy).
 