Kernel Panic during ZFS Pool Scrub

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
Hi,

First, my configuration:
VMware ESXi virtualized VM
Running on 2 vCPUs with 16 GB of dedicated RAM
60 GB of allocated disk space, located on an SSD
Two 12 TB HDDs in a mirrored ZFS pool, created within TrueNAS

My problem:
The above configuration was running great with a FreeNAS installation. Then I discovered that the VM was stuck in a reboot loop, triggered by a kernel panic. After some research, I reinstalled TrueNAS 13.0-U3. The reboot loop was gone. I imported the ZFS pool residing on the two HDDs. The shell said I should run a scrub because the last 5 seconds of data had been lost, but that the data was consistent up to that point.
But after the pool was online again, the reboot loop came back. I disconnected the pool drives, and the reboots disappeared again.

Any ideas how I can dig into the problem? I think that as soon as the scrub is done, the reboot loop should also disappear, but how can I finish the scrub when the kernel panics again and again?
 

Attachments

  • Screenshot 2022-11-12 203513.png (266.8 KB)
Joined
Jan 7, 2015
Messages
1,155
Try reseating your RAM first: physically remove the modules, blow out the modules and the slots with air, and firmly reseat the RAM. If the problem persists, I'd do a full day of memtest.
 

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
Hi,
I ran a RAM memtest for over 24 hours. Nothing was found; everything works fine. There are three other VMs running on the same host without any issues.

Today I tried to reinstall TrueNAS again. I was able to import the pool and perform a resilver. It mentioned that there was one minor error. It ran until I started a scrub task; then it crashed immediately. Since then, I'm stuck in the continuous kernel-panic reboot situation.

Any idea what I can do about that? Scrubbing is a standard process within ZFS, I assume.
 
Joined
Jan 7, 2015
Messages
1,155
Does this happen at any other time ever?

I'd recommend testing all the disks in the host. Make sure those are absolutely good too: smartctl -t long /dev/daX
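Something along these lines, for example; the da0/da1 names are just assumptions, check camcontrol devlist (or the Disks page in the UI) for the actual device names on your system:

camcontrol devlist              # list the disks FreeBSD actually sees
smartctl -t long /dev/da0       # start a long self-test on each pool disk
smartctl -t long /dev/da1
# wait a few hours for the tests to finish, then read back the results:
smartctl -a /dev/da0
smartctl -a /dev/da1

If the disks only show up as VMware virtual disks, SMART probably won't report anything useful from inside the VM, which would itself be a hint about how they are attached.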

You're in a virtual environment, so there could be some other issue too; I'm no pro at it. But almost certainly a hardware issue somewhere. Test what you can.
 

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
Hi,

I ran some tests on the vmdk files and on the HDDs they are stored on. Nothing was found.

I imported the pool again and paused the scrub process right afterwards --> no problem, no kernel panic. After starting it again, kernel panic. But now I managed to browse through the log files. The fault is triggered by an integer divide in kernel mode while running the scrub.

Dec 12 01:03:26 truenas 1 2022-12-12T01:03:26.936825-08:00 truenas.local smartd 1134 - - Configuration file /usr/local/etc/smartd.conf parsed but has no entries
Dec 12 01:03:27 truenas 1 2022-12-12T01:03:27.128188-08:00 truenas.local daemon 1118 - - 2022-12-12 01:03:27,121:wsdd WARNING(pid 1119): no interface given, using all interfaces
Dec 12 01:03:32 truenas 1 2022-12-12T01:03:32.323637-08:00 truenas.local daemon 1281 - - 2022-12-12 01:03:32,322:wsdd WARNING(pid 1282): no interface given, using all interfaces
Dec 12 01:27:53 truenas syslog-ng[990]: syslog-ng starting up; version='3.35.1'
Dec 12 01:27:53 truenas Fatal trap 18: integer divide fault while in kernel mode
Dec 12 01:27:53 truenas cpuid = 0; apic id = 00
Dec 12 01:27:53 truenas instruction pointer = 0x20:0xffffffff828f2795
Dec 12 01:27:53 truenas stack pointer = 0x28:0xfffffe0111399050
Dec 12 01:27:53 truenas frame pointer = 0x28:0xfffffe0111399060
Dec 12 01:27:53 truenas code segment = base 0x0, limit 0xfffff, type 0x1b
Dec 12 01:27:53 truenas = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 12 01:27:53 truenas processor eflags = interrupt enabled, resume, IOPL = 0
Dec 12 01:27:53 truenas current process = 6 (txg_thread_enter)
Dec 12 01:27:53 truenas trap number = 18
Dec 12 01:27:53 truenas panic: integer divide fault
Dec 12 01:27:53 truenas cpuid = 0
Dec 12 01:27:53 truenas time = 1670837208
Dec 12 01:27:53 truenas KDB: stack backtrace:
Dec 12 01:27:53 truenas db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0111398e70
Dec 12 01:27:53 truenas vpanic() at vpanic+0x17f/frame 0xfffffe0111398ec0
Dec 12 01:27:53 truenas panic() at panic+0x43/frame 0xfffffe0111398f20
Dec 12 01:27:53 truenas trap_fatal() at trap_fatal+0x385/frame 0xfffffe0111398f80
Dec 12 01:27:53 truenas calltrap() at calltrap+0x8/frame 0xfffffe0111398f80
Dec 12 01:27:53 truenas --- trap 0x12, rip = 0xffffffff828f2795, rsp = 0xfffffe0111399050, rbp = 0xfffffe0111399060 ---
Dec 12 01:27:53 truenas ext_size_add() at ext_size_add+0x35/frame 0xfffffe0111399060
Dec 12 01:27:53 truenas range_tree_add_impl() at range_tree_add_impl+0x12f5/frame 0xfffffe0111399150
Dec 12 01:27:53 truenas scan_io_queue_insert_impl() at scan_io_queue_insert_impl+0xa7/frame 0xfffffe0111399190
Dec 12 01:27:53 truenas dsl_scan_scrub_cb() at dsl_scan_scrub_cb+0x6c4/frame 0xfffffe0111399210
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x4b0/frame 0xfffffe01113992c0
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x663/frame 0xfffffe0111399370
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe0111399420
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe01113994d0
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe0111399580
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe0111399630
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x414/frame 0xfffffe01113996e0
Dec 12 01:27:53 truenas dsl_scan_visitbp() at dsl_scan_visitbp+0x332/frame 0xfffffe0111399790
Dec 12 01:27:53 truenas dsl_scan_visit_rootbp() at dsl_scan_visit_rootbp+0x12f/frame 0xfffffe01113997e0
Dec 12 01:27:53 truenas dsl_scan_visitds() at dsl_scan_visitds+0xc0/frame 0xfffffe0111399990
Dec 12 01:27:53 truenas dsl_scan_visit() at dsl_scan_visit+0x1f6/frame 0xfffffe0111399b80
Dec 12 01:27:53 truenas dsl_scan_sync() at dsl_scan_sync+0xc08/frame 0xfffffe0111399bf0
Dec 12 01:27:53 truenas spa_sync() at spa_sync+0xaf9/frame 0xfffffe0111399e20
Dec 12 01:27:53 truenas txg_sync_thread() at txg_sync_thread+0x30e/frame 0xfffffe0111399ef0
Dec 12 01:27:53 truenas fork_exit() at fork_exit+0x7e/frame 0xfffffe0111399f30
Dec 12 01:27:53 truenas fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0111399f30
Dec 12 01:27:53 truenas --- trap 0, rip = 0xffffffff80aa32cf, rsp = 0, rbp = 0xffffffff832fffa0 ---
Dec 12 01:27:53 truenas mi_startup() at mi_startup+0xdf/frame 0xffffffff832fffa0
Dec 12 01:27:53 truenas swapper() at swapper+0x69/frame 0xffffffff832ffff0
Dec 12 01:27:53 truenas btext() at btext+0x22
Dec 12 01:27:53 truenas KDB: enter: panic
Dec 12 01:27:53 truenas ---<<BOOT>>---
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
But almost certainly a hardware issue somewhere.

I wouldn't bet on that. Well, actually I would, but probably not the way you mean it.

Corruptions in hypervisor environments are unfortunately somewhat common, because hypervisors have a chilling tendency to reorder data which causes ZFS to lose its marbles. This is why we have a very specific, very clear set of guidelines for virtualizing TrueNAS.

I'm going to make a guess that at least one of the following things is true:

1) These disks are not directly attached to the TrueNAS VM using PCIe passthru and an LSI HBA,

2) These are vmdk disks that are situated on a VMFS datastore that lacks controller level redundancy (i.e. JBOD or RAID0),

3) The controller that runs these disks is a RAID controller with write caching

The pool is probably toast. Your best bet is to mount it read-only, and copy all the data that you can off of it. There is a good chance that it may crash during this process. That is what backups are for.

Then you will want to review the guide I posted nearly ten years ago on how to properly virtualize TrueNAS.
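A rough sketch of what the rescue step looks like; "tank" and the paths below are placeholders for your actual pool name and destination:

zpool import -o readonly=on tank                       # read-only import, nothing gets written back to the damaged pool
zpool status tank                                      # sanity check before copying
rsync -avP /mnt/tank/ user@backuphost:/backup/tank/    # copy off whatever is readable, e.g. over SSH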

 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Also, thread moved to the virtualization forum, since this really has very little to do with TrueNAS itself, and looks to be just a victim of improper virtualization.
 

andreasg

Cadet
Joined
Nov 12, 2022
Messages
4
I wouldn't bet on that. Well, actually I would, but probably not the way you mean it.

Corruptions in hypervisor environments are unfortunately somewhat common, because hypervisors have a chilling tendency to reorder data which causes ZFS to lose its marbles. This is why we have a very specific, very clear set of guidelines for virtualizing TrueNAS.

I'm going to make a guess that at least one of the following things is true:

1) These disks are not directly attached to the TrueNAS VM using PCIe passthru and an LSI HBA,

2) These are vmdk disks that are situated on a VMFS datastore that lacks controller level redundancy (i.e. JBOD or RAID0),

3) The controller that runs these disks is a RAID controller with write caching

The pool is probably toast. Your best bet is to mount it read-only, and copy all the data that you can off of it. There is a good chance that it may crash during this process. That is what backups are for.

Then you will want to review the guide I posted nearly ten years ago on how to properly virtualize TrueNAS.

Hi Jgreco,

You're right, the disks are not directly attached, and they are mirrored within TrueNAS; no RAID at the hardware level. I knew about the possibility of PCIe passthrough, but I had only used it in a different context; using it for the storage simply didn't come to mind. I'm running an HPE MicroServer Gen10. There is also the option to pass the Marvell SATA controller through.

I managed to SFTP into the storage. The most important files had already been backed up elsewhere beforehand, but I have some carefully collected (and obsolete) SD card images, too big to back up in a proper way.

I was a bit dissatisfied with my configuration anyway, as I never saw the S.M.A.R.T. status of the drives --> failure was always an option. Because of that, I checked the drives elsewhere and replaced one as soon as anything it reported looked curious.

I assume that with your approach I would also see this status within TrueNAS.
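I guess that once the controller is passed through, I could confirm the VM really sees the raw disks with something like this (the device names are just my assumption):

pciconf -lv                 # the passed-through SATA controller should show up here
camcontrol devlist          # the physical disks should appear, e.g. as ada0/ada1
smartctl -a /dev/ada0       # and SMART data should finally be readable directly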

P.S.: I might not have gotten the point of your USB stick handling right...
1. Install TrueNAS as the OS onto the first USB stick, with no ESXi, and configure it as intended.
2. Take a second USB stick, install ESXi on that, and do the ESXi setup.
3. Pass through the SATA controller.
4. Configure the TrueNAS VM and import the configuration from the first USB stick.
--> Boot the VM from the USB stick, or just import the configuration?
--> When booting the VM from USB, how is the performance? I usually boot the VMs from an NVMe SSD...

Regards
Andreas
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
carefully collected (and obsolete) SD card images, too big to back up in a proper way.
What size are we talking about? Anything up to 20 TB is certainly not a huge challenge these days with a couple of USB HDDs (a couple, so you have multiple generations and redundancy).
 