TrueNAS suddenly goes into reboot loop

digity

Contributor
Joined
Apr 24, 2016
Messages
156
I upgraded from TrueNAS Core 12.0-U5.1 to 12.0-U6 a couple of weeks ago, and two days ago it suddenly became unstable. At first I was getting multiple email alerts saying "* comet.lan had an unscheduled system reboot. The operating system successfully came back online at...", but TrueNAS did not appear to have actually rebooted, because my ESXi VMs stored on it never stopped, crashed, or had issues. After about 36 hours of this, though, my VMs really did stop running, and at the console I can see TrueNAS is now stuck in a reboot loop. The console prints "Beginning pools import", then several lines mentioning "zfs_panic_recover", and the last line before it automatically reboots is "cpu_reset_proxy: Stopped CPU 9".

Any ideas on how to resolve this? Would going back to U5.1 fix it? If so, how do I do that?
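For what it's worth, the "zfs_panic_recover" lines suggest ZFS is panicking while importing the pool. I've read that OpenZFS has recovery tunables that can be set from the FreeBSD loader prompt so a damaged pool imports instead of panicking; roughly this (untested on my box, and assuming TrueNAS Core 12 honors these tunables):

  # At the loader prompt (Esc at the TrueNAS boot menu), before booting:
  set vfs.zfs.recover=1   # tolerate certain on-disk inconsistencies during import
  set vfs.zfs.debug=1     # log recoverable errors instead of panicking
  boot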
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
How did you pass through disks to your VM? Did you pass through a PCI HBA, or just individual disks? The latter is known to be unstable.
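If TrueNAS is running as a VM, a quick way to tell from inside the guest is roughly this (assuming TrueNAS Core/FreeBSD; adjust the grep as needed):

  pciconf -lv | grep -B4 -i 'mass storage'  # a passed-through HBA shows up as its own PCI storage controller
  camcontrol devlist                        # RDM/virtual disks instead hang off ESXi's emulated controller (mpt/pvscsi)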
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
How did you pass through disks to your VM? Did you pass through a PCI HBA, or just individual disks? The latter is known to be unstable.
The ESXi datastore is stored on this TrueNAS server and shared over NFS via 40 GbE NICs.
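For reference, the export and the ESXi mount can be checked from each side along these lines (run on the respective machines):

  # On TrueNAS: list the active NFS exports
  showmount -e localhost
  # On the ESXi host (with SSH enabled): list NFS datastores and their mount state
  esxcli storage nfs list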
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
So far while troubleshooting this I have:

  1. Physically removed the pool's drives in question (3 x 1.6 TB NVMe U.2 SSDs) from the drive bays, after which TrueNAS booted successfully. The pool obviously shows up as unavailable in the web UI. I then plugged the SSDs back in, but TrueNAS does not see them (Storage --> Disks does not list the SSDs, and Storage --> Import does not populate with them). Rebooting TrueNAS with the SSDs installed just starts the reboot loop again (see the command sketch after this list).
  2. Changed the boot environment back to 12.0-U4.1, which did not resolve the boot loop; it reboots at the exact same spot (the boot environment commands are also in the sketch below).
  3. Remembered that I enabled autotrim on this pool within the last week or so and figured that might be the issue. To try to disable autotrim, I booted Ubuntu 20.04.3 and tried to import the pool (sudo zpool import -f encke), but got the error "cannot import 'encke': unsupported version or feature" ("This pool uses the following feature(s) not supported by this system: com.delphix:log_spacemap (Log metaslab changes on a single spacemap and flush them periodically.)"). Running "sudo zpool import -f -o readonly=on encke" does import the pool, but read-only (see the import sketch at the end of this post).
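A couple of the steps above in command form, for anyone following along (run from the TrueNAS shell; the boot environment name comes from beadm list):

  # (1.) Check whether the re-inserted NVMe SSDs are visible to the OS at all:
  nvmecontrol devlist        # NVMe controllers/namespaces FreeBSD can see
  camcontrol devlist         # SATA/SAS devices, for comparison
  # (2.) Roll back to an earlier boot environment (also doable via System --> Boot in the web UI):
  beadm list                 # list available boot environments
  beadm activate 12.0-U4.1   # make the old BE the default for the next boot
  reboot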
Based on the discoveries above, any suggestions on getting this installation running again with this pool, without having to go the data-recovery / destroy-and-re-create-the-pool / restore-from-backup route?
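One avenue I'm considering, based on (3.): Ubuntu 20.04.3 ships OpenZFS 0.8.x, which predates the log_spacemap feature, so a live environment with OpenZFS 2.0 or newer should be able to import the pool read-write and turn autotrim off. A rough sketch, untested (encke is my pool name):

  # From a live system running OpenZFS >= 2.0 (log_spacemap support);
  # -f because the pool was last imported by another host:
  sudo zpool import -f encke
  sudo zpool set autotrim=off encke   # undo the recent autotrim change
  sudo zpool export encke             # clean export so TrueNAS can re-import it

  # Fallback: even with only the read-only import, files can still be copied
  # off (new snapshots can't be created on a read-only pool):
  sudo zpool import -f -o readonly=on encke
  rsync -a /encke/ /path/to/backup/   # source depends on the pool's mountpoint property; destination is a placeholder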
 