TrueNAS suddenly goes into reboot loop

digity

Contributor
Joined
Apr 24, 2016
Messages
156
I upgraded from TrueNAS Core 12.0-U5.1 to 12.0-U6 a couple of weeks ago, and two days ago it suddenly became unstable. At first I was getting multiple email alerts saying "* comet.lan had an unscheduled system reboot. The operating system successfully came back online at...", but TrueNAS did not appear to have actually rebooted, because my ESXi VMs stored on it never stopped, crashed, or had issues. After about 36 hours of this, though, my VMs really did stop running, and at the console I can see TrueNAS is now stuck in a reboot loop. The console prints "Beginning pools import", then several lines mentioning "zfs_panic_recover", and the last line before it automatically reboots is "cpu_reset_proxy: Stopped CPU 9".

Any ideas on how to resolve this? Would going back to U5.1 fix it? If so, how do I do that?
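For what it's worth, the "zfs_panic_recover" lines suggest ZFS is panicking while importing the pool. I've read that OpenZFS has recovery tunables that can be set from the FreeBSD loader prompt so a damaged pool imports instead of panicking; roughly this (untested on my box, and assuming TrueNAS Core 12 honors these tunables):

  # At the loader prompt (Esc at the TrueNAS boot menu), before booting:
  set vfs.zfs.recover=1   # tolerate certain on-disk inconsistencies during import
  set vfs.zfs.debug=1     # log recoverable errors instead of panicking
  boot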
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
How did you pass through disks to your VM? Did you pass through a PCI HBA, or just individual disks? The latter is known to be unstable.
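If TrueNAS is running as a VM, a quick way to tell from inside the guest is roughly this (assuming TrueNAS Core/FreeBSD; adjust the grep as needed):

  pciconf -lv | grep -B4 -i 'mass storage'  # a passed-through HBA shows up as its own PCI storage controller
  camcontrol devlist                        # RDM/virtual disks instead hang off ESXi's emulated controller (mpt/pvscsi)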
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
How did you pass through disks to your VM? Did you pass through a PCI HBA, or just individual disks? The latter is known to be unstable.
The ESXi datastore is stored on this TrueNAS server and shared over NFS via 40 GbE NICs.
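For reference, the export and the ESXi mount can be checked from each side along these lines (run on the respective machines):

  # On TrueNAS: list the active NFS exports
  showmount -e localhost
  # On the ESXi host (with SSH enabled): list NFS datastores and their mount state
  esxcli storage nfs list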
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
So far while troubleshooting this I have:

  1. Physically removed the pool's drives in question (3 x 1.6 TB NVMe U.2 SSDs) from the drive bays, after which TrueNAS booted successfully. The pool obviously shows up as unavailable in the web UI. I then plugged the SSDs back in, but TrueNAS does not see them (Storage --> Disks does not list the SSDs, and Storage --> Import does not populate with them). Rebooting TrueNAS with the SSDs installed just starts the reboot loop again (see the command sketch after this list).
  2. Changed the boot environment back to 12.0-U4.1, which did not resolve the boot loop; it reboots at the exact same spot (the boot environment commands are also in the sketch below).
  3. Remembered that I enabled autotrim on this pool within the last week or so and figured that might be the issue. To try to disable autotrim, I booted Ubuntu 20.04.3 and tried to import the pool (sudo zpool import -f encke), but got the error "cannot import 'encke': unsupported version or feature" ("This pool uses the following feature(s) not supported by this system: com.delphix:log_spacemap (Log metaslab changes on a single spacemap and flush them periodically.)"). Running "sudo zpool import -f -o readonly=on encke" does import the pool, but read-only (see the import sketch at the end of this post).
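A couple of the steps above in command form, for anyone following along (run from the TrueNAS shell; the boot environment name comes from beadm list):

  # (1.) Check whether the re-inserted NVMe SSDs are visible to the OS at all:
  nvmecontrol devlist        # NVMe controllers/namespaces FreeBSD can see
  camcontrol devlist         # SATA/SAS devices, for comparison
  # (2.) Roll back to an earlier boot environment (also doable via System --> Boot in the web UI):
  beadm list                 # list available boot environments
  beadm activate 12.0-U4.1   # make the old BE the default for the next boot
  reboot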
Based on the discoveries above, any suggestions on getting this installation running again with this pool, without having to go the data-recovery / destroy-and-re-create-the-pool / restore-from-backup route?
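One avenue I'm considering, based on (3.): Ubuntu 20.04.3 ships OpenZFS 0.8.x, which predates the log_spacemap feature, so a live environment with OpenZFS 2.0 or newer should be able to import the pool read-write and turn autotrim off. A rough sketch, untested (encke is my pool name):

  # From a live system running OpenZFS >= 2.0 (log_spacemap support);
  # -f because the pool was last imported by another host:
  sudo zpool import -f encke
  sudo zpool set autotrim=off encke   # undo the recent autotrim change
  sudo zpool export encke             # clean export so TrueNAS can re-import it

  # Fallback: even with only the read-only import, files can still be copied
  # off (new snapshots can't be created on a read-only pool):
  sudo zpool import -f -o readonly=on encke
  rsync -a /encke/ /path/to/backup/   # source depends on the pool's mountpoint property; destination is a placeholder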
 