SOLVED TrueNAS Scale boot sequence stalling after loading ZFS module

Xeonian

Dabbler
Joined
May 15, 2019
Messages
10
Hi all, I'm experiencing some issues booting my TrueNAS instance.

The problems first manifested yesterday evening, when I noticed that I was unable to access the TrueNAS dashboard via browser, or connect to the server via SSH. This was a little odd, as the containerised applications on the instance were still operational, and I had only just uploaded a file via SFTP to Plex's media dataset. On the assumption that something had gone wrong with the native applications, I went to reboot the instance. Rebooting through the server shell reported an error (which sadly I did not make note of), at which point the shell would continue to take typed input, but was no longer responsive to any commands. Perhaps foolishly, I restarted the instance at a hardware level.

Since this reboot, I have been unable to get the OS fully operational. When the system boots, it progresses through the boot sequence for ~16 seconds, culminating with "ZFS: Loaded module...", at which point no further progress is made. After a few attempts to resolve the issue, I left the machine running overnight, but it had made no further progress by the morning.

I'll recount some of the attempts I've made to further diagnose the problem, but I'll admit I'm something of a novice when it comes to debugging issues this early in the boot sequence, so forgive me if some of these attempts come across as naive.

I have two mirrored USB boot drives (yes, I am aware this is no longer a recommended configuration, but I have yet to make the time to migrate away), and between them I have access to OS versions 22.12.2, 22.12.0, and 22.04.0, all of which exhibit the same behaviour. I have also removed each boot drive in turn and tried them in alternative USB ports, with no observable change in behaviour.

My NAS currently consists of a single data pool spread across two vdevs, each consisting of three physical drives in a RAIDZ1 configuration. I attempted to boot with a single physical drive disconnected, each in turn, to rule out a catastrophic drive failure as the cause, but again, this appeared to have no effect.

My current (hopeful) working theory is that I happened to try to access the dashboard while some maintenance activity - presumably a scrub - was running and starving the host's other services of resources, and that I then interrupted it with my gung-ho restart of the machine. The boot sequence would then be stalling because that task is trying to complete while the pool is being mounted, and it will take more than the eight hours or so the machine was left running overnight. While there are no logs to indicate that this is the case, there does appear to be drive activity - unfortunately my case does not have an LED indicator for this, but there is certainly the sound of activity.

Barring that being the case, I'm at a loss for what to do next. I could attempt to reinstall the OS, but given that I have already tried reverting to earlier versions of the OS, and that no user changes were made prior to the initial loss of access to the dashboard, I am pessimistic that it would make a difference.

Below, I've attached a screenshot of an example of the shell logs at the point where the boot sequence stalls. Naturally, there is more logged before this, but as this is occurring so early in the boot sequence I'm unclear how I would share these short of just recording a video.

The device specs are as follows:

CPU: Intel Pentium G4560
Motherboard: Supermicro X11SSL-F
Memory: 2x 8GB Crucial DDR4 Server Memory, PC4-17000 (2133); 2x 8GB Kingston Server Premier DDR4-2400
Boot drives: 2x 16GB SanDisk Ultra Fit USB Flash Drive
Data drives: 3x WD Red 4TB in RAIDZ1 configuration; 3x WD Red 6TB in RAIDZ1 configuration

Any advice would be appreciated, thanks.
 

Attachments

  • IMG_20230509_074501586_HDR.jpg (327.7 KB)

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I have two mirrored USB boot drives (yes, I am aware this is no longer a recommended configuration, but I have yet to have made the time to migrate away),
Before wasting time with more troubleshooting, I think it's time to bite the bullet and reinstall to an SSD or two. This could very well be a corrupted install.

The boot sequence is stalling as said maintenance task is attempting to complete while mounting the pool, and this will take more than the eight hours or so that the machine was left overnight.
Conceptually, ZFS defers block freeing operations to run in the background, since they're slow and POSIX doesn't care if the data has really been deleted by ZFS' lower layers. However, if a pool is imported with pending frees in the queue, it has to be emptied at import time.
That said, I can't imagine a realistic scenario in your case where this would take 8+ hours on your data pool. The boot pool, however, is subject to the whims of the USB flash drives, and who knows what insanity goes on there.
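If you want to sanity-check the pending-frees theory, you can boot a rescue/installer shell, import the data pool read-only and look at the freeing property. Roughly along these lines - "tank" is just a placeholder for whatever your pool is called:

# Import the pool read-only, without mounting datasets, under /mnt
zpool import -o readonly=on -N -R /mnt tank

# "freeing" is the amount of data still queued for deferred freeing;
# a huge value here would support the theory
zpool get freeing tank

# Detach it again when done
zpool export tank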
Motherboard: Supermicro X11SSL-F
Below, I've attached a screenshot of an example of the shell logs at the point where the boot sequence stalls. Naturally, there is more logged before this, but as this is occurring so early in the boot sequence I'm unclear how I would share these short of just recording a video.
You should use IPMI; it saves you the physical monitor and gives you the option of a serial console instead of just VGA plus keyboard and mouse. The serial console is very handy in these cases, if properly configured, because you can capture the whole boot output without needing a video. It's also good for turning multi-100 kB images into <10 kB of text.
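If you've never set it up: once serial console redirection is enabled in the BIOS and in TrueNAS, ipmitool from any other machine will give you a Serial-over-LAN session that you can simply log to a file. Something like this - the BMC address and credentials are placeholders:

# Open a Serial-over-LAN console and keep a copy of everything it prints
ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret sol activate | tee boot-log.txt

# Close the session afterwards
ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret sol deactivate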
 
Last edited:

Xeonian

Dabbler
Joined
May 15, 2019
Messages
10
Thanks for the quick response, and good call on using IPMI; somehow in five years of operating this system I hadn't thought to utilise it before now.

Sadly, it appears you are correct: there seems to be corruption on the USB keys. Across various attempts I managed to get one of the boot drives to report read errors, so I salvaged an SSD from my personal PC (one that had been supplanted by an M.2 drive) for a fresh install, which booted up painlessly.

Assuming I can work around the hardware fault to read data from the old boot drives, is there any way to extract my prior system configuration from an offline operating system, or should I presume it to be lost and recover my apps etc from the data pool manually?
 

Xeonian

Dabbler
Joined
May 15, 2019
Messages
10
Scratch that last part. I managed to mount the old boot partition to my laptop and pull out a configuration DB from the mirror disk, albeit a little outdated. Serves me right for not doing regular backups, I suppose.
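For anyone who finds this thread later, this is roughly what I ran from my laptop - from memory, so treat it as a sketch; the boot environment name will match whichever version you had installed (22.12.2 in my case), and the config database lives at /data/freenas-v1.db:

# Import the old boot pool read-only, without mounting anything automatically
sudo zpool import -o readonly=on -N boot-pool

# Find the root dataset for the boot environment
sudo zfs list -r boot-pool

# Mount it and copy the configuration database out
sudo mount -t zfs -o ro boot-pool/ROOT/22.12.2 /mnt
sudo cp /mnt/data/freenas-v1.db ~/truenas-config-backup.db

# Clean up
sudo umount /mnt
sudo zpool export boot-pool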

The server is up and running and I have my applications alive again, albeit with some shortcomings (a shortage of SATA connectors forcing me to temporarily offline a disk, and some oddities around migrated settings from the configuration import), but these are beyond the scope of the original problem and are ones I know how to resolve. I'm not too familiar with the TrueNAS forum software, but I'll try to label this thread as resolved if that's within my capability. Thanks again for the input.
 