Booting stuck at "Import ZFS pools" on Scale 21.06

izomiac · Jul 14, 2021

I've currently got an issue where TrueNAS Scale 21.06 is failing to boot, and appears to be stuck importing my pools. It's currently at over 19 hours, so I'm wondering if I should just give it more time or what.

Before rebooting, I had four pools. "Pool", which is 8 x 6 TB HDDs in RAID-Z2, "SSD" which is a single SSD of a few hundred GB and where the system dataset is stored, boot, which is a pair of mirrored USB drives, and "ExternalHome" which is a single 6 TB drive where I made an external backup of my most essential files for transport to an offsite server that I recently got working again after a hardware failure. I was a little optimistic about replication speeds to ExternalHome, so I tried to disconnect the pool prior to replication finishing on some non-essential datasets, but that hung in the GUI and failed to complete after several minutes so I eventually had to disconnect the drive physically so I could throw it in my car and leave for the offsite location. When I returned a couple days later I noticed the ExternalHome pool still listed as unavailable, so I attempted to disconnect the pool again, which still hung at 80%. Later that night I noticed every CPU running at 100%, so I attempted a reboot. When it hadn't come up after four hours I noticed the error and tried reconnecting the drive for ExternalHome, but it made no difference. Last night I just let it run and it's currently at 19 hours.

My theories are that a) there's some ZFS equivalent of chkdisk running and it's going to take a while, b) it's trying to remount ExternalHome and having difficulties, or c) something terrible has happened to one of my pools. Before I try a fresh TrueNAS install I figured I'd touch base with some more knowledgeable people to see how long I should wait, and perhaps get some advice for attempting a recovery if something truly bad happened.

Kris Moore · Jul 15, 2021

It doesn't need to do that kind of check for that long.. You may want to hit ctrl-C and let it finish booting so you can do more investigation.

izomiac · Jul 18, 2021

So, Ctrl+C didn't let me cancel that part of the boot process, so I tried a fresh install. My "SSD" pool imported just fine, but my main pool, "Pool", gives an error on import.

Code:

Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 94, in main_worker
res = MIDDLEWARE._run(*call_args)
File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 45, in _run
return self._call(name, serviceobj, methodobj, args, job=job)
File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1008, in nf
return f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 388, in import_pool
self.logger.error(
File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 379, in import_pool
raise CallError(f'Pool {name_or_guid} not found.', errno.ENOENT)
middlewared.service_exception.CallError: [ENOENT] Pool 687291309676488922 not found.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 382, in run
await self.future
File "/usr/lib/python3/dist-packages/middlewared/job.py", line 418, in __run_body
rv = await self.method(*([self] + args))
File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1004, in nf
return await f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/middlewared/plugins/pool.py", line 1420, in import_pool
await self.middleware.call('zfs.pool.import_pool', pool['guid'], {
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1250, in call
return await self._call(
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1215, in _call
return await self._call_worker(name, *prepared_call.args)
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1221, in _call_worker
return await self.run_in_proc(main_worker, name, args, job)
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1148, in run_in_proc
return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1122, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [ENOENT] Pool 687291309676488922 not found.

The funny thing is, it still gave some familiar errors about my security camera datasets being at their quota (they stay there, the camera overwrites the oldest files if there's no freespace on the NFS share), so I checked the shell and, sure enough, Pool was mounted and the unencrypted datasets were fully accessible. No sign of it in the GUI though...

Code:

truenas# zpool status Pool
  pool: Pool
 state: ONLINE
  scan: resilvered 4.65T in 23:06:00 with 0 errors on Thu Jun 24 06:11:55 2021
config:

        NAME                                      STATE     READ WRITE CKSUM
        Pool                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            f7579d02-16e0-11ea-946b-0cc47acdefe7  ONLINE       0     0     0
            3b93472d-b529-11eb-ad8d-0cc47acdefe7  ONLINE       0     0     0
            da37f0fd-83c0-488b-8b51-17479554aa70  ONLINE       0     0     0
            18b69500-175e-11ea-946b-0cc47acdefe7  ONLINE       0     0     0
            0d551cf7-17c1-11ea-946b-0cc47acdefe7  ONLINE       0     0     0
            aaaf34c6-1d6b-11ea-bfd3-0cc47acdefe7  ONLINE       0     0     0
            4057ca04-5aef-4f76-bf05-7e71ce8f01dd  ONLINE       0     0     0
            be13c362-1814-11ea-946b-0cc47acdefe7  ONLINE       0     0     0

errors: No known data errors

izomiac · Jul 20, 2021

Ok, so I figured out the issue. From the fresh install I determine that the pool "Pool" was the problem so I disconnected all of the drives in Pool and was able to boot. From there I tried reconnecting the drives and importing, with identical results to the fresh install (GUI gives an error, but it's fully mounted in the shell). I then tried exporting from the shell, which worked fine, and importing in the shell succeeded, but it gave some warnings about being unable to create mount points for various child datasets under Pool/SSD. For context, since the pool "SSD" is hosted on a single drive, I have it replicate to the main pool with each hourly snapshot, so I can restore it whenever the SSD eventually dies. Being unable to create mount points wasn't a concern for me because I never plan to access the data from that location, it's simply a backup. Plus, I kind of assumed they'd be replicated along with everything else, and I had no idea about the warnings since I must have tweaked the replication task some weeks ago and hadn't rebooted since. Anyway, I destroyed that dataset and that fixed everything. The GUI no longer had any issues with importing it and I was able to boot just fine.

zetabax · Aug 26, 2021

After upgrading Scale beta with the nightly released on 8/26 I've ran into a similar issue.

Being a bit of a noob and was hoping someone could please point me to instructions on how to export and import them in shell please.

izomiac · Aug 26, 2021

The easiest way to get TrueNAS Scale to boot with this error is to physically disconnect your pools one-by-one while the server is off until it successfully boots. (I.e. for me my main storage pool was the problem so I removed all eight drives in that pool and tried booting without them. Remove every drive in a pool all at once while the system is off or you might trigger a resilver.) If that isn't practical then try booting from a fresh install USB drive / ISO image that doesn't have any pools configured.
Next, once TrueNAS is booted, reconnect the drives (if not using a fresh install).
Then open the shell ("System Settings" then "Shell" in the GUI).
Run "zpool list", verify that your pool is there, then run "zpool import YourPoolNameHere".
In my case I got a bunch of errors with import regarding some replicated datasets. I used "zfs destroy -r YourPoolNameHere/ProblematicDataset" to delete the dataset responsible for the errors since it was just a backup. You may want to use a different approach if the data is important.
Run "zpool export YourPoolNameHere" to unmount the pool, and repeat steps 4 - 6 until zpool has no complaints.
Reboot and the problem should be resolved. Or at least it was for me. Alternatively you can import your pool from the GUI under "Storage" if you want to start with a fresh install, just be sure you exported it in the shell before using the GUI to import it.
If you get your system working again with these steps then you may want to comment on the issue in the bugtracker since I haven't figured out how to reproduce the issue.

Disclaimer: This is from memory without testing. Don't try it on data you care about until you read the man page to understand what each command and switch does.

zetabax · Aug 26, 2021

Awesome, thank you so much @izomiac for putting this together for me. I'll give it a try!

shuxiao9058 · Aug 29, 2021

the same issue, stuff on import zfs pool.

flashdrive · Sep 16, 2021

Hello,

I reported this:

SCALE import pool fail during startup of system

I am not sure if this is Debian or SCALE related: I have added another drive in my system after I setup SCALE with a pool. The drive must have used the place of an already assigned HDD. Original setup on mainboard: SATA0 - HDD1 SATA1 - blank, nothing attached SATA2 - HDD2 SATA3 - HDD3 Now...

www.truenas.com

flashdrive · Sep 16, 2021

I can report this issue as well just today by upgrading from one nightly build of September to another nightly.

How does one file a bug ticket?

flashdrive · Sep 16, 2021

I have added a comment here:

https://jira.ixsystems.com/browse/NAS-111471

Why is JIRA so slow while loading?

c77dk · Sep 17, 2021

the one from this night 0917 fails the same way - happy that there's a way to boot back into 21.08

freqlabs · Sep 17, 2021

izomiac said:
In my case I got a bunch of errors with import regarding some replicated datasets.

Do you recall what kind of errors?

flashdrive · Sep 17, 2021

Hello iXsystems @freqlabs ,

I am happy to help out if you tell me what kind of data you need to have a look into it more deeply.

I will not alter my systems boot drive for that matter.

However this error I am seeing is super easy to replicate, even in Virtual Box.

Just do the nightly update from any given point in September to replicate.

freqlabs · Sep 17, 2021

flashdrive said:
However this error I am seeing is super easy to replicate, even in Virtual Box.

Just do the nightly update from any given point in September to replicate.

Unfortunately it's not something I've seen with my system running SCALE nightlies. I think we have some debug information attached to the ticket already, but it's tricky because we can't get debug from a system that isn't booting.

c77dk · Sep 17, 2021

freqlabs said:
Unfortunately it's not something I've seen with my system running SCALE nightlies. I think we have some debug information attached to the ticket already, but it's tricky because we can't get debug from a system that isn't booting.

is that a "fresh" install? Not something sidegraded from CORE? I'm wondering if it might be something with featureflags? Mine is an old CORE system, with the original features

ravingamm · Sep 26, 2021

izomiac said:
The easiest way to get TrueNAS Scale to boot with this error is to physically disconnect your pools one-by-one while the server is off until it successfully boots. (I.e. for me my main storage pool was the problem so I removed all eight drives in that pool and tried booting without them. Remove every drive in a pool all at once while the system is off or you might trigger a resilver.) If that isn't practical then try booting from a fresh install USB drive / ISO image that doesn't have any pools configured.

Next, once TrueNAS is booted, reconnect the drives (if not using a fresh install).

Then open the shell ("System Settings" then "Shell" in the GUI).

Run "zpool list", verify that your pool is there, then run "zpool import YourPoolNameHere".

In my case I got a bunch of errors with import regarding some replicated datasets. I used "zfs destroy -r YourPoolNameHere/ProblematicDataset" to delete the dataset responsible for the errors since it was just a backup. You may want to use a different approach if the data is important.

Run "zpool export YourPoolNameHere" to unmount the pool, and repeat steps 4 - 6 until zpool has no complaints.

Reboot and the problem should be resolved. Or at least it was for me. Alternatively you can import your pool from the GUI under "Storage" if you want to start with a fresh install, just be sure you exported it in the shell before using the GUI to import it.

If you get your system working again with these steps then you may want to comment on the issue in the bugtracker since I haven't figured out how to reproduce the issue.

Disclaimer: This is from memory without testing. Don't try it on data you care about until you read the man page to understand what each command and switch does.

I'm running TrueNAS Scale 21.08 BETA 1 and ran into this problem. I had setup replication tasks between two pools for my ix-application data and another persistent datastore I setup.

Upon a reboot I notice my NAS didn't come back up, on closer look it was stuck on "importing ZFS" I followed the above to fix mine..

-I removed all my disks which were used for backup (i had a pool of SSD just for VMs which I left online)
-rebooted the server and successfully got into the UI
-put the pool0 disks back in and ran the zpool import pool0 which showed the errors of read only on the pool0/backups datastore
-I put in the disks for pool1 and ran zpool import pool1 this also showed errors for readonly on pool1/backups
-I ran zfs destroy -r pool0/backups and pool1/backups which took awhile
-Once complete and rebooted the server and it could boot correctly.

thanks izomiac

Keyakinan · Oct 20, 2021

ravingamm said:
I'm running TrueNAS Scale 21.08 BETA 1 and ran into this problem. I had setup replication tasks between two pools for my ix-application data and another persistent datastore I setup.

Upon a reboot I notice my NAS didn't come back up, on closer look it was stuck on "importing ZFS" I followed the above to fix mine..

-I removed all my disks which were used for backup (i had a pool of SSD just for VMs which I left online)
-rebooted the server and successfully got into the UI
-put the pool0 disks back in and ran the zpool import pool0 which showed the errors of read only on the pool0/backups datastore
-I put in the disks for pool1 and ran zpool import pool1 this also showed errors for readonly on pool1/backups
-I ran zfs destroy -r pool0/backups and pool1/backups which took awhile
-Once complete and rebooted the server and it could boot correctly.

thanks izomiac

So your solution is completely destroying a pool.. what if one can't do that..?

flashdrive · Oct 20, 2021

Wait for a fix in the beta TN Scale...

Keyakinan · Oct 20, 2021

flashdrive said:
Wait for a fix in the beta TN Scale...

My whole server is done right now so not really an option? I don't think this is something that needs to be fixed but more a server side error. All 3 drives can be found and it boots find without a config file uploaded, but when I upload the config file only then it crashes..

Important Announcement for the TrueNAS Community.

Booting stuck at "Import ZFS pools" on Scale 21.06

Dabbler

SVP of Engineering

Dabbler

Dabbler

Dabbler

Dabbler

Dabbler

Cadet

Patron

Patron

Patron

Patron

iXsystems

Patron

iXsystems

Patron

Cadet

Dabbler

Patron

Dabbler

Similar threads