Weird reboot followed by ix-zfs.service job taking forever

CheeryFlame · Jul 31, 2023

mrfr34ky said:
When I updated to the latest patch the server came back rather quickly for me. Did you already update to 22.12.3.3?

Nope just noticed the update. I'm not much confident in upgrading anything before I repair my app pool which got a checksum error. I'm in the process of doing this with heavyscript but the cnpg databases are making me waste my time backing it up. Anyway I'll post more info once my system is more stable.

CheeryFlame · Jul 31, 2023

mrfr34ky said:
When I updated to the latest patch the server came back rather quickly for me. Did you already update to 22.12.3.3?

Just updated and the first reboot in fact got rid of this annoyance. I tried rebooting again and the issue came back immediately.

mrfr34ky · Jul 31, 2023

gravelfreeman said:
Just updated and the first reboot in fact got rid of this annoyance. I tried rebooting again and the issue came back immediately.

View attachment 68868

Damn

GyulaMasa · Aug 6, 2023

Something similar heppened to me too.
I use TrueNAS under Proxmox, in a VM.
I have 3 HDD passed through, 4 CPU cores of a E5-2630L v3 CPU, 42GB of ECC DDR3 RAM.
I installed the system like 2 weeks ago, after extensive testing, I started two replication jobs, both were pull operation from another TrueNAS backup server. First job was 4,5TB from the backup pool, then the 13TB from the media backup.
Last time, I checked webUI, it was almost finished with the second backup.
Today, I tried to log in, but the WebUI shows: "Waiting for Active TrueNAS controller to come up..."
(this makes me suspect, that the VM actually has restarted for esome reason, since the PC itself was not switched off for at least a week.)
If I go to Proxmox console, it shows that the same process is just waiting.
I already restarted the VM 2 times.
Finally it booted up, so lets hope, it was only a one time hickup...

grigory · Aug 9, 2023

I also have this issue, even on a fresh install after loading my config backup. This happened before the reinstall as well. Always seems to end after 15m and the system boots correctly.
I am not using a mirrored boot device.

Edit: I was running TrueNAS-SCALE-22.12.3.1 and used the same version for my fresh install.
Edit: I just remembered something, a while ago I had a kernel panic when an encrypted dataset would not delete. My pool would no longer import and so I added some tunables after getting some advice on discord. It looked a lot like this issue https://github.com/openzfs/zfs/issues/13483
Anyway, when I had those enabled I don't recall having this issue where the ix-zfs.service would timeout, or take 15m~ to load.

Code:

echo 1 > /sys/module/zfs/parameters/zil_replay_disable
echo 1 > /sys/module/zfs/parameters/zfs_recover
echo 0 > /sys/module/zfs/parameters/spa_load_verify_data
echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata

I am not suggesting that anyone enable those. However, I noticed this other issue https://github.com/openzfs/zfs/issues/10411
and there are some suggestions there that similar tunables resolved a similar issue:

echo "0" > /sys/module/zfs/parameters/spa_load_verify_metadata
echo "0" > /sys/module/zfs/parameters/spa_load_verify_data

Another issue here https://github.com/openzfs/zfs/discussions/8847#discussioncomment-231459
has the linked comment that suggests that this might have something to do with loading meta data?
They set the same two tunables to false
echo "0" > /sys/module/zfs/parameters/spa_load_verify_metadata
echo "0" > /sys/module/zfs/parameters/spa_load_verify_data
but then they say the issue comes back,

Then I enable zfs-import-cache service, then restart all zfs services and the issue is resolved until next reboot, which the problem comes back.

similar to what has been reported in this thread.
I cannot say that this is at all related but looks very similar. I wonder if this might be a zfs bug rather than an ixsystems bug?

CheeryFlame · Aug 10, 2023

This is now happening on my new server made from consumer PC parts.

This system used to boot in 1-2 minute, what a shame.

It happened after I tried enabling my GPU with this command:

Code:

# DON'T EVER USE THIS COMMAND
midclt call system.advanced.update '{"kernel_extra_options": "i915.force_probe=4692"}'

System wouldn't boot, I had to try multiple things but what worked for me is

In grub boot on Initial Install

Enter this command (thanks to shadowfall)

Code:

midclt call system.advanced.update '{"kernel_extra_options": ""}'

When I rebooted I got this nightmare on my new server.

I'm so tired.

mrfr34ky · Aug 10, 2023

gravelfreeman said:
This is now happening on my new server made from consumer PC parts.

This system used to boot in 1-2 minute, what a shame.

It happened after I tried enabling my GPU with this command:

Code:
# DON'T EVER USE THIS COMMAND midclt call system.advanced.update '{"kernel_extra_options": "i915.force_probe=4692"}'

System wouldn't boot, I had to try multiple things but what worked for me is

In grub boot on Initial Install

Enter this command (thanks to shadowfall)

Code:
midclt call system.advanced.update '{"kernel_extra_options": ""}'

When I rebooted I got this nightmare on my new server.

View attachment 69180

I'm so tired.

Weird. I never did anything like that yet have the issue :/ did anyone create a ticket to ix-systems yet?

CheeryFlame · Aug 10, 2023

Yes I just created the ticket, if you guys can chime in and say that I'm not making this out.

Issue navigator - iXsystems TrueNAS Jira

ixsystems.atlassian.net

HoneyBadger · Aug 14, 2023

Can anyone drop a debug file (System Settings -> Advanced -> Save Debug) in the ticket? Preferably from immediately following a "stalled" ix-zfs.service boot.

gravelfreeman said:
It happened after I tried enabling my GPU with this command:

Provided that you have the matching iGPU (verify with sudo dmesg | grep i915) this command shouldn't be harmful and in fact should enable the iGPU for transcode in Bluefin.

I'd be very cautious about turning on sysctls that will disable data/metadata verify on import unless your plan is to immediately scrub your pool after every import/reboot. Looking at some of the linked issues on Github where those tunables originated seems to indicate that a user had a slow/failing device (possibly remotely presented from a DAS with RAID logic) that was tripping the "deadman" slow vdev timer:

Code:

1559575072   spa_misc.c:408:spa_load_note(): spa_load(datapool, config untrusted): using uberblock with txg=31064811
1559575156   spa_misc.c:408:spa_load_note(): spa_load(datapool, config trusted): spa_load_verify found 0 metadata errors and 7 data errors
1559575156   mmp.c:249:mmp_thread_start(): MMP thread started pool 'datapool' gethrtime 514041347691
1559575156   spa.c:7556:spa_async_request(): spa=datapool async request task=1
1559575156   spa_misc.c:408:spa_load_note(): spa_load(datapool, config trusted): LOADED
1559575161   spa_history.c:319:spa_history_log_sync(): txg 31064813 open pool version 5000; software version unknown; uts ---=hostname_removed=-- 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 x86_64
1559575166   spa.c:7556:spa_async_request(): spa=datapool async request task=32
1559575167   spa_history.c:319:spa_history_log_sync(): txg 31064817 import pool version 5000; software version unknown; uts ---=hostname_removed=-- 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 x86_64
1559575167   dsl_scan.c:3537:dsl_scan_sync(): doing scan sync txg 31064817; ddt bm=0/0/0/0
1559575768   spa_misc.c:607:spa_deadman(): slow spa_sync: started 600 seconds ago, calls 1
1559575768   vdev.c:4668:vdev_deadman(): slow vdev: /dev/disk/by-id/dm-uuid-mpath-3600c0ff0002885a3fc7d605a01000000 has 1 active IOs
1559575828   spa_misc.c:607:spa_deadman(): slow spa_sync: started 660 seconds ago, calls 2
1559575828   vdev.c:4668:vdev_deadman(): slow vdev: /dev/disk/by-id/dm-uuid-mpath-3600c0ff00029196e227e605a01000000 has 1 active IOs
1559575888   spa_misc.c:607:spa_deadman(): slow spa_sync: started 720 seconds ago, calls 3
1559575888   vdev.c:4668:vdev_deadman(): slow vdev: /dev/disk/by-id/dm-uuid-mpath-3600c0ff0002885a3fc7d605a01000000 has 1 active IOs
1559575948   spa_misc.c:607:spa_deadman(): slow spa_sync: started 781 seconds ago, calls 4
1559575948   vdev.c:4668:vdev_deadman(): slow vdev: /dev/disk/by-id/dm-uuid-mpath-3600c0ff0002885a32e7e605a01000000 has 1 active IOs

CheeryFlame · Aug 21, 2023

HoneyBadger said:
Provided that you have the matching iGPU (verify with sudo dmesg | grep i915) this command shouldn't be harmful and in fact should enable the iGPU for transcode in Bluefin.

I have a matching GPU and it bricked my system. Was lucky enough to get back in the system 1 time after trying different configurations in the bios. Saved the config from there and reinstalled.

I sent a debug file although I read AFTER that it needs to be right after a reboot, I'll send another one next time I reboot but unfortunately I can't reboot this server as I want!

Thank you for helping us.

Important Announcement for the TrueNAS Community.

Weird reboot followed by ix-zfs.service job taking forever

CheeryFlame

Contributor

CheeryFlame

Contributor

mrfr34ky

Dabbler

GyulaMasa

Dabbler

grigory

Dabbler

CheeryFlame

Contributor

mrfr34ky

Dabbler

CheeryFlame

Contributor

Issue navigator - iXsystems TrueNAS Jira

HoneyBadger

actually does care

CheeryFlame

Contributor

Similar threads

Important Announcement for the TrueNAS Community.

Weird reboot followed by ix-zfs.service job taking forever

Contributor

Contributor

Dabbler

Dabbler

Dabbler

Contributor

Dabbler

Contributor

actually does care

Contributor

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Weird reboot followed by ix-zfs.service job taking forever"

Similar threads