Extremely slow boot seemingly hangs on Started Journal Service using ssd (no usb; no hdd) fresh install of 22.12.0 scale

Stolen_Walnut

Dabbler
Joined
Jan 1, 2023
Messages
10
I appear to be tripping over a boot issue in Truenas scale 22.12.0 that apparently wasn't present in 22.02.4.

High level issue description:
  1. 10-20 seconds into the boot process (after Linux kernel launch), boot becomes sluggish:
    1. Boot process being output to screen stops taking milliseconds and starts taking tens of seconds for each line
  2. System does not complete boot process. (waited 8+ hours)
    1. Last output line of boot process "[ 1335.985838] systemd[1]: Started Journal Service."
    2. System still logs USB plug/unplug events to screen
(PS - Seems to be "console=" order-related in grub. I figured that out at the very end of the post. Some of the debug / steps were order dependent, so I don't want to mess that up in case that needs to be referred back to... leaving the post as is. Feel free to skip to the very last section)

HW:
  • Motherboard make and model
    • MSI MEG Z690I UNIFY
  • CPU make and model
    • Intel Core i5-12500
  • RAM quantity
    • G.Skill Ripjaws S5 64GB (2 x 32GB) DDR5-5200
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives
    • 3x:
      • boot: SATA SSD - Samsung 870 EVO Series 2.5" 250GB
      • no-pool-assignment NVMe: Inland Premium 2TB SSD M.2
      • no-pool-assignment NVMe: Crucial - P3 2TB Internal SSD
  • Hard disk controllers
    • 3x integrated/on-board SSD/NVMe controllers:
      • lspci -nnn | grep -E "0106|0108"
        • 00:17.0 SATA controller [0106]: Intel Corporation Device [8086:7ae2] (rev 11)
        • 09:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology Device [c0a9:540a] (rev 01)
        • 13:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation PS5013 E13 NVMe Controller [1987:5013] (rev 01)
  • Network cards
    • 3x:
      • lspci -nnn | grep 0200
        • 08:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:07b1] (rev 02)
        • 0b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
      • USB based:
        • Belkin USB 2.0 Ethernet Adapter F4U047
          • from dmesg on plugin:
            • [ 7071.443790] usb 1-2: new high-speed USB device number 11 using xhci_hcd
            • [ 7071.613305] usb 1-2: New USB device found, idVendor=0b95, idProduct=7720, bcdDevice= 0.01
            • [ 7071.623043] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
            • [ 7071.631132] usb 1-2: Product: AX88x72A
            • [ 7071.635473] usb 1-2: Manufacturer: ASIX Elec. Corp.
            • [ 7071.641140] usb 1-2: SerialNumber: 0DFF1F
            • [ 7072.067353] asix 1-2:1.0 eth0: register 'asix' at usb-0000:00:14.0-2, ASIX AX88772 USB 2.0 Ethernet, 00:50:b6:0d:ff:1f
            • [ 7072.082502] asix 1-2:1.0 enx0050b60dff1f: renamed from eth0
    • Note: USB based network card was used during the process

Behaviors seen / debugging attempted / related forum posts found:
  1. Assembled PC (no pools of any kind on any device)
  2. Installed windows onto SATA SSD to checkout thermals while tweaking fan curves / PL1 /PL2
    1. PL1 set to 65W
    2. PL2 set to 95W
    3. Note: I wouldn't expect these to mess with anything, but it's definitely something I went monkeying with so figured I'd mention it
    4. Note: Prime95 / CinebenchR23 never broke 70C during PL1, and only broke about ~80C momentarily during PL2; I.e. I don't think there's any thermal throttling happening
  3. Installed TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install
  4. Rebooted
  5. Sluggishness experienced
    1. Note: Didn't capture this at the time, (some logs later)
    2. issue description
      1. About 10-20 seconds after Linux kernel launch, boot process started crawling -- each line printed to the screen would take tens of seconds after the last instead of milliseconds.
      2. Note: only gave this about 5-10 mins before giving up
    3. Note: What I remember:
      1. Sluggishness started a few lines after some stuff about raid (hmm.. I'm not using raid, what is this? Do I have something funny enabled in the bios? I should reset that and try again later)
      2. systemd printed logs for the first time after sluggishness started
      3. Figured I'd consult the forums before scrutinizing too much
  6. Debug - Remove USB
    1. Found a forum post where people were reporting a similar issue being resolved after migrating away from a USB boot device
    2. Removed ALL USB devices; Reboot
    3. Sluggishness still present
    4. Plugged back in USB devices
  7. Debug - Legacy USB in BIOS
    1. Found a forum post of people reporting that setting Legacy USB Support in BIOS to disabled fixed their issue
    2. Disabled "Legacy USB Support" in BIOS; Reboot
    3. Sluggishness still present
    4. Re-enabled settings in BIOS
  8. Debug - Install previous version
    1. Installed TrueNAS Scale 22.02.4 on the SATA SSD via flash drive without issue during the install
    2. Rebooted
    3. Sluggishness resolved
      1. booted at a reasonable speed, got to cli, set a few network settings, pulled web interface, clicked around, rebooted, boot within ~ a min or two
    4. Hunting through journalctl for hints -- copied some suspicious looking / colored / bold logs to come back to and start attacking once I run out of ideas...
      1. Jan 01 15:10:35 tardinas kernel: pmd_set_huge: Cannot satisfy [mem 0xc0000000-0xc0200000] with a huge-page mapping due to MTRR override.
      2. pnp 00:03: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pre>
      3. Jan 01 15:10:35 tardinas kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
      4. Jan 01 15:10:35 tardinas kernel: acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
      5. Jan 01 15:10:35 tardinas kernel: nvme nvme0: missing or invalid SUBNQN field.
      6. Jan 01 15:10:35 tardinas kernel: nvme nvme1: missing or invalid SUBNQN field.
      7. Jan 01 15:10:35 tardinas kernel: ata8.00: supports DRM functions and may not be fully accessible
      8. Jan 01 15:10:35 tardinas kernel: ata8.00: supports DRM functions and may not be fully accessible
      9. Jan 01 15:10:35 tardinas kernel: thunderbolt: probe of 0000:0e:00.0 failed with error -110
      10. (raid6 stuff here)
      11. Jan 01 15:10:35 tardinas kernel: spl: loading out-of-tree module taints kernel.
      12. Jan 01 15:10:35 tardinas kernel: znvpair: module license 'CDDL' taints kernel.
      13. Jan 01 15:10:35 tardinas kernel: Disabling lock debugging due to kernel taint
      14. Jan 01 15:10:35 tardinas blkmapd[794]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
      15. Jan 01 15:10:40 tardinas systemd-modules-load[2088]: Failed to find module 'nvidia-drm'
      16. Jan 01 15:10:43 tardinas kernel: Error: Driver 'pcspkr' is already registered, aborting...
      17. Jan 01 15:10:55 tardinas systemd[1]: exim4-base.timer: Refusing to start, unit exim4-base.service to trigger not loaded.
      18. Jan 01 15:10:55 tardinas systemd[1]: Failed to start Daily exim4-base housekeeping.
      19. Jan 01 15:10:55 tardinas kernel: ipmi_si: Unable to find any System Interface(s)
      20. Jan 01 15:10:55 tardinas systemd[1]: nslcd.service: Failed with result 'exit-code'.
      21. Jan 01 15:10:55 tardinas systemd[1]: Failed to start LSB: LDAP connection daemon.
      22. Jan 01 15:10:55 tardinas systemd[1]: openipmi.service: Failed with result 'exit-code'.
      23. Jan 01 15:10:55 tardinas systemd[1]: Failed to start LSB: OpenIPMI Driver init script.
      24. Jan 01 15:11:02 tardinas systemd[1]: ntp.service: Failed with result 'exit-code'.
      25. Jan 01 15:11:06 tardinas avahi-daemon[2591]: XML_ParseBuffer() failed at line 1: no element found.
      26. Jan 01 15:11:06 tardinas avahi-daemon[2591]: Failed to load service group file /services/DEV_INFO.service, ignoring.
      27. Jan 01 15:11:06 tardinas avahi-daemon[2591]: XML_ParseBuffer() failed at line 1: no element found.
      28. Jan 01 15:11:06 tardinas avahi-daemon[2591]: Failed to load service group file /services/HTTP.service, ignoring.
      29. Jan 01 15:11:41 tardinas kernel: snd_hda_intel 0000:00:1f.3: Cannot probe codecs, giving up
      30. Jan 01 15:15:36 tardinas systemd-xdg-autostart-generator[7326]: Not generating service for XDG autostart app-at\x2dspi\x2ddbus\x2dbus-autostart.service, erro>
  9. Debug - upgrade from previous version
    1. With TrueNAS Scale 22.02.4 still installed, upgraded to TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install (Install in new boot environment)
      1. Error noticed "cannot create 'boot-pool/grub': dataset already exists"
        1. Seems benign
      2. Reboot
      3. Sluggishness present
      4. Let sit over night (8.7 hours)
      5. Boot never completed
        1. Last line of output was "[ 1335.xxxxxx] systemd[1]: Started Journal Service."
          1. Comment: So it apparently gave up after 22 mins?
        2. Unplugged / replugged a USB device, which was logged to the screen immediately without delay
          1. Comment: Seems like at least some parts of the kernel are up & are responsive..
      6. Reboot
      7. Hand typed out some of the output to contrast 22.02.4 vs 22.12.0 w meld:
        1. Screenshot from 2023-01-02 13-22-17.png
          1. Notes/comments from log comparison between TrueNAS versions:
            1. Processes logging around sluggishness start
              1. ZFS was one of the last things to load before sluggishness started
              2. systemd is the first thing to start outputting after sluggishness started
                1. Unclear from logs when systemd kicked off
            2. Early boot timing difference
              1. ZFS loads 15 seconds earlier
            3. Version updates between versions
              1. ZFS v2.1.5-1 -> v2.1.6-1
              2. systemd 247.3-6 -> 247.3.7+deb11u1
              3. fuse 7.32 -> 7.34
            4. New systemd logs
              1. "Mounting NFSD configuration filesystem"
              2. "Starting Wait on Disk Enumeration..."
      8. Reboot
      9. From grub, pressing 'e' to edit (view) boot parameters of v22.12.0
        1. Notes/comments from Linux kernel boot param comparison between TrueNAS versions:
          1. Removed in v22.12.0: "systemd.unified_cgroup_hierarchy=0"
          2. Added in v22.12.0: "nvme_core.multipath=N"
  10. Debug - kernel parameters
    1. Add back in systemd option
      1. From grub,
        1. press e on v22.12.0 (temporary edit)
        2. To kernel params (line starting with Linux)
          1. Append " systemd.unified_cgroup_hierarchy=0" to end of line
        3. press ctrl-x to launch
      2. Sluggishness present
      3. Reboot
    2. Remove nvme option
      1. From grub,
        1. press e on v22.12.0 (temporary edit)
        2. To kernel params (line starting with Linux)
          1. Delete "nvme_core.multipath=N"
        3. press Ctrl-x to launch
      2. Sluggishness present
      3. Reboot
    3. Add in systemd & remove nvme option
      1. Sluggishness present
  11. Debug - Pulling hardware
    1. ALL USB devices (already ran this test, but re-baselining...)
      1. Removed:
        1. USB Keyboard
        2. USB Mouse
        3. USB to Eth
        4. (removed flash drive a while ago, but that's out too)
      2. Sluggishness present
    2. PCIe card (Leaving previous HW removed)
      1. Removed
        1. PCIe: Synology 10Gb Ethernet and M.2 Adapter card E10M20-T1
          1. Note: NVMe drives are not populated on this. I.e., I'm removing 10G ethernet, but not the NVMe devices
      2. Sluggishness present
    3. NVMe Drive (Leaving previous HW removed)
      1. Removed
        1. NVMe: Crucial - P3 2TB Internal SSD
      2. Sluggishness present
    4. NVMe Drive (Leaving previous HW removed)
      1. Removed
        1. NVMe: Crucial - Inland Premium 2TB SSD M.2
      2. Sluggishness present
    5. SATA Drive (Leaving previous HW removed)
      1. Waited ~3m after Linux kernel launch
      2. Removed
        1. SATA: Samsung 870 EVO Series 2.5" 250GB
          1. Llogs show ata8 down @ 209s
      3. Sluggishness present
        1. Logs show much the same from 209s - 502s when it finally needed the drive again and complained boot-pool was missing
    6. Plugged SATA Drive back in..
    7. Front panel internal IO connectors (Leaving previous HW removed)
      1. Removed
        1. HD Audio
        2. USB 3.2 Gen 2x2 Type-C
        3. USB 3.2 Gen 1
      2. Sluggishness present
    8. So at this point, the exhaustive list of motherboard connections are:
      1. Power Supply
      2. Processor
      3. RAM
      4. SATA to boot drive
      5. 3x PWM fans
      6. Front Panel header Power pins
      7. HDMI
    9. Uhh... I guess I could drop down 32g of ram
    10. RAM (Leaving previous HW removed)
      1. Removed
        1. 1x stick 32g ram
      2. Derp.. motherboard reset my bios since it detected change in ram... blew away secure boot disabled setting & prevented grub from launching Linux kernel.. that was going to be my next step anyways...
        1. Plugging in keyboard
        2. In bios, Setting secure boot to Disabled
        3. Removing keyboard
      3. Sluggishness present
      4. Since this is as stripped as I'm able to get my system, waiting the 20 mins to see if it is able to boot any further...
        1. Issue still present
          1. 41 mins later output shows
            1. [ 1331.884581] systemd[1]: Started Journal Service.
          2. Plugged in & removed USB device & got some printing out to the screen
  12. Debug - Reset BIOS
    1. (Issue present - See last section)
  13. Debug - BIOS - Disable integrated peripherals
    1. Settings > Advanced > Integrated Peripherals
      1. Onboard LAN Controller: changed to "Disabled"
      2. Onboard CNVi Module Control: change to "Disable Integrated"
      3. HD Audio Controller: changed to "Disabled"
      4. Note: everything is disabled on this screen except for "VGA Detection" set to "Auto" instead of "Ignore"
    2. Save changes & reboot
    3. Sluggishness present
  14. Debug - BIOS - Enable Thunderbolt
    1. Note: something thunderbolt was being logged as an error just prior the sluggishness starting; in BIOS it was set to disabled
    2. Settings > Advanced > Intel (R) Thunderbolt
      1. Discrete Thunderbolt(TM) Support: changed to "Enabled"
    3. Save changes & reboot
      1. Note: Says there's a whole slew of changes associated -- didn't realize this was switching from integrated to (huh... wouldn't you know it...) "discrete" lol
    4. After grub, got "error out of memory"
    5. Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
    6. Reset BIOS
      1. Re-disabled secure boot
      2. Re-disabled Integrated peripherals
    7. Back to sluggish behavior (Ya-haaa, I'm a llama again!)
  15. Debug - Retrying 22.02.4 with 22.12.0 upgraded on top of it
    1. (as mentioned in #9 above, "Install in new boot environment" was selected during install)
    2. Booted to grub
    3. Sluggishness present
    4. Still hangs on Started Journal Service (hit "Started Journal Service" at 1286; I waited until 1927, so 10m after that point)
    5. Comment: Alright... I thought I had a rough idea what "in a new boot environment" meant, but I hadn't anticipated it stomping on something related to this issue... hmm
    6. Comment: Luckily my dmesg capture had the boot args of the 22.02.04 version before the install on top of it..
    7. Meld of before / after boot environment
      1. Screenshot from 2023-01-02 18-44-01.png
        1. Comment: I hadn't realized the install clobbered this one with the systemd & nvme deltas... huh
        2. Comment: In all of the installs I've selected the 115200, not sure if that's of note... Hmm... I wouldn't think that order of consoles would matter, but I can't say I've tried that yet....
        3. Oh. My. Goodness.
        4. I just temp swapped the order in grub in 22.02.4, and it booted right up!
  16. Debug - Swapping consoles in grub
    1. Boot to grub
    2. Temp edit grub
      1. press 'e' on 22.12.0 option
      2. On the line starting with Linux,
        1. Delete " console=tty1"
        2. After "console=ttyS0,115200", type " console=tty1"
      3. Press Ctrl+X
    3. Holy heck. It worked
    4. lolwut

Welp... I was literally about to give up after that & just hit post thread... Umm...........

Is this expect?
Is this an acceptable fix to roll into a perm fix for TrueNAS?
Is there a preferred method to patch this for ourselves?

Thanks in advance!

(Note: Ohhhhhhhhh, silly me... I bet it's not hung; I bet it's actually sitting on the cli to configure the network -- When I was checking for vital signs, I was plugging in an Ethernet cord and waiting to see a MAC come through on the switch, so I was thinking it hadn't finished booting......... huh... At any rate. 20+ min boot is no good, especially given that the configuration is probably going to serial... which *double checks manual* this motherboard doesn't have to my knowledge.)
 

Stolen_Walnut

Dabbler
Joined
Jan 1, 2023
Messages
10
Scavenging through the forums some more...

Resolution here, IF you don't need a serial console: (turn off serial via GUI)

Experimenting some more...

[input] console kernel args from grub[output] Get serial boot logging[output] Get graphical boot logging[output] Slow boot behavior (~20m)Comments
""? (I think this is a no)
Yes

Issue not present
Serial Console: Disabled from GUI configures to this (see link)
"console=tty1"? (likely no)YesIssue not present
"console=ttyS0"? (likely yes)NoIssue present
"console=tty1 console=ttyS0,115200"? (likely yes)YesIssue presentDefault installation w baud option

(space character between consoles declaration args)
"console=ttyS0,115200 console=tty1"? (likely yes)YesIssue not present(space character between consoles declaration args)

I was about to say it makes me a bit uncomfortable to just disable boot logging all together, but apparently you get that without any console args to the kernel. Neato.

I don't have a serial header on this motherboard, so I can't test that, but assuming this last option doesn't botch things for the serial output, it seems like a clear winner to use for the default install so people don't run into this issue. Any chance anyone can prove out that assumption?
 
Top