Stolen_Walnut
Dabbler
- Joined
- Jan 1, 2023
- Messages
- 10
I appear to be tripping over a boot issue in Truenas scale 22.12.0 that apparently wasn't present in 22.02.4.
High level issue description:
HW:
Behaviors seen / debugging attempted / related forum posts found:
Welp... I was literally about to give up after that & just hit post thread... Umm...........
Is this expect?
Is this an acceptable fix to roll into a perm fix for TrueNAS?
Is there a preferred method to patch this for ourselves?
Thanks in advance!
(Note: Ohhhhhhhhh, silly me... I bet it's not hung; I bet it's actually sitting on the cli to configure the network -- When I was checking for vital signs, I was plugging in an Ethernet cord and waiting to see a MAC come through on the switch, so I was thinking it hadn't finished booting......... huh... At any rate. 20+ min boot is no good, especially given that the configuration is probably going to serial... which *double checks manual* this motherboard doesn't have to my knowledge.)
High level issue description:
- 10-20 seconds into the boot process (after Linux kernel launch), boot becomes sluggish:
- Boot process being output to screen stops taking milliseconds and starts taking tens of seconds for each line
- System does not complete boot process. (waited 8+ hours)
- Last output line of boot process "[ 1335.985838] systemd[1]: Started Journal Service."
- System still logs USB plug/unplug events to screen
HW:
- Motherboard make and model
- MSI MEG Z690I UNIFY
- CPU make and model
- Intel Core i5-12500
- RAM quantity
- G.Skill Ripjaws S5 64GB (2 x 32GB) DDR5-5200
- Hard drives, quantity, model numbers, and RAID configuration, including boot drives
- 3x:
- boot: SATA SSD - Samsung 870 EVO Series 2.5" 250GB
- no-pool-assignment NVMe: Inland Premium 2TB SSD M.2
- no-pool-assignment NVMe: Crucial - P3 2TB Internal SSD
- 3x:
- Hard disk controllers
- 3x integrated/on-board SSD/NVMe controllers:
- lspci -nnn | grep -E "0106|0108"
- 00:17.0 SATA controller [0106]: Intel Corporation Device [8086:7ae2] (rev 11)
- 09:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology Device [c0a9:540a] (rev 01)
- 13:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation PS5013 E13 NVMe Controller [1987:5013] (rev 01)
- lspci -nnn | grep -E "0106|0108"
- 3x integrated/on-board SSD/NVMe controllers:
- Network cards
- 3x:
- lspci -nnn | grep 0200
- 08:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:07b1] (rev 02)
- 0b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
- USB based:
- Belkin USB 2.0 Ethernet Adapter F4U047
- from dmesg on plugin:
- [ 7071.443790] usb 1-2: new high-speed USB device number 11 using xhci_hcd
- [ 7071.613305] usb 1-2: New USB device found, idVendor=0b95, idProduct=7720, bcdDevice= 0.01
- [ 7071.623043] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
- [ 7071.631132] usb 1-2: Product: AX88x72A
- [ 7071.635473] usb 1-2: Manufacturer: ASIX Elec. Corp.
- [ 7071.641140] usb 1-2: SerialNumber: 0DFF1F
- [ 7072.067353] asix 1-2:1.0 eth0: register 'asix' at usb-0000:00:14.0-2, ASIX AX88772 USB 2.0 Ethernet, 00:50:b6:0d:ff:1f
- [ 7072.082502] asix 1-2:1.0 enx0050b60dff1f: renamed from eth0
- from dmesg on plugin:
- Belkin USB 2.0 Ethernet Adapter F4U047
- lspci -nnn | grep 0200
- Note: USB based network card was used during the process
- 3x:
Behaviors seen / debugging attempted / related forum posts found:
- Assembled PC (no pools of any kind on any device)
- Installed windows onto SATA SSD to checkout thermals while tweaking fan curves / PL1 /PL2
- PL1 set to 65W
- PL2 set to 95W
- Note: I wouldn't expect these to mess with anything, but it's definitely something I went monkeying with so figured I'd mention it
- Note: Prime95 / CinebenchR23 never broke 70C during PL1, and only broke about ~80C momentarily during PL2; I.e. I don't think there's any thermal throttling happening
- Installed TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install
- Rebooted
- Sluggishness experienced
- Note: Didn't capture this at the time, (some logs later)
- issue description
- About 10-20 seconds after Linux kernel launch, boot process started crawling -- each line printed to the screen would take tens of seconds after the last instead of milliseconds.
- Note: only gave this about 5-10 mins before giving up
- Note: What I remember:
- Sluggishness started a few lines after some stuff about raid (hmm.. I'm not using raid, what is this? Do I have something funny enabled in the bios? I should reset that and try again later)
- systemd printed logs for the first time after sluggishness started
- Figured I'd consult the forums before scrutinizing too much
- Debug - Remove USB
- Found a forum post where people were reporting a similar issue being resolved after migrating away from a USB boot device
- Removed ALL USB devices; Reboot
- Sluggishness still present
- Plugged back in USB devices
- Debug - Legacy USB in BIOS
- Found a forum post of people reporting that setting Legacy USB Support in BIOS to disabled fixed their issue
- Disabled "Legacy USB Support" in BIOS; Reboot
- Sluggishness still present
- Re-enabled settings in BIOS
- Debug - Install previous version
- Installed TrueNAS Scale 22.02.4 on the SATA SSD via flash drive without issue during the install
- Rebooted
- Sluggishness resolved
- booted at a reasonable speed, got to cli, set a few network settings, pulled web interface, clicked around, rebooted, boot within ~ a min or two
- Hunting through journalctl for hints -- copied some suspicious looking / colored / bold logs to come back to and start attacking once I run out of ideas...
- Jan 01 15:10:35 tardinas kernel: pmd_set_huge: Cannot satisfy [mem 0xc0000000-0xc0200000] with a huge-page mapping due to MTRR override.
- pnp 00:03: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pre>
- Jan 01 15:10:35 tardinas kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
- Jan 01 15:10:35 tardinas kernel: acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
- Jan 01 15:10:35 tardinas kernel: nvme nvme0: missing or invalid SUBNQN field.
- Jan 01 15:10:35 tardinas kernel: nvme nvme1: missing or invalid SUBNQN field.
- Jan 01 15:10:35 tardinas kernel: ata8.00: supports DRM functions and may not be fully accessible
- Jan 01 15:10:35 tardinas kernel: ata8.00: supports DRM functions and may not be fully accessible
- Jan 01 15:10:35 tardinas kernel: thunderbolt: probe of 0000:0e:00.0 failed with error -110
- (raid6 stuff here)
- Jan 01 15:10:35 tardinas kernel: spl: loading out-of-tree module taints kernel.
- Jan 01 15:10:35 tardinas kernel: znvpair: module license 'CDDL' taints kernel.
- Jan 01 15:10:35 tardinas kernel: Disabling lock debugging due to kernel taint
- Jan 01 15:10:35 tardinas blkmapd[794]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
- Jan 01 15:10:40 tardinas systemd-modules-load[2088]: Failed to find module 'nvidia-drm'
- Jan 01 15:10:43 tardinas kernel: Error: Driver 'pcspkr' is already registered, aborting...
- Jan 01 15:10:55 tardinas systemd[1]: exim4-base.timer: Refusing to start, unit exim4-base.service to trigger not loaded.
- Jan 01 15:10:55 tardinas systemd[1]: Failed to start Daily exim4-base housekeeping.
- Jan 01 15:10:55 tardinas kernel: ipmi_si: Unable to find any System Interface(s)
- Jan 01 15:10:55 tardinas systemd[1]: nslcd.service: Failed with result 'exit-code'.
- Jan 01 15:10:55 tardinas systemd[1]: Failed to start LSB: LDAP connection daemon.
- Jan 01 15:10:55 tardinas systemd[1]: openipmi.service: Failed with result 'exit-code'.
- Jan 01 15:10:55 tardinas systemd[1]: Failed to start LSB: OpenIPMI Driver init script.
- Jan 01 15:11:02 tardinas systemd[1]: ntp.service: Failed with result 'exit-code'.
- Jan 01 15:11:06 tardinas avahi-daemon[2591]: XML_ParseBuffer() failed at line 1: no element found.
- Jan 01 15:11:06 tardinas avahi-daemon[2591]: Failed to load service group file /services/DEV_INFO.service, ignoring.
- Jan 01 15:11:06 tardinas avahi-daemon[2591]: XML_ParseBuffer() failed at line 1: no element found.
- Jan 01 15:11:06 tardinas avahi-daemon[2591]: Failed to load service group file /services/HTTP.service, ignoring.
- Jan 01 15:11:41 tardinas kernel: snd_hda_intel 0000:00:1f.3: Cannot probe codecs, giving up
- Jan 01 15:15:36 tardinas systemd-xdg-autostart-generator[7326]: Not generating service for XDG autostart app-at\x2dspi\x2ddbus\x2dbus-autostart.service, erro>
- Debug - upgrade from previous version
- With TrueNAS Scale 22.02.4 still installed, upgraded to TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install (Install in new boot environment)
- Error noticed "cannot create 'boot-pool/grub': dataset already exists"
- Seems benign
- Reboot
- Sluggishness present
- Let sit over night (8.7 hours)
- Boot never completed
- Last line of output was "[ 1335.xxxxxx] systemd[1]: Started Journal Service."
- Comment: So it apparently gave up after 22 mins?
- Unplugged / replugged a USB device, which was logged to the screen immediately without delay
- Comment: Seems like at least some parts of the kernel are up & are responsive..
- Last line of output was "[ 1335.xxxxxx] systemd[1]: Started Journal Service."
- Reboot
- Hand typed out some of the output to contrast 22.02.4 vs 22.12.0 w meld:
-
- Notes/comments from log comparison between TrueNAS versions:
- Processes logging around sluggishness start
- ZFS was one of the last things to load before sluggishness started
- systemd is the first thing to start outputting after sluggishness started
- Unclear from logs when systemd kicked off
- Early boot timing difference
- ZFS loads 15 seconds earlier
- Version updates between versions
- ZFS v2.1.5-1 -> v2.1.6-1
- systemd 247.3-6 -> 247.3.7+deb11u1
- fuse 7.32 -> 7.34
- New systemd logs
- "Mounting NFSD configuration filesystem"
- "Starting Wait on Disk Enumeration..."
- Processes logging around sluggishness start
- Notes/comments from log comparison between TrueNAS versions:
-
- Reboot
- From grub, pressing 'e' to edit (view) boot parameters of v22.12.0
- Notes/comments from Linux kernel boot param comparison between TrueNAS versions:
- Removed in v22.12.0: "systemd.unified_cgroup_hierarchy=0"
- Added in v22.12.0: "nvme_core.multipath=N"
- Notes/comments from Linux kernel boot param comparison between TrueNAS versions:
- Error noticed "cannot create 'boot-pool/grub': dataset already exists"
- With TrueNAS Scale 22.02.4 still installed, upgraded to TrueNAS Scale 22.12.0 on the SATA SSD via flash drive without issue during the install (Install in new boot environment)
- Debug - kernel parameters
- Add back in systemd option
- From grub,
- press e on v22.12.0 (temporary edit)
- To kernel params (line starting with Linux)
- Append " systemd.unified_cgroup_hierarchy=0" to end of line
- press ctrl-x to launch
- Sluggishness present
- Reboot
- From grub,
- Remove nvme option
- From grub,
- press e on v22.12.0 (temporary edit)
- To kernel params (line starting with Linux)
- Delete "nvme_core.multipath=N"
- press Ctrl-x to launch
- Sluggishness present
- Reboot
- From grub,
- Add in systemd & remove nvme option
- Sluggishness present
- Add back in systemd option
- Debug - Pulling hardware
- ALL USB devices (already ran this test, but re-baselining...)
- Removed:
- USB Keyboard
- USB Mouse
- USB to Eth
- (removed flash drive a while ago, but that's out too)
- Sluggishness present
- Removed:
- PCIe card (Leaving previous HW removed)
- Removed
- PCIe: Synology 10Gb Ethernet and M.2 Adapter card E10M20-T1
- Note: NVMe drives are not populated on this. I.e., I'm removing 10G ethernet, but not the NVMe devices
- PCIe: Synology 10Gb Ethernet and M.2 Adapter card E10M20-T1
- Sluggishness present
- Removed
- NVMe Drive (Leaving previous HW removed)
- Removed
- NVMe: Crucial - P3 2TB Internal SSD
- Sluggishness present
- Removed
- NVMe Drive (Leaving previous HW removed)
- Removed
- NVMe: Crucial - Inland Premium 2TB SSD M.2
- Sluggishness present
- Removed
- SATA Drive (Leaving previous HW removed)
- Waited ~3m after Linux kernel launch
- Removed
- SATA: Samsung 870 EVO Series 2.5" 250GB
- Llogs show ata8 down @ 209s
- SATA: Samsung 870 EVO Series 2.5" 250GB
- Sluggishness present
- Logs show much the same from 209s - 502s when it finally needed the drive again and complained boot-pool was missing
- Plugged SATA Drive back in..
- Front panel internal IO connectors (Leaving previous HW removed)
- Removed
- HD Audio
- USB 3.2 Gen 2x2 Type-C
- USB 3.2 Gen 1
- Sluggishness present
- Removed
- So at this point, the exhaustive list of motherboard connections are:
- Power Supply
- Processor
- RAM
- SATA to boot drive
- 3x PWM fans
- Front Panel header Power pins
- HDMI
- Uhh... I guess I could drop down 32g of ram
- RAM (Leaving previous HW removed)
- Removed
- 1x stick 32g ram
- Derp.. motherboard reset my bios since it detected change in ram... blew away secure boot disabled setting & prevented grub from launching Linux kernel.. that was going to be my next step anyways...
- Plugging in keyboard
- In bios, Setting secure boot to Disabled
- Removing keyboard
- Sluggishness present
- Since this is as stripped as I'm able to get my system, waiting the 20 mins to see if it is able to boot any further...
- Issue still present
- 41 mins later output shows
- [ 1331.884581] systemd[1]: Started Journal Service.
- Plugged in & removed USB device & got some printing out to the screen
- 41 mins later output shows
- Issue still present
- Removed
- ALL USB devices (already ran this test, but re-baselining...)
- Debug - Reset BIOS
- (Issue present - See last section)
- Debug - BIOS - Disable integrated peripherals
- Settings > Advanced > Integrated Peripherals
- Onboard LAN Controller: changed to "Disabled"
- Onboard CNVi Module Control: change to "Disable Integrated"
- HD Audio Controller: changed to "Disabled"
- Note: everything is disabled on this screen except for "VGA Detection" set to "Auto" instead of "Ignore"
- Save changes & reboot
- Sluggishness present
- Settings > Advanced > Integrated Peripherals
- Debug - BIOS - Enable Thunderbolt
- Note: something thunderbolt was being logged as an error just prior the sluggishness starting; in BIOS it was set to disabled
- Settings > Advanced > Intel (R) Thunderbolt
- Discrete Thunderbolt(TM) Support: changed to "Enabled"
- Save changes & reboot
- Note: Says there's a whole slew of changes associated -- didn't realize this was switching from integrated to (huh... wouldn't you know it...) "discrete" lol
- After grub, got "error out of memory"
- Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
- Reset BIOS
- Re-disabled secure boot
- Re-disabled Integrated peripherals
- Back to sluggish behavior (Ya-haaa, I'm a llama again!)
- Debug - Retrying 22.02.4 with 22.12.0 upgraded on top of it
- (as mentioned in #9 above, "Install in new boot environment" was selected during install)
- Booted to grub
- Sluggishness present
- Still hangs on Started Journal Service (hit "Started Journal Service" at 1286; I waited until 1927, so 10m after that point)
- Comment: Alright... I thought I had a rough idea what "in a new boot environment" meant, but I hadn't anticipated it stomping on something related to this issue... hmm
- Comment: Luckily my dmesg capture had the boot args of the 22.02.04 version before the install on top of it..
- Meld of before / after boot environment
-
- Comment: I hadn't realized the install clobbered this one with the systemd & nvme deltas... huh
- Comment: In all of the installs I've selected the 115200, not sure if that's of note... Hmm... I wouldn't think that order of consoles would matter, but I can't say I've tried that yet....
- Oh. My. Goodness.
- I just temp swapped the order in grub in 22.02.4, and it booted right up!
-
- Debug - Swapping consoles in grub
- Boot to grub
- Temp edit grub
- press 'e' on 22.12.0 option
- On the line starting with Linux,
- Delete " console=tty1"
- After "console=ttyS0,115200", type " console=tty1"
- Press Ctrl+X
- Holy heck. It worked
- lolwut
Welp... I was literally about to give up after that & just hit post thread... Umm...........
Is this expect?
Is this an acceptable fix to roll into a perm fix for TrueNAS?
Is there a preferred method to patch this for ourselves?
Thanks in advance!
(Note: Ohhhhhhhhh, silly me... I bet it's not hung; I bet it's actually sitting on the cli to configure the network -- When I was checking for vital signs, I was plugging in an Ethernet cord and waiting to see a MAC come through on the switch, so I was thinking it hadn't finished booting......... huh... At any rate. 20+ min boot is no good, especially given that the configuration is probably going to serial... which *double checks manual* this motherboard doesn't have to my knowledge.)