middlewared setting up plugins failover_ python 3.8 exited signal 11 core dumped

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
I have just done a brand-new, fresh install of TrueNAS 12.0, using an ISO I downloaded an hour ago, on a VM running under KVM on a Dell R710 (a very similar software and hardware config to another R710 I have, which runs FreeNAS with no issues). The VM is configured with 16 CPUs set to copy the host configuration (the CPUs are E5520s) and 10 GB of RAM.
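For reference, here is roughly the equivalent CLI definition of the VM (a sketch only; I actually created it through virt-manager, so the VM name, disk paths, and --os-variant below are placeholders):

$ virt-install --name truenas --memory 10240 --vcpus 16 \
      --cpu host \
      --os-variant freebsd12.0 \
      --disk /dev/sdX2 --disk /dev/sdX3 --disk /dev/sdX4 \
      --cdrom /path/to/TrueNAS-12.0.iso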

The "disks" I have attached are 3 partitions on an SSD on the SATA controller. TrueNAS was installed on the first "disk".

Host OS is Ubuntu 20.04 server.

The VM has 3 PCI passthrough devices attached: 2 NICs (the onboard ones) and an LSI RAID controller with a SAS2008 chip crossflashed into HBA mode. IOMMU and all that is configured correctly, as far as I can tell, based on having done it on the other R710, and on seeing that the vfio module is bound to the devices, the DMAR output, etc.
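For anyone wanting to verify the same things on their own host, these are roughly the standard checks I mean (all on the Ubuntu side):

$ dmesg | grep -e DMAR -e IOMMU            # confirm the IOMMU initialised at boot
$ lspci -nnk                               # passed-through devices should show "Kernel driver in use: vfio-pci"
$ find /sys/kernel/iommu_groups/ -type l   # how devices map to IOMMU groups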

On the first (and every) boot, it gets to a point where it is loading the plugins, and plugin 53/54 fails, saying "setting up plugins (failover_) pid XXX (python 3.8), jid 0, uid 0: exited on signal 11 (core dumped)".

The PID changes slightly each boot, but is always around 200 (202, 206, 210, 211, etc.).

Screenshot:

[Screenshot: boot console showing the failover_ plugin crash and core dump]


It then hangs here for a minute or two before continuing, and I see it output the..

##############################################################
MIDDLEWARED FAILED TO START, SYSTEM WILL NOT BEHAVE CORRECTLY!
##############################################################

like I have seen in other people's posts (when searching for this issue).

It then keeps whipping along, outputting more boot messages, before printing some more "failed to run middleware call. Daemon not running?" errors.

I log in as root, and get a Python stack trace:

[Screenshot: Python stack trace, ending in self.sock.connect(self.bind_addr)]


It appears that it's not able to connect to itself?

I tried disconnecting the RAID controller PCI device (as it was also causing a "gptzfsboot: error 128 lba <some block #>" error), in case it was related, which made no difference to either problem.

I also tried connecting a bridged Ethernet adapter, as well as the 2 real PCI NICs, but that made no difference.

I also tried with only the bridged Ethernet adapter attached, and no PCI devices (neither NICs nor RAID), but it made no difference, other than that once the failed start timed out, it got an IP via DHCP, generated an SSH host key, etc. (The PCI NICs are not yet patched in, so DHCP would have failed on them, but I don't think that is related, since the middlewared failure occurs before DHCP is attempted.)

[Screenshot: console after the failed start timed out, showing the DHCP lease, SSH host key generation, and a DNS error]


The DNS error occurred after leaving the system at the login prompt for a little while. Strange, since it should have valid DNS as I saw it get a valid IP via DHCP earlier in the boot process.

Logging in as root gives the identical Python stack trace as before, including the "self.sock.connect(self.bind_addr)" error.
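In case it helps anyone debugging the same thing, my understanding (an assumption on my part; the socket path may differ by release) is that the CLI client connects to middlewared over a local socket, so these are the sorts of checks that can confirm whether the daemon is actually listening:

$ sockstat -u | grep middleware        # FreeBSD: list unix-domain sockets and their owners
$ ls -l /var/run/middlewared.sock      # assumed socket path; adjust if yours differs
$ tail -n 50 /var/log/middlewared.log  # the daemon's own log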

I'm at a loss. Is there something dumb I have misconfigured, or not configured at all, as necessary?
 

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
Not that it's relevant to the issue, but pinging 0.freebsd.pool.ntp.org from the shell works..
 

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
For fun, I installed FreeNAS 11.3-U5 on the same VM. It appeared to work fine.

I left it running for a while, but when I checked the console later, I noticed it had the same issue:

[Screenshot: the same middlewared failure on FreeNAS 11.3-U5]
 

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
I created a new VM, based on the same hardware configuration, on my existing R710, which has been running FreeNAS for years.

After the install, and on the first boot, I did not see the issue. However, when I tried to connect in my browser to the IP it had received via DHCP, the console immediately spewed..

[Screenshot: console error output triggered by the web UI connection attempt]

and the browser is stuck and never connects..

[Screenshot: browser stuck waiting on the web UI]
 

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
I tried creating a new VM from scratch, and performing a fresh install.. The result was the same.

I just downloaded 12.0-U1, and installed that in the new VM, and..

[Screenshot: the same failure on TrueNAS 12.0-U1]


Sigh.

I tried logging in as root and restarting it manually; it just hangs for minutes before telling me it failed.

[Screenshot: manual middlewared restart hanging]


I find this in the log..

[Screenshot: log excerpt]

[Screenshot: log excerpt, continued]


Is there anything I can do to attempt to get more information from whatever the Python script is that is failing?
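The only generic FreeBSD approaches I know of are along these lines (gdb may not be present on a stock install, and the core file name/location is a guess):

$ tail -n 100 /var/log/middlewared.log          # middlewared's own log
$ sysctl kern.corefile                          # where FreeBSD writes core dumps (default: %N.core in the cwd)
$ find / -name '*.core' 2>/dev/null             # locate the Python core dump
$ gdb /usr/local/bin/python3.8 python3.8.core   # if gdb is installed, 'bt' then shows the crashing C stack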
 

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
OK..

So, the CPU in the VM was set to "Nehalem", which was the default it got after creating the VM; I didn't do anything to change that.

Since there's not much else I could fiddle with that I hadn't already tried, with no change, I changed the CPU to "Hypervisor default", which changes it to "qemu64". I booted up the VM, and other than noticing it say "Unrecognised CPU", middlewared _didn't_ hang/crash/fail.

I shut it down, set it to "Copy host CPU configuration", and powered it on, which set it to "Nehalem-IBRS", and what do you know? middlewared setting up plugins failover_ core dumps again.
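For anyone following along, the mapping between the virt-manager choices and the underlying libvirt XML can be inspected like this (the VM name is a placeholder, and the comments reflect what I saw plus my understanding of the defaults):

$ virsh dumpxml truenas | grep -A3 '<cpu'
# "Hypervisor default"          -> no explicit model; qemu64 on x86_64
# "Copy host CPU configuration" -> <cpu mode='host-model'/>, which resolved to Nehalem-IBRS here
# a named model                 -> <cpu mode='custom'><model>Nehalem</model></cpu>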
 

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
I am wondering if this is something to do with the specific CPUs I have in the machine:

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Model name: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
Stepping: 5
CPU MHz: 1595.947
BogoMIPS: 4521.84
Virtualization: VT-x
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 2 MiB
L3 cache: 16 MiB
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15
Vulnerability Itlb multihit: KVM: Vulnerable
Vulnerability L1tf: Mitigation; PTE Inversion; VMX vulnerable
Vulnerability Mds: Vulnerable; SMT vulnerable
Vulnerability Meltdown: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida flush_l1d

and it lists all of the vulnerabilities, even though I have "mitigations=off" set in /etc/default/grub..

I have already ordered some replacement CPUs (Xeon X-something) to upgrade this..
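One thing worth noting: lscpu just reads the kernel's sysfs vulnerability files, and mitigations=off disables the kernel's mitigations rather than hiding the CPU's susceptibility, so "Vulnerable" is arguably the expected output here. To confirm the flag actually took effect:

$ cat /proc/cmdline                                 # mitigations=off should appear here
$ grep . /sys/devices/system/cpu/vulnerabilities/*  # the same per-vulnerability status lscpu reports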
 

Valen

Cadet
Joined
Feb 15, 2021
Messages
2
Just a heads up, I have had the same issue with the exact same CPU.
Did you wind up resolving the problem without changing the CPU model?
Thanks a whole bunch for posting this, btw. Got me out of a jam.
I have a feeling it's something to do with VMX instructions, just from earlier errors, but it's entirely possible I'm wrong about that.
 

Squigley

Dabbler
Joined
May 13, 2020
Messages
19
I think that after I changed it back to Nehalem (non-IBRS), it started working properly. I don't recall exactly, just that fiddling with the CPU setting got it to work. Shortly after, I received the Xeon X CPUs, installed those, and changed the VM back to copying the host config, and it remained working properly, at least until it corrupted the ZFS pool, twice, at which point I switched back to FreeNAS and have stayed there, because TrueNAS seems unstable to me.
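If anyone needs to pin the plain (non-IBRS) model without going through virt-manager, this is roughly the change (the VM name is a placeholder; the XML goes in the domain definition):

$ virsh edit truenas
# then set the <cpu> element to an explicit model instead of host-model:
#   <cpu mode='custom' match='exact'>
#     <model fallback='allow'>Nehalem</model>
#   </cpu>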
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Nehalem has a hardware limit on guest VMs. See https://www.truenas.com/community/threads/cant-add-more-than-cpu-to-vm.55397/. You may have hit that limit initially, since you're running it as a VM instead of on bare metal. As for running FreeNAS/TrueNAS as a VM, did you follow this recommendation?

 