SOLVED Rancher VM lockup (CPU#0 stuck for xx seconds) [docker:21201]

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
A Rancher VM was built under the FreeNAS 11.0 series and carried forward into 11.2. While it worked well in 11.0, I'm noticing that under 11.2 it needs to be rebooted daily, which means that any running containers/services get torn down as well. From the CLI, there appears to be some CPU resource locking. Any tips on how to resolve this?

The host CPU is a Xeon E5-2620 v4, and the host has 16 GB RAM.
Code:
sysctl hw.physmem
hw.physmem: 17013260288
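For reference, that raw byte count works out to just under 16 GiB (a quick conversion, nothing FreeNAS-specific):

```python
# Convert the sysctl hw.physmem value (bytes) to GiB.
physmem_bytes = 17013260288
physmem_gib = physmem_bytes / 2**30
print(f"{physmem_gib:.2f} GiB")  # ~15.84 GiB
```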


Type: Container Provider
Autostart: true
Virtual CPUs: 2
Memory: 2048 MB
BootLoader: GRUB
Com Port: /dev/nmdm6B
Description: RancherUI VM

The VM is reporting CPU#0 stuck.

When attempting to launch another VM, a possibly unrelated error is reported for a ClearLinux VM.

Error reported:

Code:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 161, in call_method
    result = await self.middleware.call_method(self, message)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1109, in call_method
    return await self._call(message['method'], serviceobj, methodobj, params, app=app, io_thread=False)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1049, in _call
    return await methodobj(*args)
  File "/usr/local/lib/python3.6/site-packages/middlewared/schema.py", line 664, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py", line 1131, in start
    await self.__init_guest_vmemory(vm, overcommit=overcommit)
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py", line 911, in __init_guest_vmemory
    raise CallError(f'Cannot guarantee memory for guest {vm["name"]}')
middlewared.service_exception.CallError: [EFAULT] Cannot guarantee memory for guest ClearLinuxVM1
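The gist of the failing check, as far as the traceback shows, is that with overcommit disabled the guest only starts if its full memory allocation can be guaranteed from reclaimable host memory. A hedged sketch of that logic (the function and parameter names here are illustrative, not middlewared's real API):

```python
# Illustrative sketch of a "can we guarantee memory?" pre-start check.
# All names are assumptions for this example, not the actual vm.py code.
def can_guarantee(guest_mb, free_mb, arc_shrinkable_mb, overcommit=False):
    if overcommit:
        # With overcommit enabled, start the guest regardless.
        return True
    # Otherwise require the full allocation to fit in reclaimable memory.
    return guest_mb <= free_mb + arc_shrinkable_mb

print(can_guarantee(2048, 1024, 512))  # False -> "Cannot guarantee memory"
print(can_guarantee(2048, 4096, 0))    # True
```

On a 16 GB box where ZFS ARC has grabbed most of the RAM, a 2048 MB guest can easily fail this kind of check even though the memory isn't "in use" in the usual sense.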
 


KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
Is this relevant? https://forums.freenas.org/index.ph...i-watchdog-bug-soft-lockup-cpu-0-stuck.62473/

You have 16GB in your FN box but how much memory are you allocating to your docker VMs, and did you choose to overcommit memory?

If both VMs existed in FN11 before you upgraded, then the error message results from a change in memory management introduced in FN11.2. ( see here, but note exact mechanism may have changed since post: https://forums.freenas.org/index.ph...-throttled-when-using-virtual-machines.69046/ )
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
KrisBee...thanks for the notes and links.

So, there is no overcommitted memory. I have a GitLab VM (Ubuntu 16.04) with 3072 MB RAM (shows as unprovisioned) and a RancherUI VM with 2048 MB RAM (shows as provisioned). What's the difference between provisioned and unprovisioned? I followed a similar setup process for both VMs, so what leads one VM to be provisioned and the other unprovisioned for memory?


The FreeNAS documentation is silent on provisioning of memory; it only covers provisioning for storage. So there's a gap in the documentation.

https://www.ixsystems.com/documenta...q=provisioned&check_keywords=yes&area=default
 

KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
IIRC, the only place I found any kind of definition of what "provisioned/unprovisioned" means is in the underlying VM code at /usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py; I seem to remember mentioning that in my link.
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
Thanks. The Python script has a somewhat self-referencing description of provisioning... not tremendously helpful for understanding, other than that it's pulling some state. It's still unclear to me how two VMs created within FN can end up with one provisioned and the other unprovisioned. In service-provider language, provisioning usually means a resource is managed or configured against a known limit.

Code:
async def get_vmemory_in_use(self):
    """
    The total amount of virtual memory in MB used by guests

    Returns a dict with the following information:
        RNP  - Running but not provisioned
        PRD  - Provisioned but not running
        RPRD - Running and provisioned
    """

Anyway, my Rancher VM re-appeared on its own today; usually I have to reboot it. I've talked to people using other environments, and sometimes specific CPUs (e.g. CPU 0 in a multi-core system) can cause virtualization issues, so they use scripts to pin VMs to certain ranges of CPUs and keep them off others.
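On FreeBSD the usual tool for that kind of pinning is cpuset(1). A sketch of how a running bhyve VM process could be kept off CPU 0 (the pgrep pattern matching the VM name is an assumption about how your VM process appears in the process list):

```shell
# Restrict an already-running bhyve process to CPUs 1-3, leaving CPU 0
# for the host. The grep pattern for the VM name is illustrative.
pid=$(pgrep -f 'bhyve.*RancherUI' | head -n 1)
cpuset -l 1-3 -p "$pid"

# Verify the new affinity mask:
cpuset -g -p "$pid"
```

Whether that actually helps with the soft-lockup here is untested; it's just the FreeBSD equivalent of the affinity scripts mentioned above.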
 

bobpaul

Dabbler
Joined
Dec 20, 2012
Messages
23
Code:
        """
        The total amount of virtual memory in MB used by guests

            Returns a dict with the following information:
                RNP - Running but not provisioned
                PRD - Provisioned but not running
                RPRD - Running and provisioned
        """
        memory_allocation = {'RNP': 0, 'PRD': 0, 'RPRD': 0}
        guests = await self.middleware.call('datastore.query', 'vm.vm')
        for guest in guests:
            status = await self.status(guest['id'])
            if status['state'] == 'RUNNING' and guest['autostart'] is False:
                memory_allocation['RNP'] += guest['memory'] * 1024 * 1024
            elif status['state'] == 'RUNNING' and guest['autostart'] is True:
                memory_allocation['RPRD'] += guest['memory'] * 1024 * 1024
            elif guest['autostart']:
                memory_allocation['PRD'] += guest['memory'] * 1024 * 1024


Autostart == provisioned.
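In other words, the quoted logic boils down to this mapping (a simplified restatement of the code above, not the actual middlewared function):

```python
# Simplified restatement of the vm.py classification quoted above.
def classify(state, autostart):
    if state == 'RUNNING' and not autostart:
        return 'RNP'   # Running but not provisioned
    if state == 'RUNNING' and autostart:
        return 'RPRD'  # Running and provisioned
    if autostart:
        return 'PRD'   # Provisioned but not running
    return None        # Stopped with autostart off: not counted at all

print(classify('RUNNING', True))   # RPRD
print(classify('STOPPED', True))   # PRD
```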
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
@bobpaul Thanks for working through the logic. It turned out to be tangential to the problem, but it's good to know, since this info isn't in the documentation (so far).
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
In the end, the problem was resolved by changing the VM's NIC from the E1000 interface to VIRTIO.
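For anyone hitting the same thing: the adapter type chosen in the FreeNAS UI corresponds to the bhyve network device emulation. Roughly (slot number and tap device name here are illustrative, and the elided flags stand in for the rest of the VM's configuration):

```shell
# bhyve network device argument with Intel e1000 emulation (problematic here):
#   bhyve ... -s 5:0,e1000,tap0 ... vmname
# Same slot using the virtio-net paravirtualized device (the fix):
#   bhyve ... -s 5:0,virtio-net,tap0 ... vmname
```

The guest needs virtio drivers for this to work, but most modern Linux distributions (including the container-oriented ones discussed in this thread) ship them by default.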
 