SOLVED Rancher VM lockup (CPU#0 stuck for xx seconds) [docker:21201]

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
A Rancher VM was built under the FreeNAS 11.0 series and carried forward into 11.2. While it worked well in 11.0, I'm noticing that under 11.2 it needs to be rebooted daily, which means that any running containers/services get torn down as well. From the CLI, there appears to be some CPU resource locking. Any tips on how to resolve this?

The host CPU is a Xeon E5-2620 v4, and the host has 16 GB RAM.
Code:
sysctl hw.physmem
hw.physmem: 17013260288
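For reference, that raw byte count works out to just under 16 GiB (a quick conversion, nothing FreeNAS-specific):

```python
# Convert the sysctl hw.physmem value (bytes) to GiB.
physmem_bytes = 17013260288
physmem_gib = physmem_bytes / 2**30
print(f"{physmem_gib:.2f} GiB")  # ~15.84 GiB
```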


Type: Container Provider
Autostart: true
Virtual CPUs: 2
Memory: 2048 MB
BootLoader: GRUB
Com Port: /dev/nmdm6B
Description: RancherUI VM

The VM is reporting CPU#0 stuck.

When attempting to launch another VM, a possibly unrelated error is reported for a ClearLinux VM.

Error reported:

Code:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 161, in call_method
    result = await self.middleware.call_method(self, message)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1109, in call_method
    return await self._call(message['method'], serviceobj, methodobj, params, app=app, io_thread=False)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1049, in _call
    return await methodobj(*args)
  File "/usr/local/lib/python3.6/site-packages/middlewared/schema.py", line 664, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py", line 1131, in start
    await self.__init_guest_vmemory(vm, overcommit=overcommit)
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py", line 911, in __init_guest_vmemory
    raise CallError(f'Cannot guarantee memory for guest {vm["name"]}')
middlewared.service_exception.CallError: [EFAULT] Cannot guarantee memory for guest ClearLinuxVM1
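The gist of the failing check, as far as the traceback shows, is that with overcommit disabled the guest only starts if its full memory allocation can be guaranteed from reclaimable host memory. A hedged sketch of that logic (the function and parameter names here are illustrative, not middlewared's real API):

```python
# Illustrative sketch of a "can we guarantee memory?" pre-start check.
# All names are assumptions for this example, not the actual vm.py code.
def can_guarantee(guest_mb, free_mb, arc_shrinkable_mb, overcommit=False):
    if overcommit:
        # With overcommit enabled, start the guest regardless.
        return True
    # Otherwise require the full allocation to fit in reclaimable memory.
    return guest_mb <= free_mb + arc_shrinkable_mb

print(can_guarantee(2048, 1024, 512))  # False -> "Cannot guarantee memory"
print(can_guarantee(2048, 4096, 0))    # True
```

On a 16 GB box where ZFS ARC has grabbed most of the RAM, a 2048 MB guest can easily fail this kind of check even though the memory isn't "in use" in the usual sense.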
 


KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
Is this relevant? https://forums.freenas.org/index.ph...i-watchdog-bug-soft-lockup-cpu-0-stuck.62473/

You have 16GB in your FN box but how much memory are you allocating to your docker VMs, and did you choose to overcommit memory?

If both VMs existed in FN11 before you upgraded, then the error message results from a change in memory management introduced in FN11.2. ( see here, but note exact mechanism may have changed since post: https://forums.freenas.org/index.ph...-throttled-when-using-virtual-machines.69046/ )
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
KrisBee...thanks for the notes and links.

So, there is no overcommitted memory. I have a GitLab VM (Ubuntu 16.04) with 3072 MB RAM (shows as unprovisioned) and a RancherUI VM with 2048 MB RAM (shows as provisioned). What's the difference between provisioned and unprovisioned? I followed a similar setup process for both VMs, so what leads one VM to be provisioned and the other unprovisioned for memory?


The FreeNAS documentation is silent on provisioning of memory; it only covers provisioning for storage. So there's a gap in the documentation.

https://www.ixsystems.com/documenta...q=provisioned&check_keywords=yes&area=default
 

KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
IIRC, the only place I found any kind of definition of what "provisioned/unprovisioned" means is in the underlying VM code at /usr/local/lib/python3.6/site-packages/middlewared/plugins/vm.py; I seem to remember mentioning that in my link.
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
Thanks. The Python script has a somewhat self-referencing description of provisioning... not tremendously helpful for understanding, other than that it's pulling some state. It's still unclear to me how two VMs created within FN can end up with one provisioned and the other unprovisioned. In service-provider language, provisioning usually means a resource is managed or configured against a known limit.

Code:
async def get_vmemory_in_use(self):
    """
    The total amount of virtual memory in MB used by guests

    Returns a dict with the following information:
        RNP  - Running but not provisioned
        PRD  - Provisioned but not running
        RPRD - Running and provisioned
    """

Anyway, my Rancher VM re-appeared on its own today; usually I have to reboot it. I've talked to people using other environments, and sometimes specific CPUs (e.g. CPU 0 in a multi-core system) can cause virtualization issues, so they use scripts to pin VMs to certain ranges of CPUs and keep them off others.
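On FreeBSD the usual tool for that kind of pinning is cpuset(1). A sketch of how a running bhyve VM process could be kept off CPU 0 (the pgrep pattern matching the VM name is an assumption about how your VM process appears in the process list):

```shell
# Restrict an already-running bhyve process to CPUs 1-3, leaving CPU 0
# for the host. The grep pattern for the VM name is illustrative.
pid=$(pgrep -f 'bhyve.*RancherUI' | head -n 1)
cpuset -l 1-3 -p "$pid"

# Verify the new affinity mask:
cpuset -g -p "$pid"
```

Whether that actually helps with the soft-lockup here is untested; it's just the FreeBSD equivalent of the affinity scripts mentioned above.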
 

bobpaul

Dabbler
Joined
Dec 20, 2012
Messages
23
Code:
        """
        The total amount of virtual memory in MB used by guests

            Returns a dict with the following information:
                RNP - Running but not provisioned
                PRD - Provisioned but not running
                RPRD - Running and provisioned
        """
        memory_allocation = {'RNP': 0, 'PRD': 0, 'RPRD': 0}
        guests = await self.middleware.call('datastore.query', 'vm.vm')
        for guest in guests:
            status = await self.status(guest['id'])
            if status['state'] == 'RUNNING' and guest['autostart'] is False:
                memory_allocation['RNP'] += guest['memory'] * 1024 * 1024
            elif status['state'] == 'RUNNING' and guest['autostart'] is True:
                memory_allocation['RPRD'] += guest['memory'] * 1024 * 1024
            elif guest['autostart']:
                memory_allocation['PRD'] += guest['memory'] * 1024 * 1024


Autostart == provisioned.
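In other words, the quoted logic boils down to this mapping (a simplified restatement of the code above, not the actual middlewared function):

```python
# Simplified restatement of the vm.py classification quoted above.
def classify(state, autostart):
    if state == 'RUNNING' and not autostart:
        return 'RNP'   # Running but not provisioned
    if state == 'RUNNING' and autostart:
        return 'RPRD'  # Running and provisioned
    if autostart:
        return 'PRD'   # Provisioned but not running
    return None        # Stopped with autostart off: not counted at all

print(classify('RUNNING', True))   # RPRD
print(classify('STOPPED', True))   # PRD
```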
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
@bobpaul Thanks for working through the logic. It turned out to be tangential to the problem, but it's good to know, since this info isn't in the documentation (so far).
 

apwiggins

Dabbler
Joined
Dec 23, 2016
Messages
41
In the end, the problem was resolved by changing the VM's NIC from the E1000 interface to VIRTIO.
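For anyone hitting the same thing: the adapter type chosen in the FreeNAS UI corresponds to the bhyve network device emulation. Roughly (slot number and tap device name here are illustrative, and the elided flags stand in for the rest of the VM's configuration):

```shell
# bhyve network device argument with Intel e1000 emulation (problematic here):
#   bhyve ... -s 5:0,e1000,tap0 ... vmname
# Same slot using the virtio-net paravirtualized device (the fix):
#   bhyve ... -s 5:0,virtio-net,tap0 ... vmname
```

The guest needs virtio drivers for this to work, but most modern Linux distributions (including the container-oriented ones discussed in this thread) ship them by default.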
 