Kubernetes & Cables... the challenges!

sos_nz

Explorer
Joined
Mar 17, 2023
Messages
58
As a new TrueNAS SCALE user, I'm impressed by the power and flexibility of the software, and the excellent NAS features.

However, as the saying goes, with great power (apps) comes great responsibility - and I have needed two TrueNAS SCALE reinstalls in the past week on my homelab setup, purely due to failures of the Kubernetes/app system. The first time was possibly my own fault, for restarting while a task was running and appeared to hang.

The second, just now, came after simply trying to install nginx from TrueCharts and removing it (after it failed to deploy), whereupon all my other apps (Nextcloud from TrueCharts, Syncthing and Jellyfin from the official catalog) also hung at "deploying" for more than an hour. After rebooting, the Kubernetes system wouldn't run due to "Failed to configure kubernetes cluster for Applications: [EFAULT] Missing 'hugetlb, cpu, cpuset' cgroup controller(s) which are required for apps to function".

I tried unsetting & resetting the pool, deleting ix-applications, and forcing an update to the CIDR from the advanced settings, all to no avail.

*frustrated*.

EDIT: The original title of this thread was "Kubernetes - not ready for primetime", since I was having all sorts of issues. As it turned out, in my case it was a hardware failure... however, there are enough other threads on here with the same or similar error messages that hopefully this thread will be of use to people :)
 
Last edited:

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Maybe try TrueNAS CORE and jails? :wink:
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
As a new TrueNAS SCALE user, I'm impressed by the power and flexibility of the software, and the excellent NAS features.

However, as the saying goes, with great power (apps) comes great responsibility - and I have needed two TrueNAS SCALE reinstalls in the past week on my homelab setup, purely due to failures of the Kubernetes/app system. The first time was possibly my own fault, for restarting while a task was running and appeared to hang.

The second, just now, came after simply trying to install nginx from TrueCharts and removing it (after it failed to deploy), whereupon all my other apps (Nextcloud from TrueCharts, Syncthing and Jellyfin from the official catalog) also hung at "deploying" for more than an hour. After rebooting, the Kubernetes system wouldn't run due to "Failed to configure kubernetes cluster for Applications: [EFAULT] Missing 'hugetlb, cpu, cpuset' cgroup controller(s) which are required for apps to function".

I tried unsetting & resetting the pool, deleting ix-applications, and forcing an update to the CIDR from the advanced settings, all to no avail.

*frustrated*.

Sorry to hear of your experience, but it's a good warning to everyone else.

No single App can be viewed as bug-free... they are complex scripts, and the developers can make mistakes. Only testing by large numbers of users finds each issue.
There is a large diversity of Apps... if they come from different Catalogs, they will not all have been tested together. There can be bugs or incompatibilities in the Apps/charts.
There are many ways of configuring Apps, IP networking and datasets - not all have been tested. Some work, but others may not. The number of bugs in TrueNAS and in the Kubernetes software has been reduced, but it's not perfect. There are another 200 bug fixes coming in 22.12.3.
Where something does go wrong, it is not always easy to identify and resolve the issue (we agree TrueNAS needs to get better at this). It may be a bug or a misconfiguration. However, TrueNAS does preserve the data that has been set up (it can be re-imported by SCALE or CORE).
If Apps are deployed systematically, with a plan... then when something goes wrong, it's appreciated if the problem is written up and, where appropriate, a bug is reported. The engineering team appreciates debugs along with clear descriptions of how to reproduce.


So, I would agree that a user deploying a wide diversity of Apps may encounter issues... this use-case is not "prime-time" and requires technical skills. It's very much an "early adopter" use-case.

A user deploying a few Apps from the Official or Enterprise Catalogs is less likely to encounter a serious issue. However, we are yet to recommend SCALE for conservative or mission-critical users.

We document the status of the software on this status page. Better luck with your next deployment... we're here to help.
 

sos_nz

Explorer
Joined
Mar 17, 2023
Messages
58
Thanks for the considered replies to my frustrated post :)

I'm working through / learning SCALE as I go - and am glad it's a home lab, not an enterprise deployment, for sure!
 

sos_nz

Explorer
Joined
Mar 17, 2023
Messages
58
OK - I've done a fresh reinstall of SCALE, imported my pool and re-established my SMB share (shared with Linux, Mac and Windows clients), along with a single user and group (ID 3000). I've also set up the networking and timezone/locale info.

Next up, I set the Kubernetes 'advanced settings': Node IP (0.0.0.0), my network adapter (enp36s0), gateway (router IP), and Cluster CIDR / Service CIDR / Cluster DNS IP (172.16.0.0/16, 172.17.0.0/16, 172.17.0.1), and set the pool. Kubernetes is now reinitialising on the fresh pool...

...Nope. Just got this error:

"Error: [EFAULT] Missing 'cpu, hugetlb, cpuset' cgroup controller(s) which are required for apps to function"

On a brand new install, with no pre-existing 'ix-applications' dataset.

What the...? I haven't even been able to mess anything up by installing my first app yet!
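
If anyone else lands here with this error, it's worth checking which cgroup controllers the kernel is actually exposing before blaming the pool. A minimal check (the path depends on whether your kernel is using cgroup v1 or v2):

Code:
# cgroup v2: the root cgroup lists every controller the kernel offers
cat /sys/fs/cgroup/cgroup.controllers
# cgroup v1: each controller is listed with its hierarchy and an enabled flag
cat /proc/cgroups

Either way, 'cpu', 'cpuset' and 'hugetlb' need to show up for apps to start.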
 
Last edited:

MisterE2002

Patron
Joined
Sep 5, 2015
Messages
211
Please, the next time (and maybe also for this issue), also look in the /var/log/*.log files. They sometimes give a hint. Or make a debug file and send it to iX.
That said, the last time I wanted to follow the Kubernetes stuff it was logging a *lot* of useless noise.
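
For example (exact log file names vary between releases, so check what's actually in /var/log):

Code:
# The middleware log is usually the first place app/Kubernetes errors show up
tail -n 200 /var/log/middlewared.log
# On SCALE the k3s daemon logs here, if present
tail -n 200 /var/log/k3s_daemon.log
# Or follow both live while reproducing the problem
tail -f /var/log/middlewared.log /var/log/k3s_daemon.log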
 

sos_nz

Explorer
Joined
Mar 17, 2023
Messages
58
On attempting to set the pool (it's currently unset), I get:
"Error: [EINVAL] kubernetes_update.force: Missing '/mnt/mypool/ix-applications/config.json' configuration file. Specify force to override this and let system re-initialize applications."

I've read various threads here where people have had some success altering the Cluster CIDR address, then ticking "force" and "save", but for me that doesn't seem to do anything, even with reboots in between the steps. Is altering the CIDR and checking "force" the "force to override" that the error message is referring to? Is there another CLI way to fix this?

I might try again to create a new pool (on a 16GB external USB drive) just for the purpose of recreating the ix-applications folder on it, then, as some folks have suggested, copying the missing config.json file across to the main pool.
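
Judging by the 'kubernetes_update.force' name in that error, the GUI checkbox presumably maps onto the middleware's kubernetes.update method, so something like this from a root shell might be the CLI equivalent (untested; the payload fields are my guess from the error message):

Code:
# Re-point apps at the pool and force it to re-initialize despite the missing config.json
# ('mypool' and the payload fields are guesses based on the kubernetes_update.force error)
midclt call -job kubernetes.update '{"pool": "mypool", "force": true}'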

EDIT:
So I deleted the ix-applications dataset again, chose the pool, and it re-created the ix-applications folder apparently successfully. But when I try to install an app:

Code:
Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 426, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 461, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1186, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1318, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/chart_release.py", line 397, in do_create
    await self.middleware.call('kubernetes.validate_k8s_setup')
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1386, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1335, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/kubernetes_linux/update.py", line 513, in validate_k8s_setup
    raise CallError(error)
middlewared.service_exception.CallError: [EFAULT] Kubernetes service is not running.


I'll head off to work for a few hours and see what it does after a reboot...

EDIT:
After a reboot, I'm now getting this error:

Failed to start kubernetes cluster for Applications: [EFAULT] Unable to configure node: Cannot connect to host 127.0.0.1:6443 ssl:default [Connect call failed ('127.0.0.1', 6443)]


Haven't seen this one before! Any thoughts?
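
In the meantime, 6443 is the Kubernetes API port, and SCALE runs k3s under the hood, so checking whether the service is even up seems like a sensible first step:

Code:
# Is the k3s service running at all?
systemctl status k3s
# If it is, can the bundled kubectl reach the API on 127.0.0.1:6443?
k3s kubectl get nodes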
 
Last edited:

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
On attempting to set the pool (it's currently unset), I get:
"Error: [EINVAL] kubernetes_update.force: Missing '/mnt/mypool/ix-applications/config.json' configuration file. Specify force to override this and let system re-initialize applications."

I've read various threads here where people have had some success altering the Cluster CIDR address, then ticking "force" and "save", but for me that doesn't seem to do anything, even with reboots in between the steps. Is altering the CIDR and checking "force" the "force to override" that the error message is referring to? Is there another CLI way to fix this?

I might try again to create a new pool (on a 16GB external USB drive) just for the purpose of recreating the ix-applications folder on it, then, as some folks have suggested, copying the missing config.json file across to the main pool.

EDIT:
So I deleted the ix-applications dataset again, chose the pool, and it re-created the ix-applications folder apparently successfully. But when I try to install an app:

Code:
Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 426, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 461, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1186, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1318, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/chart_release.py", line 397, in do_create
    await self.middleware.call('kubernetes.validate_k8s_setup')
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1386, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1335, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/kubernetes_linux/update.py", line 513, in validate_k8s_setup
    raise CallError(error)
middlewared.service_exception.CallError: [EFAULT] Kubernetes service is not running.


I'll head off to work for a few hours and see what it does after a reboot...

EDIT:
After a reboot, I'm now getting this error:

Failed to start kubernetes cluster for Applications: [EFAULT] Unable to configure node: Cannot connect to host 127.0.0.1:6443 ssl:default [Connect call failed ('127.0.0.1', 6443)]


Haven't seen this one before! Any thoughts?
It's beyond my pay grade.

It seems likely that something is corrupted (a bug, or config + bug) in your system. It's probably hard to diagnose because of the history of issues... we don't know when an issue was introduced.

It's easier to handle first-order problems... the system is operating well and then ONE PROBLEM occurs. We can then focus on that issue, knowing the system was in a good state.

If you are stuck, feel free to report a bug... but I doubt the engineering team can reproduce it. If they know of a cause of this fault, they may be able to provide you with a recovery process.
 

sos_nz

Explorer
Joined
Mar 17, 2023
Messages
58
It's beyond my pay grade.

How about $5 for a new ethernet cable? When something like the above doesn't make sense... look for a hardware problem.

I had some fun and installed CORE last night and had a good look around. It feels nice, stable, functional and subjectively just a bit more 'snappy' than SCALE when moving between tasks. In a word: "mature". This comes at the expense of some of the newer bells & whistles of SCALE (for example NFSv4 permissions, which I needed in order to share a Syncthing folder with my SMB share). "Modern" would be my one-word comparison for SCALE.

Anyway, to cut a long cable short, I noticed in the dashboard that my ethernet was only connected at 100TX, and occasionally cutting out altogether. Swapping in a known-good cable -> 1000Mbit.
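
For anyone else chasing a flaky link: the negotiated speed is easy to confirm from a shell, and 100Mb/s on gigabit gear is a classic bad-cable symptom (the interface name is from my box, substitute your own):

Code:
# Show negotiated speed/duplex and whether the link is up at all
ethtool enp36s0 | grep -E 'Speed|Duplex|Link detected'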

Encouraged, I reinstalled SCALE to see if that was the issue (it's *very* helpful that one can import datasets so readily between CORE and SCALE and across reinstalls).

All seems well so far with Kubernetes - Jellyfin is back up & running, with Syncthing & Nextcloud next. And I'll try not to fiddle with host paths & permissions etc. to the point of breaking the installation again.

Fingers crossed.
 
Last edited:

sos_nz

Explorer
Joined
Mar 17, 2023
Messages
58
Last update, then I'll leave you all in peace.

My SCALE install is up and running with Nextcloud (TrueCharts) behind HAProxy, plus Syncthing and Jellyfin. All working like a charm!
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
It's beyond my pay grade.

It seems likely that something is corrupted (a bug, or config + bug) in your system. It's probably hard to diagnose because of the history of issues... we don't know when an issue was introduced.

It's easier to handle first-order problems... the system is operating well and then ONE PROBLEM occurs. We can then focus on that issue, knowing the system was in a good state.

If you are stuck, feel free to report a bug... but I doubt the engineering team can reproduce it. If they know of a cause of this fault, they may be able to provide you with a recovery process.
This is a fairly common issue and really easy to reproduce! I'll be looking into it in the coming week; I expect it has to do with the iXsystems configs and scripts not performing proper error handling.

For instance: install a few apps and wait till the jobs are done. Then add another app and reboot before the task has completed. Chances are no app works anymore.
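
If you want to see what is actually wedged when that happens, SCALE's k3s ships with kubectl built in (pod and app names below are placeholders):

Code:
# List every pod in every namespace; stuck apps sit in Pending/ContainerCreating
k3s kubectl get pods --all-namespaces
# Describe a stuck pod to see the events blocking it (SCALE apps live in ix-<appname> namespaces)
k3s kubectl describe pod <pod-name> -n ix-<appname>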

Rebooting is usually not something you do often, of course. That one app's failure affects another is problematic, though. But this is neither a Docker nor a Kubernetes problem, since I've never seen a configuration error in a container, pod or deployment affect anything else that's running - at least as long as you don't do anything stupid.

Maybe try TrueNAS CORE and jails? :wink:
Is that really the same thing though? Genuinely asking.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Is that really the same thing though? Genuinely asking.
No it's not, that's the point. It's the oldest and most robust container technology out there, with far less quirky networking than the mess presented by Kubernetes. My opinion, of course.
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
No it's not, that's the point. It's the oldest and most robust container technology out there, with far less quirky networking than the mess presented by Kubernetes. My opinion, of course.
Well, I mean, it's a different beast. I know loads of old colleagues who start twitching at the mention of jails, due to their prior frustrations with it.

Kubernetes is just an orchestrator, though; TrueNAS uses Docker underneath. This has little to do with Kubernetes, although it's a popular combination.

Kubernetes can be extremely robust if set up correctly. It can also be a big steaming pile of, you know ;)

I haven't looked into it yet, but I suspect the issue is not with Kubernetes but with the iXsystems scripting.

Personally, I appreciate the convenience both Docker and Kubernetes bring. I don't mind the overhead of containerization compared to jails, since compared to VMs the overhead is trivial. I will say that if anyone runs it professionally, I hope they understand the risks of containerization; regrettably, few people do.

Networking is the thing I most often have to assist with at work, and I agree it's overly convoluted. It does bring nice benefits, though, which could also be achieved with hardware networking infrastructure.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Jails are simple - a directory tree with a complete FreeBSD userland installed. You "boot" that on the host's kernel in a jailed (i.e. containerized) environment, complete with a virtual network stack and all. Each jail has a lo0 interface internally and usually at least one external interface that is connected to some wire via a bridge. You can SSH into it, bridge it to a VLAN completely different from your host's connection, whatever...

Best combined with ZFS - if that directory tree is at least one dedicated ZFS dataset you can snapshot, replicate, rollback, ...
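
For instance (dataset names invented for the example):

Code:
# Snapshot a jail's dataset before an upgrade...
zfs snapshot tank/jails/web01@pre-upgrade
# ...roll the whole jail back if the upgrade goes sideways...
zfs rollback tank/jails/web01@pre-upgrade
# ...or replicate it to another host
zfs send tank/jails/web01@pre-upgrade | ssh backuphost zfs recv backup/jails/web01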

That's why I wayyy... prefer them over Docker. I run more than 1000 of them in production.
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
True, jails are way simpler. Personally I haven't used them recently (recently being 15 years, give or take) and I'm not comfortable enough with the technology anymore.

Docker and Kubernetes, on the other hand, I use, configure and debug regularly.

For my use case, and most, jails would probably suffice. That doesn't mean Kubernetes or Docker have no benefits for other use cases; there are many where plain jails would be impractical.

For work I love Docker and Kubernetes because I can set everything up, add scripting and configuration management, and have an easy, out-of-the-box way to run systems in whatever configuration I need, including stubs. I then distribute it to the team so they can develop in peace and not have to worry about the complexity. That would be a lot more difficult with jails. Not trying to argue, just saying it all depends on the use case.

In this case you might be spot on, barring the fact that you seem to have a fair bit of knowledge on the subject, while many may struggle with even the most basic command-line interface. In the end most people get overwhelmed and/or don't want to be bothered to learn.

Not sure how easy it is to set up a jail on TrueNAS CORE, or FreeBSD, currently. I do know that there were quite a few opportunities to screw things up in the past. A quick search tells me that with TrueNAS CORE it's trivial, it seems.

I might spin up a CORE test rig somewhere to play with; I'm curious now. See what you did :P
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I also see use cases for Kubernetes etc. The problems start with "organisations who think they are Google" :wink:

The ones who realise they are not Google and simply want a managed LAMP/FAMP stack with regular updates and "everything and the kitchen sink included" are frequently our customers and rent jails for hosting their web applications.

Our devs use Docker on their Macs - secretly starting a hypervisor with a Linux VM to run those workloads, but who cares as long as it works?

Our CTO supports the FreeBSD + jails approach of the hosting team and even said something along the lines of "if you cannot develop a PHP thingy on Linux and then deploy on FreeBSD and just adapt, you have more fundamental problems". We have a set of Ansible plays for all the moving parts that cover the local Docker dev environment, customers who insist on running Linux, and our own infrastructure running FreeBSD.
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
I also see use cases for Kubernetes etc. The problems start with "organisations who think they are Google" :wink:

The ones who realise they are not Google and simply want a managed LAMP/FAMP stack with regular updates and "everything and the kitchen sink included" are frequently our customers and rent jails for hosting their web applications.
Hehe, yup. Too true... Everyone just wants the next new buzzword and thinks they're the next big thing.

What pisses me off the most about Docker, Kubernetes and even modern programming languages is the newer generations using them. How often I've heard: "But why do I need to understand what's going on under the hood?"

You know, maybe because you want the software that's going to be running in a factory, hospital or whatever critical environment to be stable and predictable.

Anyway I digress... Sorry, and thanks for the enjoyable conversation.
 

nemesis1782

Contributor
Joined
Mar 2, 2021
Messages
105
Our devs use Docker on their Macs - secretly starting a hypervisor with a Linux VM to run those workloads, but who cares as long as it works?
Agreed, the convenience to the devs is measurable. Speaking from experience.
Our CTO supports the FreeBSD + jails approach of the hosting team and even said something along the lines "if you cannot develop a PHP thingy on Linux and then deploy on FreeBSD and just adapt, you have more fundamental problems". We have a set of Ansible plays for all the moving parts that cover the local Docker dev environment, customers who insist on running Linux, and our own infrastructure running FreeBSD.
True.
 