Linux Jails - Experimental Script

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Thanks. I just suspected that:
"Typically "map" is the best choice, since it transparently maps UIDs/GIDs in memory as needed without modifying the image, and without requiring an expensive recursive adjustment operation. However, it is not available for all file systems, currently."
Right. The same as FreeBSD jails.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Just to note, there aren't any special considerations for container ID mapping into internal ZFS ACL code (e.g. zfs_zaccess*). I don't expect this to work for NFSv4 ACLs. It _may_ work for POSIX1E ACLs since I created some wrappers in that case to call generic_permission() (which is namespace-aware) in zfs_zaccess_trivial(), but that depends on whether the namespace being passed around internally in ZFS is the correct one for the particular container.

I personally haven't tested, which is kind of the rub here. You're going off into territory that I haven't tested / validated because we don't use the feature. YMMV, proceed carefully (especially if you're relying on this to provide security).

NOTE: I'm not currently looking at the code. This is recollection from the last time I worked on it.
 

Jip-Hop

Contributor
Joined
Apr 13, 2021
Messages
118
What I'd try is to create jail with this config:

Code:
systemd_nspawn_user_args=--private-users=6000:65536 --private-users-ownership=chown


Now the root user inside the jail with ID 0 should be mapped to user 6000 outside the jail. Thanks to --private-users-ownership=chown the ownership of the jail rootfs will be fixed during jail startup.

You don't need to create a user with ID 6000 in the TrueNAS interface, but of course you can do this if you like.

If you then need to bind mount a directory inside the jail, to which the root user inside the jail must have access, then you should manually recursively chown (once) all files inside the directory to be bind mounted into the jail (not the jail rootfs itself) to 6000. If you have a user with ID 1000 inside the jail which must access these files, then you should chown to 7000 instead.

I didn't test this, but this seems to me the easy way without ACLs.
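
To make the arithmetic above concrete, here is a minimal shell sketch. It assumes that --private-users=BASE:RANGE gives a simple contiguous mapping (jail ID + BASE = host ID). The function name host_id and the path /mnt/tank/media are hypothetical, and the chown command is only printed, not executed.

```shell
#!/bin/sh
# Sketch only: map a UID/GID inside the jail to the ID the host sees,
# assuming --private-users=BASE:RANGE gives a contiguous mapping.
host_id() {
    base=$1; range=$2; inside=$3
    if [ "$inside" -lt 0 ] || [ "$inside" -ge "$range" ]; then
        echo "ID $inside is outside the mapped range" >&2
        return 1
    fi
    echo $((base + inside))
}

# Root (0) and a regular user (1000) inside the jail, with 6000:65536.
host_id 6000 65536 0      # prints 6000
host_id 6000 65536 1000   # prints 7000

# Dry run: print (do not execute) the one-time chown for a bind-mounted
# directory. /mnt/tank/media is a hypothetical path.
uid=$(host_id 6000 65536 1000)
echo "chown -R $uid:$uid /mnt/tank/media"
```

Printing the command first makes it easy to double-check the target path and ID before changing ownership recursively.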
 

cap

Contributor
Joined
Mar 17, 2016
Messages
122
What I'd try is to create jail with this config:

Code:
systemd_nspawn_user_args=--private-users=6000:65536 --private-users-ownership=chown

I have tried something similar before but did not post it here. It does not work.
'--private-users-ownership=chown' only seems to work with the "pick" option.

I tried it again:

Code:
Starting jail test with the following command:

systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-test --working-directory=./jails/test '--description=My nspawn jail test [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=test --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --private-users=6000:65536 --private-users-ownership=chown

Job for jlmkr-test.service failed.

See "systemctl status jlmkr-test.service" and "journalctl -xeu jlmkr-test.service" for details.

Failed to start jail test...
In case of a config error, you may fix it with:
jlmkr edit test


Edit:
# journalctl -xeu jlmkr-test.service
[...]
Jan 05 18:36:58 truenas systemd-nspawn[281965]: Automatic UID/GID adjusting is only supported for UID/GID ranges starting at multiples of 2^16 with a range of 2^16.

Code:
# journalctl -xeu jlmkr-test.service
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Stopped target swap.target - Swaps.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Reached target umount.target - Unmount All Filesystems.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Stopped systemd-remount-fs.service …ount Root and Kernel File Systems.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Stopped systemd-tmpfiles-setup-dev.…reate Static Device Nodes in /dev.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Reached target shutdown.target - System Shutdown.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Reached target final.target - Late Shutdown Services.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Finished systemd-poweroff.service - System Power Off.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: [  OK  ] Reached target poweroff.target - System Power Off.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: Sending SIGTERM to remaining processes...
Jan 05 18:35:58 truenas systemd-nspawn[270949]: Sending SIGKILL to remaining processes...
Jan 05 18:35:58 truenas systemd-nspawn[270949]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
Jan 05 18:35:58 truenas systemd-nspawn[270949]: Powering off.
Jan 05 18:35:58 truenas systemd[1]: jlmkr-test.service: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit jlmkr-test.service has successfully entered the 'dead' state.
Jan 05 18:35:58 truenas systemd[1]: jlmkr-test.service: Consumed 1.612s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit jlmkr-test.service completed and consumed the indicated resources.
Jan 05 18:36:58 truenas systemd[1]: Starting jlmkr-test.service - My nspawn jail test [created with jailmaker]...
░░ Subject: A start job for unit jlmkr-test.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit jlmkr-test.service has begun execution.
░░
░░ The job identifier is 61303.
Jan 05 18:36:58 truenas systemd-nspawn[281965]: Automatic UID/GID adjusting is only supported for UID/GID ranges starting at multiples of 2^16 with a range of 2^16.
Jan 05 18:36:58 truenas systemd[1]: jlmkr-test.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit jlmkr-test.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Jan 05 18:36:58 truenas systemd[1]: jlmkr-test.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit jlmkr-test.service has entered the 'failed' state with result 'exit-code'.
Jan 05 18:36:58 truenas systemd[1]: Failed to start jlmkr-test.service - My nspawn jail test [created with jailmaker].
░░ Subject: A start job for unit jlmkr-test.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit jlmkr-test.service has finished with a failure.
░░
░░ The job identifier is 61303 and the job result is failed.
 

Jip-Hop

Contributor
Joined
Apr 13, 2021
Messages
118
Ah interesting. The documentation does clearly show that those two options can work together:
systemd-nspawn ... --private-users=0 --private-users-ownership=chown
But your error message is helpful. Try again with:

Code:
--private-users=65536:65536 --private-users-ownership=chown
 

cap

Contributor
Joined
Mar 17, 2016
Messages
122
Ah interesting. The documentation does clearly show that those two options can work together:

But your error message is helpful. Try again with:

Code:
--private-users=65536:65536 --private-users-ownership=chown
Seems to work:
Code:
# cat /proc/self/uid_map
         0      65536      65536
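
For anyone unsure how to read that line: the three columns of uid_map are the first ID inside the namespace, the first ID outside, and the length of the range. A small sketch that spells this out (describe_id_map is a made-up helper name, not part of jailmaker or systemd):

```shell
#!/bin/sh
# Sketch only: explain each line of a uid_map/gid_map file.
# Columns: first ID inside the namespace, first ID outside, range length.
describe_id_map() {
    while read -r inside outside count; do
        [ -n "$count" ] || continue   # skip blank or malformed lines
        echo "jail IDs $inside-$((inside + count - 1)) map to host IDs $outside-$((outside + count - 1))"
    done
}

# Fed with the line shown above:
echo "         0      65536      65536" | describe_id_map
# Inside a running jail you would instead run:
#   describe_id_map < /proc/self/uid_map
```

For the line above this prints "jail IDs 0-65535 map to host IDs 65536-131071".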


Code:
# ls -l
total 88
lrwxrwxrwx   1 root   root      7 Dec 27 06:25 bin -> usr/bin
drwxr-xr-x   2 root   root      2 Dec  9 22:08 boot
drwxr-xr-x   8 root   root    460 Jan  5 18:56 dev
drwxr-xr-x  45 root   root     99 Jan  5 18:56 etc
drwxr-xr-x   2 root   root      2 Dec  9 22:08 home
lrwxrwxrwx   1 root   root      7 Dec 27 06:25 lib -> usr/lib
lrwxrwxrwx   1 root   root      9 Dec 27 06:25 lib32 -> usr/lib32
lrwxrwxrwx   1 root   root      9 Dec 27 06:25 lib64 -> usr/lib64
lrwxrwxrwx   1 root   root     10 Dec 27 06:25 libx32 -> usr/libx32
drwxr-xr-x   2 root   root      2 Dec 27 06:25 media
drwxr-xr-x   2 root   root      2 Dec 27 06:25 mnt
drwxr-xr-x   2 root   root      2 Dec 27 06:25 opt
dr-xr-xr-x 503 nobody nogroup   0 Jan  5 18:56 proc
drwx------   3 root   root      5 Dec 27 06:25 root
drwxr-xr-x  12 root   root    340 Jan  5 18:57 run
lrwxrwxrwx   1 root   root      8 Dec 27 06:25 sbin -> usr/sbin
drwxr-xr-x   2 root   root      2 Dec 27 06:25 srv
dr-xr-xr-x  13 nobody nogroup   0 Dec 31 16:59 sys
drwxrwxrwt   8 root   root    160 Jan  5 18:56 tmp
drwxr-xr-x  14 root   root     14 Dec 27 06:25 usr
drwxr-xr-x  11 root   root     13 Dec 27 06:25 var


Edit:
Even "systemd_nspawn_user_args=--private-users=131072:65536 --private-users-ownership=chown" seems to work.
131072 is a multiple of 2^16 (2^16 = 65536, and 131072 = 2 × 65536).
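
Going only by the error message quoted earlier ("ranges starting at multiples of 2^16 with a range of 2^16"), a quick pre-check of a BASE:RANGE value could look like the sketch below. chown_mode_ok is a hypothetical helper; it mirrors the wording of the message and is not taken from the systemd source.

```shell
#!/bin/sh
# Sketch only: mirror the rule from the systemd-nspawn error message.
# chown mode needs BASE to be a multiple of 2^16 and RANGE to equal 2^16.
chown_mode_ok() {
    base=$1; range=$2
    [ "$range" -eq 65536 ] && [ $((base % 65536)) -eq 0 ]
}

for spec in 6000:65536 65536:65536 131072:65536; do
    if chown_mode_ok "${spec%%:*}" "${spec##*:}"; then
        echo "$spec: accepted"
    else
        echo "$spec: rejected"
    fi
done
```

This reports 6000:65536 as rejected and the other two as accepted, which matches what was observed in this thread.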

Edit:
From the top of the code block:
[...]
dr-xr-xr-x 503 nobody nogroup 0 Jan 5 18:56 proc
[...]
dr-xr-xr-x 13 nobody nogroup 0 Dec 31 16:59 sys
 

Jip-Hop

Contributor
Joined
Apr 13, 2021
Messages
118
I think those proc and sys directories showing as nobody nogroup is normal. A quick google search:
 

cap

Contributor
Joined
Mar 17, 2016
Messages
122
I just wanted to install Docker and it doesn't work.
I think I'll give up for now and use Jailmaker as normal.
 

Jip-Hop

Contributor
Joined
Apr 13, 2021
Messages
118
Not sure what you've tried exactly but running docker inside a rootless jail seems to work for me.

Config file has these required settings:
Code:
docker_compatible=1
[...]
systemd_nspawn_user_args=--network-bridge=br1 --resolv-conf=bind-host --private-users=65536:65536 --private-users-ownership=chown


Then inside the jail I do:
Code:
cd /tmp/
apt update
apt install curl
curl -fsSL https://get.docker.com -o install-docker.sh
sudo sh install-docker.sh
docker run --name nginx -p 80:80 nginx


And I can access the nginx welcome page on port 80 on the IP assigned by DHCP to my rootless jail.

According to the GitHub issue "nspawn does not add capabilities with --private-users=pick", you can't use host networking for this purpose, which is why I recommend using --network-bridge.
 

cap

Contributor
Joined
Mar 17, 2016
Messages
122
I don't think you need ACLs for this basic scenario. I have removed all ACLs from my datasets and just use regular Unix filesystem permissions. Have zero trouble using these files inside jailmaker jails. I can even share these datasets simultaneously via SMB with ACL disabled.
I would like to stay with the ACLs for now.
I can access a bind mount from a jail without any problems, regardless of whether the jail is privileged or unprivileged (a corresponding user must be created in Scale for an unprivileged jail).

There are problems with the bind mounts from a Docker container.

I have created a user with UID 3050 in TrueNAS Scale. In the browser I can't get a connection to Jellyfin (or Emby) when I set PUID = 3050 and PGID = 3050 for their Docker containers. As a test, I installed WPS Office in Docker. WPS Office works with PUID = 3050 and PGID = 3050, and I can access the bind mount. But I haven't tried this with an unprivileged jail yet.
Is this perhaps a Jellyfin (Emby) specific problem?
If I do not change the PUID and PGID (in the template used, PUID = 1024 and PGID = 100), Jellyfin works. But Jellyfin itself doesn't seem to be able to look into the bind mounts (even though you can add them in Jellyfin).

I also found the following:

Mastering Systemd-Nspawn: How to Effectively Bind a Folder with IDMap on ZFS


Edit:
"I don't think you need ACLs for this basic scenario. I have removed all ACLs from my datasets and just use regular Unix filesystem permissions. Have zero trouble using these files inside jailmaker jails. I can even share these datasets simultaneously via SMB with ACL disabled."

OK, this seems to solve the above mentioned Jellyfin problems in a normal jailmaker jail.
 

snicke

Explorer
Joined
May 5, 2015
Messages
74
Edit:
"I don't think you need ACLs for this basic scenario. I have removed all ACLs from my datasets and just use regular Unix filesystem permissions. Have zero trouble using these files inside jailmaker jails. I can even share these datasets simultaneously via SMB with ACL disabled."

OK, this seems to solve the above mentioned Jellyfin problems in a normal jailmaker jail.

For the bind mounts to work in a rootless "jail", I think you need to create a UID in TrueNAS SCALE with ID (x + 65536), if you used
--private-users=65536:65536 --private-users-ownership=chown in the jail config, where x is the corresponding UID in the jail.

I.e. a user with UID 3050 in a jail I think should have UID (3050 + 65536) = 68586 in TrueNAS SCALE for a working bind mount.
 

cap

Contributor
Joined
Mar 17, 2016
Messages
122
For the bind mounts to work in a rootless "jail", I think you need to create a UID in TrueNAS SCALE with ID (x + 65536), if you used
--private-users=65536:65536 --private-users-ownership=chown in the jail config, where x is the corresponding UID in the jail.

I.e. a user with UID 3050 in a jail I think should have UID (3050 + 65536) = 68586 in TrueNAS SCALE for a working bind mount.
Does not work.
"Automatic UID/GID adjusting is only supported for UID/GID ranges starting at multiples of 2^16 with a range of 2^16."
68586 != multiples of 2^16
 

snicke

Explorer
Joined
May 5, 2015
Messages
74
Does not work.
"Automatic UID/GID adjusting is only supported for UID/GID ranges starting at multiples of 2^16 with a range of 2^16."
68586 != multiples of 2^16
No, but 65536 = multiples of 2^16 and that is the key here. I.e. UID 0 in the "jail" maps to UID 65536 in TrueNAS SCALE and hence e.g. UID 3050 in the "jail" maps to UID 68586 in TrueNAS SCALE etc.
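
The same arithmetic can be run in the other direction, which helps when checking who owns a file on the host. A minimal sketch, again assuming a contiguous BASE:RANGE mapping (jail_id is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch only: given a host UID/GID, print the ID seen inside the jail,
# or "unmapped" when the host ID falls outside BASE..BASE+RANGE-1.
jail_id() {
    base=$1; range=$2; host=$3
    if [ "$host" -ge "$base" ] && [ "$host" -lt $((base + range)) ]; then
        echo $((host - base))
    else
        echo "unmapped"
    fi
}

jail_id 65536 65536 68586   # prints 3050
jail_id 65536 65536 65536   # prints 0
jail_id 65536 65536 3050    # prints unmapped
```

Host IDs that come out as unmapped typically show up as nobody/nogroup inside the jail.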
 

cap

Contributor
Joined
Mar 17, 2016
Messages
122
No, but 65536 = multiples of 2^16 and that is the key here. I.e. UID 0 in the "jail" maps to UID 65536 in TrueNAS SCALE and hence e.g. UID 3050 in the "jail" maps to UID 68586 in TrueNAS SCALE etc.
This works with ACLs (NFSv4 permissions)!!!
For POSIX ACLs (Unix Permissions) this is not necessary.

Note:
- I had mixed up directories earlier.
- The browser cache once displayed something old in Jellyfin/Emby, which is why I couldn't connect.
 

snicke

Explorer
Joined
May 5, 2015
Messages
74
Indeed great initiative here @Jip-Hop with your very promising script. Thanks for your efforts so far.

I'm about to migrate from TrueNAS Core, where I've been using jails for a long time, to TrueNAS Scale, now that iXsystems seems to be slowly abandoning Core. I've truly enjoyed the versatile and powerful, yet lightweight, jails on Core, but I am now migrating jail after jail to Docker on my Linux laptop as a proof of concept before upgrading the server itself from Core to Scale.

Kubernetes is for sure an interesting technology and I wouldn't mind learning it to benefit from it professionally later on, but for a home NAS with only a handful of users it feels like overkill. K8s seems to shine when you want to scale things for hundreds or thousands of users, which is surely a use case for TrueNAS Enterprise users. But for the TrueNAS community users my feeling is that a neat way to run Docker is much more straightforward and useful.

I understand that iXsystems needs to focus on its business customers, but I'm sure they also understand their crucial symbiosis with the community over the years in bringing the product to where it is today. Hence I think it's also very important for iXsystems to focus on the community's needs in combination with the business customers' needs. And the community need is clear: an easy and lightweight way to run Docker containers without risking the TrueNAS system integrity.

I really hope that iXsystems (ping @morganL, @Kris Moore et al.) will start to acknowledge that, support your efforts here @Jip-Hop, and incorporate your script into core (pun intended) TrueNAS Scale. This "iocage"-like way of bringing "jails" to TrueNAS Scale, thanks to your script initiative here, feels like what should have been done by iXsystems from the very beginning, considering all the users who are familiar with the concept from years with TrueNAS (and FreeNAS) Core. This could run in parallel with the K8s path for the Enterprise users.

I'm a bit worried about the security part though. My feeling is that the way forward with this initiative is to harden that part to make it easier to argue that it should be included as a part of TrueNAS Scale itself. The most obvious start I think would be to ensure that the "jails" are not run as root by default. Hence make things like
Code:
--private-users=65536:65536 --private-users-ownership=chown
a default setting in the script and make sure that it runs smoothly and becomes well documented. Maybe keep running as root as an option, but not as the default. What do you think about this idea?
 

Jip-Hop

Contributor
Joined
Apr 13, 2021
Messages
118
So far only @cap has reported trying out jailmaker with rootless jails. So before considering making this the default, I think it requires more testing and documentation.

I see these possible complications:

- docker info inside a rootless jail shows "Native Overlay Diff: false" (not sure if this is normal for docker in a rootless container...)
- can't use host networking when running docker in rootless jail (have to resort to macvlan or bridge networking, which requires specific hardware or additional setup)
- can't bind mount directories from available ZFS datasets as-is (have to chown first or possibly alter ACLs, assuming existing files are not already owned by UIDs in the range from 65536 to 131071).
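
For the last point, a rough way to see which files would need a chown first is to list everything whose owner falls outside the mapped range. This is an untested sketch that assumes GNU find and only looks at UIDs; list_unmapped and the dataset path are hypothetical.

```shell
#!/bin/sh
# Sketch only: list entries under DIR whose owning UID is outside
# BASE..BASE+RANGE-1, i.e. files a rootless jail would see as unmapped.
# Only UIDs are checked here; GIDs would need the same test with -gid.
list_unmapped() {
    dir=$1; base=$2; range=$3
    last=$((base + range - 1))
    # find's numeric tests: -N means "less than N", +N means "greater than N".
    find "$dir" \( -uid "-$base" -o -uid "+$last" \) -print
}

# Hypothetical dataset path; adjust before running.
# list_unmapped /mnt/tank/media 65536 65536
```

An empty result would suggest the directory can be bind mounted without a chown, at least as far as plain Unix ownership is concerned; ACLs are not considered here.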

I always welcome additional testing or (documentation) contributions!
 