Housekeeping functionality in SCALE 22.02.0/RELEASE still missing, impacting perceived performance

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
Because my system, despite running the BETA2 version, has been in constant use for several long-running private projects, I could not upgrade earlier. Once one of those projects completed, I could dedicate a weekend to the BETA2 to RC2 update, and afterwards to RELEASE. Apart from hitting the "another update to RELEASE available" issue that has already been reported here on multiple occasions, I have been experiencing performance issues since the upgrades.
It appears that once the system is rebooted, the pool.import_on_boot job hangs at 80% for many hours. It does complete eventually, either because of some timeout set on the job (I can only guess there is a timer on some job executions) or because it genuinely finishes after roughly 26 hours. It took several more hours for some services to become available (the SMB service could not be started/was not running for nearly 33 hours, and the reporting pages are yet to become available/show charts/data). Furthermore, the GUI/WebUI has issues, also visible in shell actions on the server (in an SSH session), possibly stemming from:
  1. the number of docker images kept by default/design - the lack of automated pruning, which I've already raised here, seems to hurt the performance of ongoing docker image operations,
  2. the number of ix-applications sub-datasets/directories created by TrueNAS - these are presumably created for and by docker/kubernetes for all those images and related configs, and perhaps also by the middleware in its regular operations; regardless, there are a lot of them, and more importantly they amplify #3 below,
  3. the number of snapshots taken on the filesystems/directories/datasets from #2 - in my case I have 2 daily "auto*" snapshot tasks defined for non-empty snapshots, with 14 days of snapshot history on my datasets, so roughly 28 snapshots scheduled by me; then, by TrueNAS design, the boot pool gets snapshots on updates/upgrades, so say fewer than 10 more - a manageable amount. But when checking the actual number of snapshots I see a huge value of 47k, nearly all in the ix-applications dataset. I've run
    Code:
    zfs list -t snapshot -H -r -o name,creation,used -s name -s creation
    to obtain the output attached here as zfs-snapshots.txt (it took over 40 minutes to produce this output).
As shown in the attachment, there are thousands of snapshots for the ix-applications datasets, many of which report 0B of used space (presumably empty snapshots). My guess is that these 3 points cause the storage management tasks to take so long, so that a (re)boot + pool import takes hours or even days to complete.
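To see where the snapshots are concentrated without waiting 40 minutes for the full listing, the per-dataset counts can be derived from the names alone. A small sketch - the function below merely stands in for real `zfs list -H -t snapshot -o name -r <pool>` output, and the dataset names are invented:

```shell
# Stand-in for: zfs list -H -t snapshot -o name -r moria
zfs_snapshot_names() {
  cat <<'EOF'
moria/ix-applications/docker/aaa@s1
moria/ix-applications/docker/aaa@s2
moria/ix-applications/docker/bbb@s1
EOF
}

# Snapshot names have the form dataset@snap, so strip the @snap part
# and count how often each dataset occurs; busiest dataset comes first.
zfs_snapshot_names | cut -d@ -f1 | sort | uniq -c | sort -rn
```

On a real pool, pointing the pipeline at the actual `zfs list` output shows at a glance which ix-applications children hold the bulk of the 47k snapshots.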
I don't see any option to limit the maximum number of snapshots per (child) dataset in the edit-dataset pane for ix-applications, though I believe ZFS has a property for a maximum snapshot count. Exposing such an option in the GUI might help a bit/lot by allowing that property to be applied at dataset granularity (possibly an automated solution would also be needed to apply the same value to child datasets, as I believe it is not straightforwardly inheritable at the ZFS level).
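For reference, OpenZFS does expose a `snapshot_limit` property that caps the number of snapshots on a dataset and its descendants, which may be the property meant above. A sketch (pool name `moria` taken from the alert quoted later in this post; note that per the zfsprops documentation the limit is only enforced against users who lack the privilege to change it, so snapshots created by root or the middleware may bypass it):

```shell
# Cap snapshot count for ix-applications and all of its child datasets.
# NOTE: per zfsprops(7), snapshot_limit is enforced only for delegated,
# unprivileged operations - root/middleware snapshots may not be constrained.
zfs set snapshot_limit=512 moria/ix-applications

# Verify, including how the value applies to descendants:
zfs get -r snapshot_limit moria/ix-applications
```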
I also don't see any option to stop TrueNAS from creating empty snapshots (outside the Data Protection page - i.e. one that would deal with the ix-applications snapshots shown) or to have them removed/squashed on an interval basis. Since these might be referenced elsewhere or exposed as some other filesystem, it would be better either to stop creating empty snapshots in the first place, or to automatically clean up all preceding empty ones.
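One possible manual cleanup can be sketched as a dry run, under the assumption that 0B in the `used` column really does mean the snapshot holds nothing unique (it may still be referenced by a clone, so review before destroying anything): filter the snapshot listing for 0B entries and emit `zfs destroy` commands without running them. The sample input stands in for real `zfs list -H -t snapshot -r -o name,used` output, and the dataset names are made up:

```shell
# Turn "name used" lines into dry-run destroy commands for empty snapshots.
emit_destroys() {
  awk '$2 == "0B" { print "zfs destroy " $1 }'
}

# Sample input standing in for:
#   zfs list -H -t snapshot -r -o name,used moria/ix-applications
emit_destroys <<'EOF'
moria/ix-applications/docker/aaa@snap1 0B
moria/ix-applications/docker/aaa@snap2 12K
EOF
```

Review the emitted commands, and only pipe them to `sh` once you are satisfied they touch nothing you need.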
This might improve the performance and possibly stability of the platform.
So, in other words, more thorough housekeeping needs to be implemented for docker images and the ix-applications filesystems and snapshots if this is to be a release version.
Other areas that are possibly hit by same root cause, at least I think so, are these:
  1. the Dashboard -> Network pane reports "Error getting chart data" - I guess due to the import_on_boot job hanging or timing out; perhaps some lock or other impact on storage actions means the OS cannot get chart data - this might be related to point #5,
  2. the whole Storage page hangs/loads for hours, and in some cases nothing is reported - I guess due to the job/snapshots as well, even though when SSHed into the server I can check the status and see the pools through
    Code:
    zpool list
    long before the import job "ends",
  3. in Sharing my NFS shares are defined and the service reports as Running in the "UNIX (NFS) Shares" pane, yet when checking from my client machines I cannot see any shares reported by
    Code:
    showmount -e <server>
    for hours after a reboot, and SMB shares are lost for even longer - I could only start the SMB service 32 hours after the reboot; earlier it simply errored on start,
  4. the Apps -> Installed Applications page returns nothing - it simply "loads and loads"... until the UI session gets disconnected and I have to log in again and repeat the wait, for the first couple of hours after the reboot - I think this stems from the snapshot count and all those images, ix-applications datasets and their snapshots,
  5. all the Reporting subpages show nothing - neither CPU usage/load, nor disk temperatures and I/O, nor memory usage/swap, nor network traffic, nor NFS stats, nor system info, nor ZFS ARC info... the charts are simply empty, and it's been nearly 48 hours since the upgrade/reboot, so perhaps the storage of the sar-style statistics these charts use is still not available even after such a long period, maybe because of all those snapshots that might need to be processed.
Even the UI itself reports things like:
Code:
Dataset moria/ix-applications/docker/0cd2c2ed70f2e54d6a2f4211084e4a011e3ece44467a936089de2c98b8cceefa has more snapshots (1357) than recommended (512). Performance or functionality might degrade.
2022-03-19 18:19:51 (Europe/Warsaw)

as an alert, but there is no clear/easy GUI option linked to it for managing the situation, e.g. pruning non-needed/empty/old snapshots to fix the issue. Nor is there an option exposed in Applications -> Settings to enforce such snapshot limits, non-empty requirements, or cleanups.
So here's my question/request - do you happen to know a safe way to deal with the ix-applications snapshots problem? It might solve the perceived performance issues and improve pool import on boot. I'd rather do it safely using a solution provided by TrueNAS/TrueCharts experts - a GUI/WebUI solution in the main TrueNAS WebUI would be best - but if required I can try any shell solution once SSHed into the server.
 

Attachments

  • zfs-snapshots.txt
    7.1 MB

mkarwin
Just FYI, I've attempted using TrueTool to no avail:

Code:
MKNAS# pip install truetool
Collecting truetool
  Downloading truetool-3.0.3-py3-none-any.whl (6.1 kB)
Installing collected packages: truetool
Successfully installed truetool-3.0.3
MKNAS# truetool -l
Starting TrueCharts TrueTool...

Generating Backup list...

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/midcli/command/call_mixin/__init__.py", line 36, in call
    rv = c.call(name, *args, job=job, callback=self._job_callback)
  File "/usr/lib/python3/dist-packages/midcli/command/call_mixin/__init__.py", line 36, in call
    rv = c.call(name, *args, job=job, callback=self._job_callback)
  File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 458, in call
    raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/cli", line 12, in <module>
    sys.exit(main())
  File "/usr/lib/python3/dist-packages/midcli/__main__.py", line 267, in main
    cli.run()
  File "/usr/lib/python3/dist-packages/midcli/__main__.py", line 176, in run
    self.context.process_input(self.command)
  File "/usr/lib/python3/dist-packages/midcli/context.py", line 310, in process_input
    namespace = self.current_namespace.process_input(text)
  File "/usr/lib/python3/dist-packages/midcli/context.py", line 78, in process_input
    return i.process_input(rest)
  File "/usr/lib/python3/dist-packages/midcli/context.py", line 78, in process_input
    return i.process_input(rest)
  File "/usr/lib/python3/dist-packages/midcli/context.py", line 81, in process_input
    i.process_input(rest)
  File "/usr/lib/python3/dist-packages/midcli/command/common_syntax/command.py", line 24, in process_input
    self.run(args, kwargs, interactive)
  File "/usr/lib/python3/dist-packages/midcli/command/generic_call/__init__.py", line 132, in run
    self._run_with_args(args, kwargs)
  File "/usr/lib/python3/dist-packages/midcli/command/generic_call/__init__.py", line 170, in _run_with_args
    self.call(self.method["name"], *call_args, job=self.method["job"])
  File "/usr/lib/python3/dist-packages/midcli/command/call_mixin/__init__.py", line 47, in call
    if (error := self._handle_error(e)) is not None:
  File "/usr/lib/python3/dist-packages/midcli/command/call_mixin/__init__.py", line 77, in _handle_error
    return format_error(self.context, e)
  File "/usr/lib/python3/dist-packages/midcli/middleware.py", line 8, in format_error
    if e.trace["class"] == "CallError":
TypeError: 'NoneType' object is not subscriptable
MKNAS#

so this helper app runs into the same problems the WebUI faces, and thus won't be able to help with these issues.
 

mkarwin
I'm considering using either of these:
  1. the ZFS-rollup script, devised back in the good old FreeNAS days precisely for snapshot management, as per discussions here and kept on its author's github
  2. the ZFS-prune-snapshots script, which is less ingrained in the TrueNAS ecosystem, being a more general ZFS-based tool, also kept on its author's github
but I'd rather use something tested/prepared by the TrueNAS SCALE team or its users.
Just to be clear, this is not about my own snapshots of user-managed data datasets - those seem to be created and cleaned up properly by the periodic snapshot functionality shown on the Data Protection UI page. It's about all the snapshots created under the ix-applications datasets, i.e. for container/kubernetes and TrueNAS purposes... because thousands of these are created automatically but never cleaned up/managed afterwards over the server's lifetime, possibly impacting overall performance.
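In the same spirit as those scripts, the core of an age-based cleanup can be sketched as a dry run (my own sketch, not code from either script). Real input would come from `zfs list -Hp -t snapshot -r -o name,creation moria/ix-applications`, where `-p` prints the creation time as epoch seconds; the sample lines and dataset names below are made up:

```shell
# Emit dry-run destroy commands for snapshots older than 14 days.
cutoff=$(( $(date +%s) - 14*24*3600 ))

age_filter() {
  awk -v cutoff="$cutoff" '$2 < cutoff { print "zfs destroy " $1 }'
}

# Sample "name creation-epoch" input (the first timestamp is from 2001,
# so it is clearly past the cutoff; the second is "now").
age_filter <<EOF
moria/ix-applications/docker/aaa@old 1000000000
moria/ix-applications/docker/aaa@new $(date +%s)
EOF
```

As with any destroy, inspect the emitted commands before actually executing them against the pool.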
 

mkarwin
I've added a request/suggestion for this in JIRA
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I have the same concern as you. This is craziness and needs a managed solution.

The developers have many sharp pointy bits that they are working on. It's not craziness for the very first release of a new product to not be perfect. Vote for the Jira ticket and help make it a priority.
 

mkarwin
While it's not an official solution, some SCALE users have e.g. suggested running
Code:
docker image prune --all --force --filter "until=24h"
once in a while to "manually" clean the system of non-needed images, and some reported success in improving the situation just from running
Code:
docker container prune --force --filter "until=24h"
once in a while (though if you want to combine both, you need to prune containers first and then images, and then run
Code:
docker volume prune --force
because volume pruning does not support the until filter). So I went ahead with docker-specific commands...
I've used
Code:
docker system prune --all --force --volumes
(unfortunately, within this command I cannot combine cleaning volumes with the until filter) in a daily cron job (though I think I'll move it to something like a bi-weekly or monthly job). Sure, it's a workaround: it's rough around the edges, it doesn't allow combining volume and date filters, and most importantly it somewhat "breaks" the application's instant rollback capability (TrueNAS seems to connect an app's "previous entries" with their on-filesystem sets of snapshots/clones, which a docker prune removes). But I can live with that, as I usually test app (re)deployments between 16:00 and 24:00 and have the daily prune run set for 03:00.
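For reference, the daily 03:00 run can be a one-line root crontab entry (or a Cron Job created in the SCALE UI); the log path here is my own choice:

```shell
# m h dom mon dow  command
0 3 * * * /usr/bin/docker system prune --all --force --volumes >> /var/log/docker-prune.log 2>&1
```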
Of course, one could also chain the full stack of docker prunes for each subdomain - container, image, volume, network, build cache - using the best options/switches for each, to clean more safely and at finer granularity, but I went with the single basic command instead. Whether you prefer the single, though limited, command or a combo of domain-specific ones is up to you.
In effect:
  • all the stopped/finished containers have been removed, and app restarts for the remaining active containers/pods are back to their old snappiness - the overall reported container count dropped from 2k+ to roughly 100...
  • thousands upon thousands of snapshots, clones and filesystems/volumes have been removed along with the containers - I'm down from 46.6k to 0.6k snapshots, which in storage terms is nearly 100GB freed...
  • hundreds of dangling/older image versions have been deleted - I'm down from well over 2k to fewer than 20...
  • network routing to and through the cluster, and to and from the respective containers, has also improved...
  • the docker build caches have shrunk dramatically - down from over 3GB used to less than 1GB...
  • my CPU now reports under 10% at *idle* (idle meaning no active compute from running containers), with temps quickly dropping to 40C at idle - previously, even at *idle*, the CPU in my server hovered around 70% usage, with temps around a similar figure in degrees C...
Overall, I think I'll be a somewhat happier user with this primitive workaround until a proper, smarter approach is offered in TrueNAS SCALE (I'm thinking of something like an option to mark a container/pod as test, dev or prod - e.g. keep containers and snapshots for debug analysis, or prune them on container failure or after a few hours/daily for already-tested, presumably working PROD ones). Docker support's position is basically that this is by-design behaviour, to enable analysis of docker volumes/layers from failing/broken containers (which I honestly did and still do use a lot when building/testing my own docker images against my own docker repository), and that any maintenance is to be organised elsewhere. The TrueNAS team's position currently remains that this is a docker dogma issue (and they are right: it's how docker's developers designed it, more in line with ad hoc, quickly started/stopped apps than ones running for long periods) and that docker does not properly clean up after itself in the long run. So I think this will do until a better solution - periodic housekeeping of docker/kubernetes leftovers - is devised/offered in some future version of TrueNAS.
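The chained, per-domain variant mentioned earlier could look something like this (my own sketch; note that filter support differs per subcommand, e.g. `docker volume prune` does not accept the until filter):

```shell
# Prune in dependency order: containers first, then the images that become
# unreferenced, then volumes, networks and the build cache.
docker container prune --force --filter "until=24h"
docker image prune --all --force --filter "until=24h"
docker volume prune --force                 # no "until" filter available here
docker network prune --force --filter "until=24h"
docker builder prune --all --force --filter "until=24h"
```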
I've also replaced all the aliases for
Code:
docker run
with
Code:
docker run --rm
for any stable docker image/container for my users (to reduce the count of stopped/finished-but-lingering containers in docker's lists, and to reduce the chance of my users generating noise/trash from impromptu/ad hoc container runs), and kept the regular docker run command - which does not clean up after itself on failure - for the small subset of build/test deployments used for debugging.
Hopefully my set of workarounds will help others.
Please bear in mind that this workaround clears everything docker created for containers that are not currently running, so if you have an app that you start only now and then, it needs to be running when the prune command examines the containers; otherwise the containers/images/snapshots/volumes/networks docker created for it will get purged. I currently have only 2 docker images, with corresponding pods, that I build from scratch/source in my CI/CD pipeline for one-off, fairly short-running tasks and that are stopped most of the time; my other apps are constantly running/in use, so this solution works for me. But your mileage may vary...
 