Hello TrueNAS community,
Specs first:
- 2 x GTX 1080 Ti
- Ryzen 9 5900X
- 64 GB RAM
- ASUS ROG Crosshair VIII Hero
TrueNAS-SCALE-22.02-MASTER-20211201-012921
I set up TrueNAS SCALE a week ago and have been tinkering with it since.
During the initial setup, I deployed the official Plex chart and assigned it one of my 1080 Tis.
This worked: transcoding jobs were handled by the GPU just fine.
Some time later, I also spun up a Windows VM and tried to assign it the other 1080 Ti.
This understandably did not work, since TrueNAS itself requires one of the GPUs and Plex had the other.
Since then, neither GPU is available to be assigned to any of the k3s charts. They still show up under System Settings -> Advanced -> Isolated GPU Devices.
They also show up if I try to create a new VM.
However, 0 GPUs are listed as available in Kubernetes:
Code:
Capacity:
  cpu:                24
  ephemeral-storage:  200329472Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65761324Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                24
  ephemeral-storage:  194880510209
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65761324Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 7a0b146b48a04e97b85ba66c86c573eb
  System UUID:                877bb289-7c3c-d63f-bea8-3c7c3fd6bea7
  Boot ID:                    d76aad10-ad90-4982-995f-b4fdfa6ad6c6
  Kernel Version:             5.10.81+truenas
  OS Image:                   Debian GNU/Linux 11 (bullseye)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.9
  Kubelet Version:            v1.21.0-k3s1
  Kube-Proxy Version:         v1.21.0-k3s1
PodCIDR:                      172.16.0.0/16
PodCIDRs:                     172.16.0.0/16
Non-terminated Pods:          (32 in total)
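For a quicker check than the full describe output, a jsonpath query along these lines should print just the schedulable GPU count (the node name ix-truenas comes from the pod description below):
Code:
# one-liner to read the allocatable GPU count off the node;
# given the capacity above, this currently prints 0
k3s kubectl get node ix-truenas -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'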
Checking the nvidia-device-plugin-daemonset pod:
Code:
root@odin[~]# k3s kubectl -n kube-system describe pod nvidia-device-plugin-daemonset-p2m6t
Name:                 nvidia-device-plugin-daemonset-p2m6t
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ix-truenas/10.8.30.20
Start Time:           Wed, 01 Dec 2021 10:13:28 +0100
Labels:               controller-revision-hash=586f5fbcf9
                      name=nvidia-device-plugin-ds
                      pod-template-generation=2
Annotations:          k8s.v1.cni.cncf.io/network-status: [{ "name": "ix-net", "interface": "eth0", "ips": [ "172.16.0.28" ], "mac": "86:ff:c1:23:71:95", "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status: [{ "name": "ix-net", "interface": "eth0", "ips": [ "172.16.0.28" ], "mac": "86:ff:c1:23:71:95", "default": true, "dns": {} }]
                      scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   172.16.0.28
IPs:
  IP:  172.16.0.28
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  docker://96dd301781c60d929c0faf0093a004dc77b727d801a68e53e303d9d6c475b862
    Image:         nvidia/k8s-device-plugin:v0.9.0
    Image ID:      docker-pullable://nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
    Port:          <none>
    Host Port:     <none>
    Args:
      --fail-on-init-error=false
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
      Exit Code:    128
      Started:      Wed, 01 Dec 2021 10:48:15 +0100
      Finished:     Wed, 01 Dec 2021 10:48:15 +0100
    Ready:          False
    Restart Count:  13
    Environment:
      DP_DISABLE_HEALTHCHECKS:  xids
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5qhn (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  kube-api-access-r5qhn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       39m                    default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-p2m6t to ix-truenas
  Normal   AddedInterface  38m                    multus             Add eth0 [172.16.0.28/16] from ix-net
  Normal   Created         34m (x6 over 38m)      kubelet            Created container nvidia-device-plugin-ctr
  Warning  Failed          34m (x6 over 37m)      kubelet            Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
  Normal   Pulled          33m (x7 over 38m)      kubelet            Container image "nvidia/k8s-device-plugin:v0.9.0" already present on machine
  Warning  BackOff         3m28s (x141 over 36m)  kubelet            Back-off restarting failed container
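In case it helps anyone else reading along: my understanding is that even once the driver loads again, the plugin pod still needs a restart before it re-registers the GPUs. A standard rollout restart of the daemonset should do that (untested here so far, since the driver is still down):
Code:
# restart the device plugin so it re-probes the driver after a fix;
# rollout restart is stock kubectl and works on the bundled k3s v1.21
k3s kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset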
Trying to run nvidia-smi:
Code:
root@odin[~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
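To confirm whether the kernel module is loaded at all, I believe the standard checks are lsmod and dmesg (assuming the stock SCALE driver uses the usual nvidia module name):
Code:
# is the nvidia kernel module loaded?
lsmod | grep nvidia
# any driver errors from boot or module load?
dmesg | grep -i nvidia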
lspci:
Code:
root@odin[~]# lspci | grep -i vga
0a:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
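Since isolating a GPU for passthrough normally rebinds it from the nvidia driver to vfio-pci, the next thing I plan to check is which kernel driver currently owns each card; lspci -nnk should show that (PCI addresses from the output above):
Code:
# "Kernel driver in use:" shows who owns each card:
# vfio-pci would mean it is still isolated for passthrough,
# nvidia would mean the host driver has it back
lspci -nnk -s 0a:00.0
lspci -nnk -s 0b:00.0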
As the output above suggests, the system is unable to load the NVIDIA driver. I came across another thread and attempted to install the driver manually with apt install nvidia-cuda-dev nvidia-cuda-toolkit, which almost bricked the system.
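Rather than touching apt again, my current idea is to clear the isolation list through the middleware and reboot. I believe the setting behind System Settings -> Advanced -> Isolated GPU Devices is the isolated_gpu_pci_ids field of system.advanced, but treat that field name as my assumption, not something I have verified:
Code:
# inspect the current advanced settings (should include the isolated GPU PCI IDs)
midclt call system.advanced.config
# clear the isolation list so both cards return to the host after a reboot
# (field name isolated_gpu_pci_ids is my assumption based on the UI option)
midclt call system.advanced.update '{"isolated_gpu_pci_ids": []}'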
Any suggestions or help would be highly appreciated!