It's back. I'm going to need some help here, as I believe the issue is TrueNAS.
lspci -nv (shows the 4 drives are in the same locations, but "Kernel driver in use" is missing for 06:00.0)
05:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 32
Memory at 92600000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
06:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: fast devsel, IRQ 47, NUMA node 0, IOMMU group 33
Memory at 92500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel modules: nvme
07:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 49, NUMA node 0, IOMMU group 34
Memory at 92400000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
08:00.0 0108: 144d:a80c (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0, IRQ 51, NUMA node 0, IOMMU group 35
Memory at 92300000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
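To spot the unbound drive without reading the whole listing, something like the following could filter the lspci output. This is only a sketch: the function name unbound_pci_devices is made up for illustration, and it assumes the usual `lspci -nv` layout, where each device starts with its address at the beginning of a line and NVMe controllers show class 0108.

```shell
# List NVMe controllers (PCI class 0108) from `lspci -nv` output that have
# no "Kernel driver in use" line. Reads lspci text on stdin and prints one
# PCI address per line. The function name is hypothetical.
unbound_pci_devices() {
  awk '
    # A device header starts with an address such as "06:00.0"
    # (optionally prefixed with a PCI domain, e.g. "0000:06:00.0").
    /^([0-9a-f]+:)?[0-9a-f]+:[0-9a-f]+\.[0-9a-f]/ {
      # Report the previous device if it was NVMe and had no driver bound.
      if (nvme && !bound) print addr
      addr = $1; bound = 0; nvme = ($2 == "0108:")
      next
    }
    /Kernel driver in use:/ { bound = 1 }
    END { if (nvme && !bound) print addr }
  '
}

# Usage: lspci -nv | unbound_pci_devices
```

Against the listing above this should print only 06:00.0, since that is the one NVMe device without a driver line.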
Inside the VM, the passthrough is still active for the Coral this time.
00:08.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a
It also appears the passthrough is still active for the GPU.
00:09.0 0300: 10de:1cb2 (rev a1) (prog-if 00 [VGA controller])
Subsystem: 10de:11bd
Checkmk logged 3 of the 4 drives dropping out 22 hours ago.
At the same time, Home Assistant stopped logging the CPU short-term load via the TrueNAS integration, but SNMP is still active.
Checkmk shows no SNMP errors going to the Dell server.
zpool status -v
pool: VM_NVME
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
Note that, going by the identification numbers, the removed drive is the same one that was removed previously. It's possible the drive is acting up, but given that Checkmk and Home Assistant also lost data from TrueNAS at the same time, I'm not sure I'm ready to draw that conclusion.
I will also note that CPU usage spiked when the drive dropped out, as captured in Checkmk. I'm wondering whether the CPU load got too high and it dropped a drive. Ignore the timestamp, as the system is on the wrong time scale.
Well, well, well... another dropout on Dec 30th.
Starting to wonder if it's kicking things out when CPU load is too high
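One way to test that hunch would be to log the load average next to the number of visible NVMe devices and see whether the count drops right after a spike. A minimal sketch, assuming a Linux host with /proc/loadavg and /sys/block; the function name and the log path in the usage comment are made up for illustration.

```shell
# Print one sample: UTC timestamp, 1-minute load average, and the number of
# NVMe namespaces visible as block devices. The function name is hypothetical.
# $1: path to a loadavg file (defaults to /proc/loadavg)
# $2: directory listing block devices (defaults to /sys/block)
sample_load_and_nvme() {
  loadavg_file="${1:-/proc/loadavg}"
  block_dir="${2:-/sys/block}"

  # The first field of loadavg is the 1-minute load average.
  load1="$(cut -d' ' -f1 "$loadavg_file")"

  # Count entries named like nvme0n1, nvme1n1, ...
  count=0
  for dev in "$block_dir"/nvme*n*; do
    if [ -e "$dev" ]; then
      count=$((count + 1))
    fi
  done

  printf '%s load1=%s nvme_count=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$load1" "$count"
}

# Usage (hypothetical log path), one sample per minute:
# while true; do sample_load_and_nvme >> /var/log/nvme_load.log; sleep 60; done
```

If nvme_count falls from 4 to 3 within a sample or two of a load spike, that would support the load theory. If it drops while the load is flat, the drive or slot looks more suspect.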