djjudas21
Cadet
- Joined
- Oct 21, 2022
- Messages
- 5
Hi all. I'm using a TrueNAS box to provide storage for my homelab Kubernetes cluster. I'm seeing wildly variable performance, especially on iSCSI, which is causing problems for my apps, and I don't know how to troubleshoot storage.
TrueNAS setup
The NAS hardware is modest but adequate: an HP MicroServer N40L with 16GB RAM, currently running TrueNAS 13.0-U2, though I also saw this behaviour on various versions of 12. There are 2 pools, each with a single vdev:
- 4x3TB HDD in RAIDZ2, served as NFS
- 2x256GB SSD in Mirror, served as iSCSI
Kubernetes cluster
The Kube cluster is 3 physical nodes, quad core i5, 16GB RAM, running Ubuntu 22 LTS with MicroK8s. All nodes are on the same gigabit LAN as the NAS.
I'm using Democratic CSI to access the storage pools as Kubernetes StorageClasses:
- freenas-nfs which provisions PersistentVolumes on the HDD NFS pool as ReadWriteMany file storage
- freenas-iscsi which provisions PersistentVolumes on the SSD iSCSI pool as ReadWriteOnce block storage
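As a sanity check on the wiring, the provisioner behind each class can be confirmed from any machine with cluster access. A rough sketch (class names as above; this is a standard `kubectl` feature, not specific to democratic-csi):

```shell
# Show which provisioner backs each StorageClass, plus reclaim and
# volume-binding behaviour, in one table.
kubectl get storageclass freenas-nfs freenas-iscsi \
  -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,RECLAIM:.reclaimPolicy

# Full parameter dump for the iSCSI class (target portal, zvol settings, etc.)
kubectl describe storageclass freenas-iscsi
```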
Workloads
This arrangement has been working great for me for ages. Most of my PersistentVolumes are on NFS, performance is not amazing but perfectly fine for my needs. The only PersistentVolumes on iSCSI are databases. I have about 12 PostgreSQL and MariaDB instances but none of them are too busy.
However, I've noticed that the databases sometimes get unhappy because write latency spikes to multiple seconds. I don't see any actual errors from Kubernetes or TrueNAS, just my MariaDB pods restarting because of I/O trouble. So I ran some benchmarks to get some actual numbers on what's going on.
Benchmarks
I used Kubestr to run the benchmarks. It's just a Kubernetes wrapper around fio: it creates a PersistentVolume on a StorageClass of your choosing, then runs fio against it in a pod.
Benchmark for NFS:
Code:
[jonathan@poseidon ~]$ ./kubestr fio -s freenas-nfs --size 10Gi
PVC created kubestr-fio-pvc-f5pfw
Pod created kubestr-fio-pod-hdnkj
Running FIO test (default-fio) on StorageClass (freenas-nfs) with a PVC of Size (10Gi)
Elapsed time- 1m20.723691883s
FIO test results:

FIO version - fio-3.30
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=54.494152 BW(KiB/s)=232
  iops: min=11 max=111 avg=56.437500
  bw(KiB/s): min=47 max=445 avg=227.031250

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=51.373466 BW(KiB/s)=220
  iops: min=20 max=99 avg=53.406250
  bw(KiB/s): min=80 max=397 avg=214.968750

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=54.779305 BW(KiB/s)=7483
  iops: min=20 max=109 avg=57.031250
  bw(KiB/s): min=2569 max=13996 avg=7352.906250

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=50.093323 BW(KiB/s)=6897
  iops: min=1 max=111 avg=52.451614
  bw(KiB/s): min=246 max=14278 avg=6767.387207

Disk stats (read/write):
  -  OK
Benchmark for iSCSI:
Code:
[jonathan@poseidon ~]$ ./kubestr fio -s freenas-iscsi --size 10Gi
PVC created kubestr-fio-pvc-b7jcd
Pod created kubestr-fio-pod-nnfjk
Running FIO test (default-fio) on StorageClass (freenas-iscsi) with a PVC of Size (10Gi)
Elapsed time- 15m44.950939461s
FIO test results:

FIO version - fio-3.30
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=218.332718 BW(KiB/s)=886
  iops: min=7 max=793 avg=274.620697
  bw(KiB/s): min=31 max=3175 avg=1100.310303

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=69.734314 BW(KiB/s)=287
  iops: min=1 max=480 avg=124.833336
  bw(KiB/s): min=7 max=1920 avg=501.333344

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=184.131027 BW(KiB/s)=23996
  iops: min=13 max=628 avg=216.193542
  bw(KiB/s): min=1781 max=80384 avg=27733.935547

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=78.984238 BW(KiB/s)=10392
  iops: min=3 max=414 avg=125.400002
  bw(KiB/s): min=508 max=52992 avg=16128.542969

Disk stats (read/write):
  sdl: ios=9948/5599 merge=32/29 ticks=1701348/2836821 in_queue=4567574, util=96.600540%
  -  OK
The average IOPS and bandwidth look fine for both NFS and iSCSI, but look at the minimum values: some of the iSCSI jobs bottom out at 1 IOPS, and I suspect the real figure is below 1 and has been rounded up. Also check the elapsed time: just over a minute for NFS, but almost 16 minutes for iSCSI. Something is clearly blocking the iSCSI I/O intermittently.
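To spot the stalled jobs without eyeballing the whole report, the per-job minimums can be pulled out of a saved kubestr report with a little awk. A rough sketch, with two sample lines inlined from the iSCSI run above (the threshold of 10 IOPS is arbitrary):

```shell
# Flag any fio job whose minimum IOPS dropped below 10.
# Sample lines inlined from the kubestr report for illustration;
# in practice, pipe in the saved report instead.
report='JobName: write_iops
iops: min=1 max=480 avg=124.833336
JobName: read_bw
iops: min=13 max=628 avg=216.193542'

printf '%s\n' "$report" | awk '
  /JobName:/ { job = $2 }
  /iops: min=/ {
    split($2, kv, "=")              # kv[2] holds the min value
    if (kv[2] + 0 < 10) print job " stalled: min IOPS " kv[2]
  }'
```

On this sample it prints only the write_iops job, since its minimum dropped to 1.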
Troubleshooting
Kubernetes isn't emitting any relevant logs. The systemd journal on the kube nodes shows the iSCSI volumes being mounted successfully but logs nothing useful after that. The TrueNAS dashboard shows nothing useful either, other than confirming that CPU usage is quite low and the network interface is nowhere near maxed out. The pools are healthy.
I haven't got any cache/metadata devices set up. I could experiment with this, but the lack of a cache alone doesn't explain the high latency: my databases only do tens of queries per second, and even a single consumer-grade SSD or HDD would have no trouble keeping up with that. I know this hardware is not high spec, but it's also not maxed out, and I can't find an explanation for the intermittently terrible performance.
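Since the dashboard graphs average away short stalls, one thing I plan to try is watching per-disk latency on the NAS itself while a benchmark runs. A sketch of the commands (the SSD pool name here, "ssd", is an assumption; substitute the real pool name):

```shell
# On the TrueNAS shell, while a kubestr/fio run is in progress:
zpool iostat -v -l ssd 5    # per-vdev read/write latencies, 5 s samples
zpool iostat -w ssd 60      # latency histograms over a 60 s window
gstat -p                    # FreeBSD: per-disk busy % and ms per write
```

If a stall shows up as one SSD's write latency spiking while the other stays idle, that points at a drive; if both spike together, it points at something above the disks (sync write handling, the iSCSI target, or the zvol itself).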
I'm going to set up a syslog server so I can capture log messages, but other than that, can anyone suggest additional troubleshooting steps?