Highly variable latency causing problems with Kubernetes

djjudas21

Cadet
Joined
Oct 21, 2022
Messages
5
Hi all. I'm using a TrueNAS box to provide storage for my homelab Kubernetes cluster. I'm seeing wildly variable performance, especially on iSCSI, which is causing problems for my apps, and I don't know how to troubleshoot storage.

TrueNAS setup


The NAS hardware is modest but adequate. I'm using an HP MicroServer N40L with 16GB of RAM. It's currently running TrueNAS 13.0-U2, but this was also observed on various versions of 12. There are 2 pools, each with a single vdev:
  • 4x3TB HDD in RAIDZ2, served as NFS
  • 2x256GB SSD in Mirror, served as iSCSI

Kubernetes cluster


The Kube cluster is 3 physical nodes, quad core i5, 16GB RAM, running Ubuntu 22 LTS with MicroK8s. All nodes are on the same gigabit LAN as the NAS.

I'm using Democratic CSI to access the storage pools as Kubernetes StorageClasses:
  • freenas-nfs which provisions PersistentVolumes on the HDD NFS pool as ReadWriteMany file storage
  • freenas-iscsi which provisions PersistentVolumes on the SSD iSCSI pool as ReadWriteOnce block storage
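
For reference, claiming storage from either class is just an ordinary PVC. Something like this is roughly what each database volume looks like (the name and size here are made up for illustration):
Code:
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-db-data          # hypothetical name, for illustration only
spec:
  accessModes:
    - ReadWriteOnce              # block storage via the freenas-iscsi class
  storageClassName: freenas-iscsi
  resources:
    requests:
      storage: 8Gi               # size for illustration
EOF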

Workloads


This arrangement has been working great for me for ages. Most of my PersistentVolumes are on NFS; performance is not amazing, but it's perfectly fine for my needs. The only PersistentVolumes on iSCSI are databases. I have about 12 PostgreSQL and MariaDB instances, but none of them are particularly busy.

However, I've noticed that sometimes the databases get unhappy because write latency gets too high (i.e. multiple seconds). I don't see any actual errors from Kubernetes or TrueNAS, just that my MariaDB pods restart because of I/O trouble. So I ran some benchmarks to try and see what's going on and get some actual numbers.

Benchmarks


I used Kubestr to run the benchmarks. It's a Kubernetes wrapper around fio: it creates a PersistentVolume on a StorageClass of your choosing, then runs fio against it in a pod.
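
For anyone who wants to reproduce this outside Kubernetes, the global options and job parameters in the output below can be fed straight to fio. Something like this is approximately the write_iops job, run against any mounted test directory (the path and runtime are just examples):
Code:
# roughly Kubestr's write_iops job, pointed at an arbitrary test directory
fio --name=write_iops --directory=/mnt/test \
    --ioengine=libaio --direct=1 --verify=0 --gtod_reduce=1 \
    --bs=4K --size=2G --iodepth=64 --rw=randwrite \
    --time_based --runtime=60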

Benchmark for NFS:
Code:
[jonathan@poseidon ~]$ ./kubestr fio -s freenas-nfs --size 10Gi
PVC created kubestr-fio-pvc-f5pfw
Pod created kubestr-fio-pod-hdnkj
Running FIO test (default-fio) on StorageClass (freenas-nfs) with a PVC of Size (10Gi)
Elapsed time- 1m20.723691883s
FIO test results:
 
FIO version - fio-3.30
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=54.494152 BW(KiB/s)=232
  iops: min=11 max=111 avg=56.437500
  bw(KiB/s): min=47 max=445 avg=227.031250

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=51.373466 BW(KiB/s)=220
  iops: min=20 max=99 avg=53.406250
  bw(KiB/s): min=80 max=397 avg=214.968750

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=54.779305 BW(KiB/s)=7483
  iops: min=20 max=109 avg=57.031250
  bw(KiB/s): min=2569 max=13996 avg=7352.906250

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=50.093323 BW(KiB/s)=6897
  iops: min=1 max=111 avg=52.451614
  bw(KiB/s): min=246 max=14278 avg=6767.387207

Disk stats (read/write):
  -  OK


Benchmark for iSCSI:
Code:
[jonathan@poseidon ~]$ ./kubestr fio -s freenas-iscsi --size 10Gi
PVC created kubestr-fio-pvc-b7jcd
Pod created kubestr-fio-pod-nnfjk
Running FIO test (default-fio) on StorageClass (freenas-iscsi) with a PVC of Size (10Gi)
Elapsed time- 15m44.950939461s
FIO test results:
  
FIO version - fio-3.30
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=218.332718 BW(KiB/s)=886
  iops: min=7 max=793 avg=274.620697
  bw(KiB/s): min=31 max=3175 avg=1100.310303

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=69.734314 BW(KiB/s)=287
  iops: min=1 max=480 avg=124.833336
  bw(KiB/s): min=7 max=1920 avg=501.333344

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=184.131027 BW(KiB/s)=23996
  iops: min=13 max=628 avg=216.193542
  bw(KiB/s): min=1781 max=80384 avg=27733.935547

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=78.984238 BW(KiB/s)=10392
  iops: min=3 max=414 avg=125.400002
  bw(KiB/s): min=508 max=52992 avg=16128.542969

Disk stats (read/write):
  sdl: ios=9948/5599 merge=32/29 ticks=1701348/2836821 in_queue=4567574, util=96.600540%
  -  OK


The average IOPS and bandwidth seem fine for both NFS and iSCSI, but look at the minimum values. Some of the iSCSI tests show 1 IOPS, but I believe the real number is less than 1 and has been rounded. Check out the time elapsed for the tests - just over 1 minute for NFS but 15 minutes for iSCSI! So it is clearly getting blocked sometimes.

Troubleshooting


Kubernetes isn't emitting any relevant logs. Systemctl on the kube nodes shows iSCSI volumes being mounted successfully but doesn't log anything useful after that. TrueNAS doesn't show anything useful in its dashboard, other than to confirm that CPU usage is quite low and the network interface isn't anywhere near maxed out. The pools are healthy.
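
The next thing I plan to try is watching latency live from the TrueNAS shell while the problem is happening, along these lines (assuming the pools are named hdd and ssd to match their mountpoints):
Code:
# per-vdev throughput and queueing for the SSD pool, refreshed every second
zpool iostat -v ssd 1

# FreeBSD per-disk latency and %busy, updated live
gstat -p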

I haven't got any caching/metadata devices set up. I could experiment with this, but the lack of a cache alone doesn't explain the poor performance and high latency. My databases are only doing tens of queries per second and even a single consumer grade SSD or HDD would have no problems keeping up with that. Obviously I know this hardware is not high spec, but it's also not maxed out and I can't find an explanation for the intermittent terrible performance.

I'm going to set up a Syslog server so I can observe log messages, but other than this, can anyone suggest any additional troubleshooting steps?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The NAS hardware is modest but adequate.

Adequate? Quite possibly not. It's hard to know exactly what's going on here, but iSCSI isn't recommended until you have 64GB of RAM or more, and if you're running multiple pools, that really means "64GB of RAM for the iSCSI pool". Some of the reasoning for this is described at


and in your particular case some of that may not be as relevant if you're doing pure database storage, but it is certainly not dismissible in the offhand way quoted. Additional information is available at


With only 256GB SSD for the iSCSI datastore, I also have the uncomfortable feeling that you may have used the vast majority of this. What's the actual size of the iSCSI device you're sharing out? How full is this?

One of the things that happens with SSDs is that, because of the immense amount of fragmentation and the way SSD management works internally, ZFS can easily run an SSD's free page pool out of new blocks (I suggest the 2012 Ars Technica article as a primer if this doesn't make sense to you). The mistake users most commonly make with ZFS is trying to use all of their space, when in reality you should probably hold utilization down to less than 50% if you want the best chance at performance.
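
A quick way to sanity-check both of those from the shell:
Code:
# pool capacity and fragmentation (CAP and FRAG columns), per vdev
zpool list -v

# how much space the iSCSI zvols actually reference
zfs list -t volume -o space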
 

djjudas21

Cadet
Joined
Oct 21, 2022
Messages
5
Thanks for the info @jgreco. All of my databases are small (Wordpress blogs and that kind of thing): each database PersistentVolume is 4-8Gi, the SSD/iSCSI datastore is much less than 50% allocated, and even less than that is actually used (most of the databases are only tens of MB in size).

Typically my TrueNAS instance uses about 4GB memory for "services" and about 12GB for the ZFS cache, leaving a few hundred MB free.

I can't understand the root problem though. The benchmarks show that my iSCSI volumes *can* run at a healthy rate of IOPS and bandwidth most of the time, but then suddenly take multiple seconds to handle an SQL insert. I wouldn't expect such variation if it was just that my NAS box was underspecced for the workload.

I know running databases on NFS is usually an anti-pattern, but at my small scale, do you think I'd be likely to get better performance if I moved my database volumes to NFS, freed up the SSDs, and added them to the NFS pool as cache/metadata devices etc?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I can't understand the root problem though. The benchmarks show that my iSCSI volumes *can* run at a healthy rate of IOPS and bandwidth most of the time, but then suddenly take multiple seconds to handle an SQL insert. I wouldn't expect such variation if it was just that my NAS box was underspecced for the workload.

Well, that's what I was taking a shot at. My offhand guess is/was that SSDs can run out of their free page pools, at which point every write involves an erasure cycle along with remappings, which can be a very inefficient process. Worse, ZFS has a tendency to need to write multiple things per block storage I/O (such as metadata) -- you may think that you're just writing a 4KB sector, but in reality this may involve a bunch of read I/O of metadata to find a "free" sector on the SSD (which is one of the reasons we advise GOBS of RAM for iSCSI), then a write of a new block (because ZFS is a CoW system), then writes of the supporting metadata. Especially if you are using consumer-grade SSDs, they may simply not be able to cope with this abuse.
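
If free-page exhaustion is what's going on, it's also worth checking whether freed blocks are actually being TRIMmed back to the SSDs; something like this, assuming the pool is called ssd:
Code:
# autotrim is off by default; check whether it's enabled on the SSD pool
zpool get autotrim ssd

# kick off a manual trim and watch its progress
zpool trim ssd
zpool status -t ssd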

I know running databases on NFS is usually an anti-pattern, but at my small scale, do you think I'd be likely to get better performance if I moved my database volumes to NFS, freed up the SSDs, and added them to the NFS pool as cache/metadata devices etc?

Well, I can tell you that you won't likely be successful trying to use them as L2ARC. The typical ratio of L2ARC:ARC is 5:1, and exceeding 10:1 is not recommended. If you only have 12GB ARC, then you should not attempt to exceed 120GB L2ARC, and if you do any L2ARC at all, I will note that the amount of ARC present is likely not large enough to collect meaningful MRU/MFU statistics to generate good eviction choices. It is likely to mostly be caching semi-random crap (from my perspective).
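
For what it's worth, you can confirm from the shell what the ARC is actually sized at (FreeBSD sysctl names):
Code:
# current ARC size in bytes, and the configured ceiling
sysctl kstat.zfs.misc.arcstats.size
sysctl vfs.zfs.arc_max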

I don't get a real good feeling that there is a path forward to a reliable, performant system here. I do like the N40Ls; we have one here as a replication target for the office filer, and aside from blowing its brains out one day with a PSU failure, it's been a very pleasant little unit. But scope and scale matter. It will never be a highly performant system. I believe it only has two cores and a bunch of other tradeoffs.
 

djjudas21

Cadet
Joined
Oct 21, 2022
Messages
5
It will never be a highly performant system. I believe it only has two cores and a bunch of other tradeoffs.
Agreed.

It's just a bit concerning/disappointing for me that running my databases on an iSCSI -> SSD mirror is offering seemingly worse performance than just letting the databases stick their files on the single host SSD of the Kubernetes node (not taking reliability into consideration). And taking multiple seconds for the occasional SQL commit seems excessive even for consumer-grade SSDs in a low-spec machine.

I'll do some experiments with SQL on NFS and see if it's any more stable.

In the meantime, thanks for your help.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
One of the weird things about ZFS that often takes people unawares is that ZFS uses substantially greater resources to achieve a similar result to a conventional solution.

Part of this is due to ZFS being a copy-on-write filesystem with sophisticated features such as snapshots and compression. Snapshots require additional storage and CoW suggests anti-fragmentation mitigations need to be considered. Compression has unexpected interactions with the CPU, especially on a low-GHz two-core system. You can have several things demanding cores simultaneously, and in your case you're also servicing two pools, so you can have even more things demanding resources simultaneously.
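
If you want to see how much of that is in play on your system, something along these lines will show it (pool name assumed to be ssd):
Code:
# compression setting and achieved ratio, per dataset
zfs get -r compression,compressratio ssd

# space held by snapshots
zfs list -r -t snapshot -o name,used,referenced ssd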

My experience, and that of many other ZFS users, is that we trade off resource consumption for performance and features. ZFS uses an unconscionable amount of CPU and memory, and it is not unusual for under-dimensioned systems to fail to perform adequately. It may be especially significant that your observed performance is inconsistent in such a negative way.
 

djjudas21

Cadet
Joined
Oct 21, 2022
Messages
5
Just to follow up on this, there is a more fundamental problem with my TrueNAS system that I need to solve before I worry about iSCSI.

I decided to run fio on both storage pools to verify that the storage was OK. The HDD pool (4x3TB HDD in RAIDZ2) performed roughly as I would expect, but the SSD pool (2x240GB SSD in a mirror) showed terrible performance. These benchmarks were run against the mounted ZFS filesystems rather than the block devices.

I can't say I understand the IOPS figures (12k IOPS ?!) but as a general idea of performance, the HDD pool managed around 35MB/s while the SSDs managed just 6MB/s in a random write test.
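
One follow-up I might try is repeating the run with periodic fsyncs, so that RAM write-back can't mask the disks (fsync is a standard fio option):
Code:
# same random write test, but fsync after every write
fio --name=randwrite-sync --iodepth=1 --rw=randwrite --bs=4k --direct=1 \
    --size=1G --runtime=240 --fsync=1 --directory=/mnt/ssd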

I assume it's not possible to run write tests against the underlying block devices while they are mounted?

None of the drives are showing SMART errors, but I'm going to start checking out the drives and the SATA controller.

Code:
root@nas[~]# fio --name=randwrite --iodepth=1 --rw=randwrite --bs=4k --direct=1 --size=1G --runtime=240 --directory=/mnt/hdd
randwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][96.8%][w=47.9MiB/s][w=12.3k IOPS][eta 00m:01s]
randwrite: (groupid=0, jobs=1): err= 0: pid=95901: Wed Oct 26 21:17:08 2022
  write: IOPS=8933, BW=34.9MiB/s (36.6MB/s)(1024MiB/29343msec); 0 zone resets
    clat (usec): min=14, max=308992, avg=108.61, stdev=878.29
     lat (usec): min=14, max=308992, avg=109.02, stdev=878.30
    clat percentiles (usec):
     |  1.00th=[   17],  5.00th=[   18], 10.00th=[   19], 20.00th=[   20],
     | 30.00th=[   21], 40.00th=[   22], 50.00th=[   23], 60.00th=[   38],
     | 70.00th=[  137], 80.00th=[  178], 90.00th=[  235], 95.00th=[  262],
     | 99.00th=[ 1156], 99.50th=[ 1221], 99.90th=[ 2180], 99.95th=[ 3064],
     | 99.99th=[10421]
   bw (  KiB/s): min= 3402, max=115417, per=97.94%, avg=34999.81, stdev=29208.38, samples=58
   iops        : min=  850, max=28854, avg=8749.55, stdev=7302.09, samples=58
  lat (usec)   : 20=23.39%, 50=38.15%, 100=2.42%, 250=29.64%, 500=4.56%
  lat (usec)   : 750=0.14%, 1000=0.07%
  lat (msec)   : 2=1.52%, 4=0.09%, 10=0.02%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=4.64%, sys=34.76%, ctx=137571, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=34.9MiB/s (36.6MB/s), 34.9MiB/s-34.9MiB/s (36.6MB/s-36.6MB/s), io=1024MiB (1074MB), run=29343-29343msec



Code:
root@nas[~]# fio --name=randwrite --iodepth=1 --rw=randwrite --bs=4k --direct=1 --size=1G --runtime=240 --directory=/mnt/ssd
randwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.28
Starting 1 process
randwrite: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=9349KiB/s][w=2337 IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=95943: Wed Oct 26 21:20:30 2022
  write: IOPS=1512, BW=6051KiB/s (6196kB/s)(1024MiB/173282msec); 0 zone resets
    clat (usec): min=15, max=46355, avg=648.62, stdev=479.26
     lat (usec): min=15, max=46356, avg=649.73, stdev=479.44
    clat percentiles (usec):
     |  1.00th=[   18],  5.00th=[   21], 10.00th=[  116], 20.00th=[  359],
     | 30.00th=[  433], 40.00th=[  474], 50.00th=[  537], 60.00th=[  652],
     | 70.00th=[  783], 80.00th=[  988], 90.00th=[ 1303], 95.00th=[ 1532],
     | 99.00th=[ 1729], 99.50th=[ 1745], 99.90th=[ 2442], 99.95th=[ 2802],
     | 99.99th=[ 6259]
   bw (  KiB/s): min= 2210, max=101153, per=100.00%, avg=6052.88, stdev=6935.82, samples=345
   iops        : min=  552, max=25288, avg=1512.91, stdev=1733.96, samples=345
  lat (usec)   : 20=4.26%, 50=2.51%, 100=2.43%, 250=4.11%, 500=32.53%
  lat (usec)   : 750=21.65%, 1000=14.11%
  lat (msec)   : 2=18.27%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=3.14%, sys=10.71%, ctx=240041, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=6051KiB/s (6196kB/s), 6051KiB/s-6051KiB/s (6196kB/s-6196kB/s), io=1024MiB (1074MB), run=173282-173282msec
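
As for checking out the drives and the controller, the plan is roughly long SMART self-tests on each disk, something like this (device names are guesses for this box):
Code:
# full SMART detail including the error log and self-test history
smartctl -a /dev/ada2

# start a long self-test in the background; results show up in -a output later
smartctl -t long /dev/ada2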
 