I think setting MTU=9000 on all systems has helped; I'm not seeing any more stalls. But NFS throughput is still not great. Testing with fio, it averages about 400 MiB/s read and 550 MiB/s write. Writes seem to be bottlenecked by the server CPU, and more of that CPU time is going to kernel threads than to nfsd itself. Is there something I need to tune for this? I have the NFS server running 12 threads. The server is a Xeon E5-2620 v3 (6 cores) with 64 GB RAM.
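For reference, this is roughly how the thread count is set (assuming CentOS 7; newer releases use the `[nfsd]` section of /etc/nfs.conf instead):

```shell
# /etc/sysconfig/nfs (CentOS 7): number of nfsd kernel threads
RPCNFSDCOUNT=12

# Can also be changed at runtime without a service restart:
rpc.nfsd 12
cat /proc/fs/nfsd/threads   # verify the current count
```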
The two machines are connected directly with a Mellanox 40G link. iperf3 shows 28 Gbit/s with a single stream, or the full 40 Gbit/s with -P 2 parallel streams.
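For comparison, here are those link rates converted into fio's units (quick shell arithmetic; 1 Gbit/s = 10^9/8 bytes/s, 1 MiB = 2^20 bytes):

```shell
#!/bin/sh
# Convert Gbit/s to MiB/s so the iperf3 and fio numbers are comparable.
gbps_to_mibs() {
    awk -v g="$1" 'BEGIN { printf "%.0f\n", g * 1e9 / 8 / 1048576 }'
}
gbps_to_mibs 28   # single iperf3 stream -> 3338 MiB/s
gbps_to_mibs 40   # full line rate      -> 4768 MiB/s
```

So the measured 400-550 MiB/s over NFS is only roughly 10-15% of what the wire can carry, which points at the hosts rather than the network.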
I tried mounting the client with larger rsize and wsize values, but 128K (131072) seems to be the maximum allowed on CentOS.
Code:
$ mount | grep bignas
192.168.2.194:/mnt/pool/pawnee on /mnt/bignas type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.2.50,local_lock=none,addr=192.168.2.194)
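On Linux, the 128K cap is usually imposed by the server side rather than the client mount. A sketch of checking and raising it (assumes a Linux NFS server with the nfsd filesystem mounted at /proc/fs/nfsd; the file is only writable while no nfsd threads are running):

```shell
# Current server-side cap on rsize/wsize, in bytes:
cat /proc/fs/nfsd/max_block_size

# Raise it with nfsd stopped, then restart and remount the client:
systemctl stop nfs-server
echo 1048576 > /proc/fs/nfsd/max_block_size   # allow up to 1M
systemctl start nfs-server
# client side: mount with rsize=1048576,wsize=1048576
```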
Running the same tests directly on the server, read speed is over 3000 MiB/s (served from ARC). Write is about 1000 MiB/s, and CPU usage is still close to 100%. Is this high CPU usage normal for ZFS, given all the checksumming and so on?
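To see where that CPU time actually goes, a per-thread view helps (thread names like z_wr_iss/z_wr_int are ZFS write-issue/completion taskqs; exact names vary by ZFS version):

```shell
#!/bin/sh
# List the busiest threads system-wide. With a ZFS write test running,
# you'd expect z_wr_* (compression/checksum work) or nfsd near the top.
ps -eLo tid,comm,%cpu --sort=-%cpu | head -n 15
```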
Write test:
Code:
$ fio --name=larry --rw=write --bs=1M --size=10G --directory=/mnt/bignas --end_fsync=1
larry: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
larry: (groupid=0, jobs=1): err= 0: pid=311124: Mon Sep 30 18:10:40 2019
write: IOPS=546, BW=546MiB/s (573MB/s)(10.0GiB/18750msec)
clat (usec): min=369, max=496, avg=382.20, stdev=10.00
lat (usec): min=377, max=625, avg=394.19, stdev=10.74
clat percentiles (usec):
| 1.00th=[ 375], 5.00th=[ 375], 10.00th=[ 375], 20.00th=[ 379],
| 30.00th=[ 379], 40.00th=[ 379], 50.00th=[ 379], 60.00th=[ 383],
| 70.00th=[ 383], 80.00th=[ 383], 90.00th=[ 388], 95.00th=[ 400],
| 99.00th=[ 412], 99.50th=[ 469], 99.90th=[ 482], 99.95th=[ 486],
| 99.99th=[ 490]
bw ( MiB/s): min= 2456, max= 2550, per=100.00%, avg=2532.11, stdev=31.49, samples=8
iops : min= 2456, max= 2550, avg=2532.00, stdev=31.46, samples=8
lat (usec) : 500=100.00%
cpu : usr=0.75%, sys=32.98%, ctx=217020, majf=0, minf=579
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,10240,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=546MiB/s (573MB/s), 546MiB/s-546MiB/s (573MB/s-573MB/s), io=10.0GiB (10.7GB), run=18750-18750msec
Read test:
Code:
$ fio --name=joe --rw=read --bs=1M --directory=/mnt/bignas --size=30G
joe: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=439MiB/s,w=0KiB/s][r=439,w=0 IOPS][eta 00m:00s]
joe: (groupid=0, jobs=1): err= 0: pid=310816: Mon Sep 30 18:08:53 2019
read: IOPS=406, BW=407MiB/s (427MB/s)(30.0GiB/75484msec)
clat (usec): min=100, max=165236, avg=2456.42, stdev=8027.44
lat (usec): min=100, max=165236, avg=2456.48, stdev=8027.45
clat percentiles (usec):
| 1.00th=[ 109], 5.00th=[ 114], 10.00th=[ 117], 20.00th=[ 125],
| 30.00th=[ 212], 40.00th=[ 392], 50.00th=[ 562], 60.00th=[ 717],
| 70.00th=[ 832], 80.00th=[ 1090], 90.00th=[ 4293], 95.00th=[ 9503],
| 99.00th=[ 50070], 99.50th=[ 55837], 99.90th=[ 65799], 99.95th=[ 70779],
| 99.99th=[103285]
bw ( KiB/s): min=137216, max=804864, per=99.95%, avg=416524.73, stdev=157154.73, samples=150
iops : min= 134, max= 786, avg=406.72, stdev=153.46, samples=150
lat (usec) : 250=32.09%, 500=14.48%, 750=18.08%, 1000=12.80%
lat (msec) : 2=9.29%, 4=2.94%, 10=5.63%, 20=1.95%, 50=1.68%
lat (msec) : 100=1.05%, 250=0.01%
cpu : usr=0.06%, sys=11.67%, ctx=81644, majf=0, minf=1084
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=30720,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=30.0GiB (32.2GB), run=75484-75484msec
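One way to tell a per-thread limit apart from a protocol or disk limit is to rerun the same workload with several jobs in parallel (a sketch; the job count and size here are arbitrary):

```shell
fio --name=par --rw=write --bs=1M --size=4G --numjobs=4 \
    --directory=/mnt/bignas --end_fsync=1 --group_reporting
```

If aggregate bandwidth scales up with numjobs, the single-stream path (one nfsd request pipeline, one TCP stream) is the bottleneck rather than the pool itself.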