Will it do a million IOPS?

Donny Davis

So in my latest FreeNAS endeavor I am trying to hit a million IOPS.

My machine is as follows:
Dell PowerEdge T620
2x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
192GB of DDR3 memory (ECC)
4x Intel DC P3600 1.6TB NVMe drives in a 4-way stripe

So far I am quite a clip away from 1M IOPS. However, under Linux with mdadm I can hit the number I am looking for, so I am pretty sure it's not the gear.

I am sure FreeNAS and ZFS are quite capable and that I have something misconfigured.

It could be my testing method, my configuration, or both. Any pointers on how to push this to the max would be much appreciated.
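
For reference, the Linux comparison was a plain mdadm RAID 0 over the same four drives, roughly like this (device names and filesystem are from memory, so treat it as a sketch):

Code:
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/bench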

So far I am at 100K IOPS using a 4k random-write test with fio (randrw with --rwmixread=0 is effectively 100% writes).

Code:

fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=0 --iodepth=16 --numjobs=32 --runtime=60 --group_reporting --name=4ktest --size=4G --bs=4k

4ktest: (groupid=0, jobs=32): err= 0: pid=11832: Mon Dec 30 19:45:12 2019
  write: IOPS=102k, BW=397MiB/s (416MB/s)(23.2GiB/60010msec)
    clat (usec): min=13, max=81419, avg=310.83, stdev=620.32
     lat (usec): min=13, max=81420, avg=311.27, stdev=620.54
    clat percentiles (usec):
     |  1.00th=[   37],  5.00th=[   52], 10.00th=[   68], 20.00th=[   90],
     | 30.00th=[  102], 40.00th=[  116], 50.00th=[  137], 60.00th=[  165],
     | 70.00th=[  223], 80.00th=[  371], 90.00th=[  758], 95.00th=[ 1172],
     | 99.00th=[ 2311], 99.50th=[ 2966], 99.90th=[ 5932], 99.95th=[ 9110],
     | 99.99th=[20317]
   bw (  KiB/s): min= 2525, max=30773, per=3.13%, avg=12708.65, stdev=5502.21, samples=3831
   iops        : min=  631, max= 7693, avg=3176.88, stdev=1375.55, samples=3831
  lat (usec)   : 20=0.01%, 50=4.35%, 100=24.07%, 250=44.31%, 500=11.43%
  lat (usec)   : 750=5.63%, 1000=3.67%
  lat (msec)   : 2=5.13%, 4=1.18%, 10=0.18%, 20=0.03%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.69%, sys=25.15%, ctx=9471830, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,6094225,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Just for giggles I bumped up the block size so I could see some throughput numbers.
write: IOPS=15.9k, BW=7931MiB/s (8316MB/s)

I guess I can live with 8 GB/s.

Code:
4ktest: (groupid=0, jobs=32): err= 0: pid=12389: Mon Dec 30 19:49:32 2019
  write: IOPS=15.9k, BW=7931MiB/s (8316MB/s)(128GiB/16527msec)
    clat (usec): min=107, max=76720, avg=1901.41, stdev=2120.80
     lat (usec): min=114, max=76824, avg=1993.77, stdev=2127.88
    clat percentiles (usec):
     |  1.00th=[  343],  5.00th=[  668], 10.00th=[  775], 20.00th=[  930],
     | 30.00th=[ 1074], 40.00th=[ 1205], 50.00th=[ 1336], 60.00th=[ 1500],
     | 70.00th=[ 1713], 80.00th=[ 2212], 90.00th=[ 3523], 95.00th=[ 5080],
     | 99.00th=[10421], 99.50th=[14091], 99.90th=[25035], 99.95th=[30278],
     | 99.99th=[42730]
   bw (  KiB/s): min=131334, max=453632, per=3.14%, avg=255016.27, stdev=72851.62, samples=1024
   iops        : min=  256, max=  886, avg=497.70, stdev=142.32, samples=1024
  lat (usec)   : 250=0.67%, 500=1.19%, 750=6.69%, 1000=16.64%
  lat (msec)   : 2=51.99%, 4=14.81%, 10=6.92%, 20=0.89%, 50=0.21%
  lat (msec)   : 100=0.01%
  cpu          : usr=5.57%, sys=45.56%, ctx=1422776, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 

hescominsoon

A million IOPS? Did you disable synchronous writes on your pool(s)? Also, you say you can hit it under Linux.. Linux is not as interested in data safety as ZFS is. https://hescominsoon.com/archives/6319

Other than redoing the pool into nothing but striping with no fault tolerance and disabling synchronous writes, I am sure there are other things you could do to hit that mark.. but FreeNAS, via ZFS, is more concerned about data integrity. I would like to see the full suite of commands you ran under Linux to get that 1 million IOPS, because unless shown otherwise, if that test was run after other tests it is more likely the product of RAM caching than of the disks themselves. Yes, according to the specs the drives do UP TO 250K IOPS.. but that's under ideal conditions in controlled testing.
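
If you do want to rule sync writes out, it is a one-liner on the pool (substitute your pool name for tank; this trades safety for speed, so use it for benchmarking only):

Code:
zfs set sync=disabled tank
# put it back when you are done benchmarking
zfs set sync=standard tank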
 

Donny Davis

Thank you for the reply @hescominsoon

I forgot to add the original read test - and that is where I am trying to hit 1M IOPS.

This is what I am using on Linux to get the read performance I am looking for. I am quoting the command from memory, because I flush the cache between tests. When the data set is in memory it gets around 22 GB/s, which makes sense.

Code:
fio --filename=test --direct=1 --rw=randrw  --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --ioengine=libaio --size=4G --bs=4k


4ktest: (groupid=0, jobs=24): err= 0: pid=28070: Tue Dec 31 14:26:31 2019
   read: IOPS=1481k, BW=5786MiB/s (6067MB/s)(96.0GiB/16991msec)
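
For reference, flushing the Linux page cache between runs is just:

Code:
sync
echo 3 > /proc/sys/vm/drop_caches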


I must have been borking something up the first time around, because now I am much closer to hitting the number I am looking for. Here is where I am currently at: 630K IOPS.

Code:
fio --filename=test --direct=1 --rw=randrw  --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=4k
4ktest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=128
...
fio-3.5
Starting 16 processes
4ktest: Laying out IO file (1 file / 4096MiB)
Jobs: 7 (f=7): [_(3),r(2),_(1),r(1),_(1),r(2),E(1),r(1),_(1),r(1),_(2)][90.0%][r=2547MiB/s,w=0KiB/s][r=652k,w=0 IOPS][eta 00m:03s]
4ktest: (groupid=0, jobs=16): err= 0: pid=24420: Thu Jan  2 11:23:22 2020
   read: IOPS=628k, BW=2452MiB/s (2571MB/s)(64.0GiB/26725msec)
 

Donny Davis

I have been checking the ioengine options in fio. It would seem that I am hitting a CPU bottleneck with fio. All threads hit 100% CPU and stay there, so maybe the psync / posixaio engines aren't fast enough to keep up with this array.

It doesn't matter if I stripe / raidz / RAID10 this array - 630K IOPS is the top end of what fio reports
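
For anyone following along, trying posixaio is just an engine swap on the same test:

Code:
fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --ioengine=posixaio --size=4G --bs=4k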
 

hescominsoon

Yep, ZFS is CPU intensive. You will need faster cores to hit that mark: either faster cores of the same vintage, or faster but fewer cores of a more recent vintage.
 

Donny Davis

Yeah, I think that is where I am bottlenecking. I am going to do some BIOS tuning and report back.
 

hescominsoon

About the only thing you could do is turn on HT if it's off.. otherwise you need a faster / higher core count CPU, as I mentioned earlier.. :)
 

Donny Davis

So my next question may be completely moot, but if I wanted to create a single pool that runs at maximum speed and lowest latency, should I create a striped pool of disks or add many vdevs with a single disk in each? Not worried about failures at all, just performance.
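
From what I can tell those end up being the same layout anyway: a striped pool in ZFS is just a pool with several single-disk top-level vdevs, and writes are spread across all of them. A sketch, assuming FreeBSD NVMe device names:

Code:
zpool create fastpool nvd0 nvd1 nvd2 nvd3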
 

Donny Davis

Well, I see a small improvement turning hyperthreading *off*.
If something is CPU bound, as was stated before, single-core performance starts to matter. I was able to squeeze 40K more IOPS out of this machine by disabling hyperthreading.

Code:
fio --filename=test --direct=1 --rw=randrw  --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=8 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=4k
4ktest: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=128
...
fio-3.5
Starting 8 processes
4ktest: Laying out IO file (1 file / 4096MiB)
Jobs: 8 (f=8): [R(8)][92.3%][r=2708MiB/s,w=0KiB/s][r=693k,w=0 IOPS][eta 00m:01s]
4ktest: (groupid=0, jobs=8): err= 0: pid=10072: Thu Jan  2 14:33:55 2020
   read: IOPS=670k, BW=2616MiB/s (2743MB/s)(32.0GiB/12528msec)
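
If you would rather not reboot into the BIOS for this, FreeBSD also has a loader tunable that keeps the hyperthread cores out of the scheduler (I used the BIOS switch, so this route is untested on my end):

Code:
# /boot/loader.conf
machdep.hyperthreading_allowed="0"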
 

Donny Davis

As far as I know there is nothing left I can disable in that realm. It's on OS Control ATM, but I am about to put it on Max Performance. Will report back with those results.
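
On the FreeBSD side, the rough equivalent when the BIOS stays on OS Control is keeping the C-states shallow (a sketch, not something I have actually changed):

Code:
# /etc/rc.conf
performance_cx_lowest="C1"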
 

Donny Davis

Well, this is a bummer to say, but with my current setup it looks like ZFS gets roughly half of what mdadm does in this benchmark. That said, I am confident at this point it's not actually ZFS or FreeNAS.

fio on FreeBSD cannot use libaio, which is seemingly much faster at this task than psync or sync. I get roughly the same numbers on Linux when I use the same engine, so the test itself is the flaw.

On to more important things: most people don't use a NAS locally, they use it via some form of sharing. The next segment of this thread will be taking this system to the 40G network and having some fun.

I will be shooting for 100K read IOPS via NFS. From where I am right now, I get about 30K.

Improvements made so far: turned off hyperthreading to get 30K more IOPS.
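
For the NFS round, the client mount options will matter as much as the server. A starting point on a Linux client (server name and export path here are placeholders; nconnect needs kernel 5.3 or newer):

Code:
mount -t nfs -o vers=3,rsize=1048576,wsize=1048576,nconnect=8 freenas:/mnt/tank/bench /mnt/bench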
 

hescominsoon

BSD doesn't like HT, so turning it off is a good thing.. I missed that one.. :)
 