Am I getting expected performance from the desktop hardware?

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
Processor: AMD 3700x
Memory: 128GB
Disks: 20 disks, 10 vdevs (mirrors), all SSDs (flash)
TrueNAS-SCALE-Bluefin
No log or cache vdev on the pool.

Below is the fio run. Is there a better way to test?

root@tn[/mnt/Tank]# fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=4k
Code:
4ktest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=128
...
fio-3.25
Starting 16 processes
4ktest: Laying out IO file (1 file / 4096MiB)
Jobs: 16 (f=16): [r(16)][100.0%][r=613MiB/s][r=157k IOPS][eta 00m:00s]
4ktest: (groupid=0, jobs=16): err= 0: pid=3476836: Sun Dec 25 17:05:48 2022
  read: IOPS=157k, BW=612MiB/s (641MB/s)(35.8GiB/60001msec)
    clat (usec): min=2, max=44261, avg=98.45, stdev=344.53
     lat (usec): min=2, max=44261, avg=98.64, stdev=344.64
    clat percentiles (usec):
     |  1.00th=[    8],  5.00th=[   11], 10.00th=[   14], 20.00th=[   18],
     | 30.00th=[   23], 40.00th=[   31], 50.00th=[   48], 60.00th=[  122],
     | 70.00th=[  151], 80.00th=[  172], 90.00th=[  188], 95.00th=[  202],
     | 99.00th=[  247], 99.50th=[  343], 99.90th=[ 2114], 99.95th=[ 7373],
     | 99.99th=[16188]
   bw (  KiB/s): min=467157, max=838431, per=100.00%, avg=626911.93, stdev=3761.33, samples=1904
   iops        : min=116789, max=209606, avg=156724.97, stdev=940.33, samples=1904
  lat (usec)   : 4=0.01%, 10=3.35%, 20=21.29%, 50=26.00%, 100=4.57%
  lat (usec)   : 250=43.82%, 500=0.69%, 750=0.08%, 1000=0.03%
  lat (msec)   : 2=0.05%, 4=0.03%, 10=0.04%, 20=0.03%, 50=0.01%
  cpu          : usr=4.80%, sys=83.94%, ctx=153163, majf=0, minf=206
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=9393058,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=612MiB/s (641MB/s), 612MiB/s-612MiB/s (641MB/s-641MB/s), io=35.8GiB (38.5GB), run=60001-60001msec
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Is there a better way to test?
What are you testing for?

Do you want to check your IOPS? (seems you're doing pretty well at 100-200K IOPS).

Do you want to test your throughput? (I guess you might see 600MB/s as slow... )

Maybe try with larger blocks (at least 128K, to match the recordsize of your dataset... and optionally increase the dataset recordsize to 1MB and then match it in the fio test). You'll likely see higher throughput with those settings.
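For reference, a minimal sketch of checking and raising the recordsize from the shell (assuming the dataset is Tank, matching the /mnt/Tank path above; the new size only applies to data written after the change):

Code:
# check the current recordsize on the dataset
zfs get recordsize Tank
# raise it to 1M; only newly written blocks use the new size
zfs set recordsize=1M Tank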
 

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
Thank you!

Below are the commands I used based on your feedback. Can you please review and confirm I did as you intended?

==== BS to 128k
fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=128k
Code:
4ktest: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=128
...
fio-3.25
Starting 12 processes
4ktest: Laying out IO file (1 file / 4096MiB)
Jobs: 11 (f=11): [r(5),_(1),r(6)][80.0%][r=12.9GiB/s][r=105k IOPS][eta 00m:01s]
4ktest: (groupid=0, jobs=12): err= 0: pid=1226539: Wed Dec 28 19:33:49 2022
  read: IOPS=105k, BW=12.9GiB/s (13.8GB/s)(48.0GiB/3733msec)
    clat (usec): min=7, max=30094, avg=104.43, stdev=166.33
     lat (usec): min=7, max=30094, avg=104.58, stdev=166.35
    clat percentiles (usec):
     |  1.00th=[   17],  5.00th=[   38], 10.00th=[   47], 20.00th=[   58],
     | 30.00th=[   75], 40.00th=[   89], 50.00th=[   99], 60.00th=[  109],
     | 70.00th=[  121], 80.00th=[  145], 90.00th=[  169], 95.00th=[  184],
     | 99.00th=[  208], 99.50th=[  221], 99.90th=[  334], 99.95th=[ 1020],
     | 99.99th=[ 7111]
   bw (  MiB/s): min=11175, max=16107, per=100.00%, avg=13406.98, stdev=180.32, samples=75
   iops        : min=89401, max=128862, avg=107254.95, stdev=1442.55, samples=75
  lat (usec)   : 10=0.01%, 20=1.39%, 50=11.70%, 100=38.35%, 250=48.38%
  lat (usec)   : 500=0.09%, 750=0.02%, 1000=0.02%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=2.95%, sys=94.48%, ctx=4915, majf=0, minf=158
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=393216,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=12.9GiB/s (13.8GB/s), 12.9GiB/s-12.9GiB/s (13.8GB/s-13.8GB/s), io=48.0GiB (51.5GB), run=3733-3733msec


==== BS to 1 MB
fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=1MB

Code:
4ktest: (g=0): rw=randrw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=128
...
fio-3.25
Starting 12 processes
Jobs: 12 (f=12): [r(12)][75.0%][r=13.1GiB/s][r=13.4k IOPS][eta 00m:01s]
4ktest: (groupid=0, jobs=12): err= 0: pid=1285036: Wed Dec 28 19:36:40 2022
  read: IOPS=13.5k, BW=13.1GiB/s (14.1GB/s)(48.0GiB/3653msec)
    clat (usec): min=81, max=27398, avg=815.52, stdev=586.53
     lat (usec): min=81, max=27398, avg=815.88, stdev=586.61
    clat percentiles (usec):
     |  1.00th=[  145],  5.00th=[  289], 10.00th=[  363], 20.00th=[  433],
     | 30.00th=[  562], 40.00th=[  635], 50.00th=[  709], 60.00th=[  807],
     | 70.00th=[  930], 80.00th=[ 1237], 90.00th=[ 1385], 95.00th=[ 1467],
     | 99.00th=[ 1631], 99.50th=[ 2212], 99.90th=[ 8029], 99.95th=[11076],
     | 99.99th=[19268]
   bw (  MiB/s): min=11785, max=16767, per=100.00%, avg=13782.36, stdev=161.61, samples=69
   iops        : min=11785, max=16764, avg=13781.08, stdev=161.53, samples=69
  lat (usec)   : 100=0.10%, 250=3.13%, 500=22.29%, 750=29.09%, 1000=18.31%
  lat (msec)   : 2=26.53%, 4=0.28%, 10=0.21%, 20=0.06%, 50=0.01%
  cpu          : usr=1.10%, sys=95.19%, ctx=6400, majf=0, minf=144
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=49152,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=13.1GiB/s (14.1GB/s), 13.1GiB/s-13.1GiB/s (14.1GB/s-14.1GB/s), io=48.0GiB (51.5GB), run=3653-3653msec
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You're certainly getting much better throughput (almost 20x) now that you're testing the real dimensions of the pool/dataset, with 128K doing only a little worse than 1M, so I would say all is good with the pool and the tests look right.

Depending on the intended use of your pool, you may want to go back to 128K or even smaller recordsizes... you've proven that your pool is capable of 200K IOPS or 13GiB/s (just not at the same time)... the rest is up to you.
 

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
Is the above test both read and write?
I'm OK with 100K IOPS and 13(ish) GiB/s considering what the hardware is.

The use case will be VMware hosts. I only have a single 10GbE network connection to the box, so anything above roughly 1.25 GB/s (10 Gb/s ÷ 8) isn't possible over the wire anyway.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You'll need to use smaller recordsizes/block sizes, as I understand VMware uses 32K blocks (research that yourself to be sure of the number, I'm not an expert in that; here's some kind of starting point: https://openzfs.readthedocs.io/en/latest/performance-tuning.html#virtual-machines).

Also, for block storage, you're not really looking at throughput as a bottleneck in any case, as you'll be IOPS-bound with sync writes (maybe consider a SLOG if you're serious about performance, but feel free to see how you do without one since it's an all-SSD pool; there's a quick sync-write sketch at the end of this post).

You're only reporting READ speeds there.

Use rw=write to test write speeds.

I use a setup like this:

fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=write --size=50g --io_size=1500g --blocksize=128k --iodepth=16 --direct=1 --numjobs=16 --runtime=120 --group_reporting
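
If you also want to see what sync writes cost (relevant for iSCSI from ESXi), here's a minimal sketch of a sync-write variant; the 32K block size is just the assumed VMware-ish size from above, and --sync=1 makes fio open the file O_SYNC:

Code:
fio --name=syncwrite --filename=fio-sync.dat --rw=randwrite --bs=32k --sync=1 --direct=1 --size=4G --numjobs=4 --runtime=60 --group_reporting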
 

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
I use a setup like this:

fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=write --size=50g --io_size=1500g --blocksize=128k --iodepth=16 --direct=1 --numjobs=16 --runtime=120 --group_reporting
I ran this command against my system and see the following results.

fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=write --size=50g --io_size=1500g --blocksize=128k --iodepth=16 --direct=1 --numjobs=16 --runtime=120 --group_reporting


Code:
TEST: (groupid=0, jobs=16): err= 0: pid=1734839: Thu Dec 29 07:37:22 2022
  write: IOPS=66.7k, BW=8331MiB/s (8736MB/s)(976GiB/120003msec); 0 zone resets
    clat (usec): min=9, max=132444, avg=225.07, stdev=1009.21
     lat (usec): min=10, max=132446, avg=236.98, stdev=1017.55
    clat percentiles (usec):
     |  1.00th=[   26],  5.00th=[   38], 10.00th=[   47], 20.00th=[   64],
     | 30.00th=[   82], 40.00th=[  101], 50.00th=[  119], 60.00th=[  135],
     | 70.00th=[  149], 80.00th=[  165], 90.00th=[  219], 95.00th=[  486],
     | 99.00th=[ 2474], 99.50th=[ 4817], 99.90th=[13304], 99.95th=[19530],
     | 99.99th=[38011]
   bw (  MiB/s): min=  190, max=20619, per=100.00%, avg=8343.11, stdev=306.12, samples=3824
   iops        : min= 1526, max=164952, avg=66743.73, stdev=2448.92, samples=3824
  lat (usec)   : 10=0.01%, 20=0.26%, 50=11.79%, 100=27.60%, 250=51.45%
  lat (usec)   : 500=4.00%, 750=1.49%, 1000=0.81%
  lat (msec)   : 2=1.32%, 4=0.61%, 10=0.50%, 20=0.11%, 50=0.04%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=5.99%, sys=40.81%, ctx=3607092, majf=0, minf=221
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,7998381,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=8331MiB/s (8736MB/s), 8331MiB/s-8331MiB/s (8736MB/s-8736MB/s), io=976GiB (1048GB), run=120003-120003msec
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Writing at an average of 66K IOPS isn't bad with a throughput of 8GiB/s.

No doubt the throughput would be lower with 4K or 32K blocks, though. You may mitigate that a bit by reducing the recordsize (remembering that you'll possibly be using zvols/iSCSI anyway, so you can set the volblocksize there instead).
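
As an illustration only (the name and size are placeholders), a sketch of creating a zvol with a smaller block size for an iSCSI extent; note that volblocksize is fixed at creation time:

Code:
# create a 500G zvol under Tank with 16K blocks (hypothetical name and size)
zfs create -V 500G -o volblocksize=16k Tank/vmware-lun0
# confirm the property
zfs get volblocksize Tank/vmware-lun0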
 

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
What do people do/use for systems that have lower recordsize requirements? We saw the performance hit when I set the tests to use 4K. I feel that's leaving a lot of performance on the table.

The other interesting fact is that when the system is in use and VMs are making use of it, I don't see much CPU use. Currently, I'm connected to the host via iSCSI.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
We saw the performance hit when I set the tests to use 4K. I feel that's leaving a lot of performance on the table.
Don't mistake huge numbers of small reads reducing throughput for a bunch of potential performance sitting idle.

If you have a limit to IOPS and each of those IOPS is only 4K, you need 32 IOPS to transfer the same data you can move in a single 128K IOP. If you're moving a lot of data around, you can see how the IOPS add up quickly: transferring 1GiB takes 262,144 IOPS at 4K but only 8,192 at 128K.

But if your consumer isn't sending you 128K blocks all the time, there's not much point to having a recordsize of 128K.

Things like being sequential/unfragmented still matter (due to ZFS transaction grouping), but can matter a bit less on SSDs.

You can certainly play around with block/record sizes and see what fio tells you about the optimal settings.
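
A minimal sketch of that kind of sweep, assuming a simple shell loop (the file name and sizes are placeholders) and grepping out just the read bandwidth/IOPS summary lines:

Code:
for bs in 4k 16k 32k 128k 1m; do
  echo "=== bs=$bs ==="
  fio --name=bs-$bs --filename=fio-sweep.dat --rw=randread --bs=$bs \
      --direct=1 --size=4G --numjobs=4 --runtime=30 --group_reporting \
      | grep -E 'read:|READ:'
done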
 

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
Seeing some interesting configs in TrueNAS and VMware when it comes to block size.

On the TrueNAS side:
- The max block size I can set for a zvol is 128 KiB.
- The smallest logical block size I can set for an iSCSI export is 512 bytes.

On the VMware side:
The only block size I can set is 1 MB.

Doesn't look like I can match them across the board.
 