Hilly iSCSI write performance (extreme CoW)

xness

Dabbler
Joined
Jun 3, 2022
Messages
30
Hello TrueNAS community,
I recently fell into the beautiful world of ZFS + TrueNAS and just built my first appliance.

I benchmarked quite a bit with dd and bonnie++ to get an idea of the limits of the HBA controller's simultaneous write performance, but quickly figured those wouldn't be representative of a real-world scenario. So I created a 4K block-size iSCSI share, hooked it up to a 10GbE server and formatted it with NTFS.

Now, I know ZFS is a copy-on-write system, but I expected the flushing of writes to be less impactful, and I'm not sure the extreme performance variation I'm experiencing is to be expected. It sometimes climbs to 1 GB/s and then drops all the way to 0 B/s for a couple of seconds. I would feel much better if it just averaged out somewhere in between.

I recorded a quick video of it below.




Anyway – here is my configuration. I know 10 disks is generally considered the maximum width a RAIDZ vdev should have, and the resilvering time for disks of this size is likely not ideal.

General hardware:
  • Motherboard: Supermicro X11SPI-TF
  • Processor: Intel® Xeon® Silver 4110 Processor
  • RAM: 96GB DDR4 (6x 16 GB DDR4 ECC 2933 MHz PC4-23400 SAMSUNG)
  • Network card: On-board 10GbE (iperf3 shows 8Gbps throughput)
  • Controller: LSI SAS9207-8i 2x SFF-8087 6G SAS PCIe x8 3.0 HBA
Drives:
  • Boot Drives: 2x 450GB SAMSUNG MZ7L3480 (via SATA)
  • Pool: 10x 18TB WDC WUH721818AL (raidz2)

Code:
root@lilith[~]# zpool status -v
  pool: Goliath
 state: ONLINE
config:


        NAME                                            STATE     READ WRITE CKSUM
        Goliath                                         ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/7a492ffe-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a57523c-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a4cccdb-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a5554d2-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a501918-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a852e97-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a4f10b4-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a1ba28a-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a52cf0d-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0
            gptid/7a4b4df4-d841-11ec-92b7-3cecef0f0024  ONLINE       0     0     0


errors: No known data errors


  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:04 with 0 errors on Wed Jun  1 03:45:04 2022
config:


        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors


Happy for any help – if I missed any required information, please let me know.
 

Attachments

  • datasets.png (423 KB)

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Is it possible you are saturating the write capacity of the HDDs, i.e. that they cannot keep up with the amount of data you are throwing at the NAS? Does the performance start off good and then die?

In simplistic terms a RAIDz array has the performance of a single disk = 150MB/s

Try the following:
1. iSCSI by default = sync writes, and sync writes to an HDD RAIDZ pool = slow. Try setting sync=disabled on the dataset with the zvol (see the example below) - does this change things? It's not necessarily a data-safe solution, but it should indicate if it's a sync write issue (likely).
2. Use a 1Gb NIC on the NAS rather than a 10Gb NIC and see if performance is consistent. I am not suggesting this as a long term solution - just a test.
3. Try the same test (with 10Gb) using the pool set up as mirrors [5 striped mirrors] - does the same happen?
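
For reference, a minimal sketch of the sync test in item 1, assuming a hypothetical zvol named Goliath/iscsi-zvol (substitute your own):

Code:
# Check the current setting first:
zfs get sync Goliath/iscsi-zvol
# Disable sync writes for the test only (not crash-safe for the client filesystem):
zfs set sync=disabled Goliath/iscsi-zvol
# Revert once the test is done:
zfs set sync=standard Goliath/iscsi-zvol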
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
In simplistic terms a RAIDz array has the performance of a single disk = 150MB/s
Actually a little bit incorrect... what you possibly meant to say is the IOPS of a single disk... throughput can be higher than a single disk's, depending on the width of the VDEV.
 

xness

Dabbler
Joined
Jun 3, 2022
Messages
30
Is it possible you are saturating the write capacity of the HDDs, i.e. that they cannot keep up with the amount of data you are throwing at the NAS? Does the performance start off good and then die?
Maybe – but gstat seems to indicate the disks are not 100% busy when the performance dips. In the video you can see that it's not the usual "start out fast, then level off" kind of copy experience; it's more like extreme performance bursts and dips over and over again.

1. iSCSI by default = sync writes, and sync writes to an HDD RAIDZ pool = slow. Try setting sync=disabled on the dataset with the zvol - does this change things? It's not necessarily a data-safe solution, but it should indicate if it's a sync write issue (likely).
With sync disabled, I initially get a sweet 750 MB/s - 1.2 GB/s until it averages out at around 200 MB/s - 400 MB/s.
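
For reference, this is roughly what I'm watching while the copy runs (standard FreeBSD/ZFS tools; pool name as above):

Code:
# Per-disk busy% and latency, physical providers only, refreshed every second:
gstat -p -I 1s
# Pool-level view of bandwidth and IOPS per vdev:
zpool iostat -v Goliath 1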
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
When I first wrote my reply I did add "but can perform better with large sequential writes", but left it out for simplicity as it's workload dependent (though I suppose it does apply to the OP's current test parameters).

I think your correction is better than my expanded (but left out) initial thoughts.
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
As noted above, RAIDZ has the IOPS of a single disk. That matters less for large sequential writes, but it may hit you on reads or random rewrites. A RAIDZ width of 10 disks does not help either. In our systems we use RAIDZ for block storage only with SSDs, and at a width of 3-5 disks per vdev. HDD pools for block storage are better off as mirrors.

Also, RAIDZ is incapable of (or very inefficient at) storing small blocks, so your "4K block-size iSCSI share" worries me. If you mean the iSCSI extent setting, it is fine, but you should not set the zvol block size that low unless you really have to, and definitely not on RAIDZ. TrueNAS by default recommends the minimal reasonable zvol block size, which for your 10-wide RAIDZ should be about 64KB. That may not be good for short random reads or rewrites, but as said, wide RAIDZ just does not fit those cases.
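
If it helps, the current value can be checked from the shell; a sketch, assuming the zvol is named Goliath/iscsi-zvol (volblocksize is fixed at creation time, so changing it means creating a new zvol and migrating the data):

Code:
zfs get volblocksize Goliath/iscsi-zvol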
 

xness

Dabbler
Joined
Jun 3, 2022
Messages
30
Thanks for the insight. I wasn't aware the boot pool couldn't be partly used as a ZIL/SLOG device – after dedicating one of the drives to the pool as a SLOG, performance normalized.
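
A minimal sketch of adding a dedicated log vdev from the shell (the device name here is only illustrative; the real partition/gptid differs):

Code:
# Add the SSD as a dedicated SLOG (log vdev) to the pool:
zpool add Goliath log ada2p2
# Confirm it shows up under "logs":
zpool status Goliath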

I'm aware ZFS is bad for random reads/writes, such as VM workloads – as also pointed out here (though 50% seems a bit extreme to me) – but we're using it as a backup storage appliance. So mostly sequential reads/writes, with iSCSI for ReFS (64K block size; 4K, i.e. the maximum, in the iSCSI extent settings) and S3 for object storage (for which I'm unsure whether to activate deduplication or not).

The pool is 10 disks wide, as that seemed to be the acceptable limit when aiming for maximum storage efficiency. Performance is secondary – I was simply worried something might be wrong given the performance fluctuation. After all, it seems like people get quite good performance on seemingly random hardware.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm aware ZFS is bad for random reads/writes, such as VM workloads – as also pointed out here (though 50% seems a bit extreme to me) – but we're using it as a backup storage appliance. So mostly sequential reads/writes, with iSCSI for ReFS (64K block size; 4K, i.e. the maximum, in the iSCSI extent settings) and S3 for object storage (for which I'm unsure whether to activate deduplication or not).

I think what's happening here is a bit of a misunderstanding of what the block/recordsize setting does in ZFS - it acts as a maximum size limit for the records.

If you've actually created a ZVOL with a volblocksize of 4K, then you're putting a huge amount of overhead and strain on your poor spinning disks, as the minimum allocation size under TrueNAS is basically always 4K (due to ashift=12, which is 2^12 or 4096) so that plus the volblocksize has resulted in ZFS being able to only write 4K records to your disks. And with RAIDZ2, that means two parity blocks per 4K data block. You're basically getting only 1/3rd of the potential space here, and also a significant hit to performance as your drives are furiously flinging their read/write heads around trying to drop these little 4K records onto them.

Now, if you've left the ZVOL at the default 16K volblocksize and enforced a 4K allocation size at iSCSI, that's different - you're still enforcing the minimum at 4K but a maximum of 16K - slightly better, but still not optimal for spinning media.
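
If you want to double-check what's actually in play, something like this should show it (the cache-file path assumes TrueNAS CORE; the zvol name is illustrative):

Code:
# Pool ashift (minimum allocation size; 12 = 4 KiB sectors):
zdb -U /data/zfs/zpool.cache -C Goliath | grep ashift
# Effective volblocksize of the zvol:
zfs get volblocksize Goliath/iscsi-zvol
# With a 4K volblocksize on RAIDZ2: 4K data + 2 x 4K parity = 12K allocated,
# i.e. roughly 1/3 space efficiency, which is the overhead described above.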

I'm going to make an odd suggestion here, which is to take iSCSI out of the picture and present your storage over SMB (unless there's a software requirement to use iSCSI) - this will give you the more sequential-friendly default of a 128K maximum recordsize, and you can also bump this up to 1M if you know you're going to be dropping large files.
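
For example (the dataset name is only illustrative), bumping the recordsize on an SMB-shared dataset:

Code:
# Allow up to 1M records for large sequential backup files:
zfs set recordsize=1M Goliath/backups
zfs get recordsize Goliath/backups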

What's the backup software here? Even if you aren't using VEEAM, there's a very well-written post here about the relationship between the file block size, filesystem allocation size, and the underlying storage block size.


Looking at the very first diagram for reference, if your "bottom layer" is chopped up into 4K or even 16K chunks, that's going to make a lot of work for your disks. Maybe try a larger volblocksize such as 64K to match the ReFS block (and if you are using VEEAM, check KB2792 on ReFS use for the backup repository here: https://www.veeam.com/kb2792 )
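
A sketch of creating such a zvol from the shell, assuming illustrative names and a sparse 10T volume (the TrueNAS UI exposes the same block size option when adding a zvol):

Code:
# New zvol with a 64K volblocksize to match the ReFS cluster size:
zfs create -s -V 10T -o volblocksize=64K Goliath/backup-refs
# Data from the existing zvol would need to be copied over;
# volblocksize cannot be changed in place.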

iSCSI by default = sync writes

Got this backwards I'm afraid, iSCSI is async by default - you're thinking of NFS where "sync=standard" is equivalent to "sync=always"
 

xness

Dabbler
Joined
Jun 3, 2022
Messages
30
If you've actually created a ZVOL with a volblocksize of 4K, then you're putting a huge amount of overhead and strain on your poor spinning disks, as the minimum allocation size under TrueNAS is basically always 4K (due to ashift=12, which is 2^12 or 4096) so that plus the volblocksize has resulted in ZFS being able to only write 4K records to your disks. And with RAIDZ2, that means two parity blocks per 4K data block. You're basically getting only 1/3rd of the potential space here, and also a significant hit to performance as your drives are furiously flinging their read/write heads around trying to drop these little 4K records onto them.

I'm using 128K on the ZVOL, but 4K on the iSCSI extent, as that's the maximum value you can set in the advanced settings. The default is 512B.

We're mostly using iSCSI as a temporary workaround, as we have to process a 50 TB machine via Veeam CSP in the short term and don't have the storage for two full backups on any other appliance right now. Because we store the data in a dedicated VHDX (so basically ZFS > iSCSI ReFS > VHDX > ReFS…), iSCSI is required. We originally planned to use the ZFS appliance for S3 storage only.

The graphics in that Veeam Best Practices article in particular have indeed been very valuable.


PS: I talked to NugentS somewhere else – because my posts initially required moderation on here – and disabling the sync setting indeed "improved" performance, as in it made it sit consistently at 400 MB/s.
PPS: @HoneyBadger – do you have any experience when it comes to Veeam and deduplication (with ZFS)?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm using 128K on the ZVOL, but 4K on the iSCSI extent, as that's the maximum value you can set in the advanced settings. The default is 512B.

Gotcha - the iSCSI extent setting is equivalent to the "sector size" on a physically attached HDD, in that it's the "minimum granularity" the block device/extent will allow.

We're mostly using iSCSI as a temporary workaround, as we have to process a 50 TB machine via Veeam CSP in the short term and don't have the storage for two full backups on any other appliance right now. Because we store the data in a dedicated VHDX (so basically ZFS > iSCSI ReFS > VHDX > ReFS…), iSCSI is required. We originally planned to use the ZFS appliance for S3 storage only.

That's quite a few intermediary layers (ZFS exporting iSCSI with ReFS inside, storing a VHDX with ReFS inside it? Am I understanding that correctly?), so chopping out some of the middlemen might be useful by using a NAS repository in VEEAM instead. But only you know your environment, so I might be missing something obvious here as to why you can't migrate there.

PS: I talked to NugentS somewhere else – because my posts initially required moderation on here – and disabling the sync setting indeed "improved" performance, as in it made it sit consistently at 400 MB/s.
PPS: @HoneyBadger – do you have any experience when it comes to Veeam and deduplication (with ZFS)?
I haven't experimented much with ReFS, but if it's expecting or pushing synchronous writes, it will definitely have been a bottleneck on an array of spinning disks with no SLOG device. Disabling sync improves the speed, but I'd be cautious, as the VEEAM KB specifically calls out a "power loss with pending metadata update" as a potential data-loss scenario with ReFS.

As far as deduplication - let VEEAM handle it. It's done in a more "offline" or asynchronous manner, gets to the data before transmission (where CPU performance should be at a surplus) and doesn't result in the same kind of overhead that you would get if you do it at the block level with ZFS.
 

xness

Dabbler
Joined
Jun 3, 2022
Messages
30
Gotcha - the iSCSI extent setting is equivalent to the "sector size" on a physically attached HDD, in that it's the "minimum granularity" the block device/extent will allow.
Ah – I didn't know that was a thing, but I guess it makes sense!

[...] chopping out some of the middlemen might be useful by using a NAS repository in VEEAM instead. [...] I might be missing something obvious here as to why you can't migrate there.
You're right; it is less than ideal. Unfortunately we're a Veeam Cloud Service Provider (CSP), and the CSP console is heavily constrained when it comes to things like scale-out repos, off-site backups or any other form of non-block storage. The only supported way to achieve immutability or long-term archiving is tape.
So we're passing a VHDX through to the VM running CSP, so that we can back it up with our Veeam console outside on the hypervisor and use all the convenient repository options. Definitely a special use case here.

As far as deduplication - let VEEAM handle it. It's done in a more "offline" or asynchronous manner, gets to the data before transmission (where CPU performance should be at a surplus) and doesn't result in the same kind of overhead that you would get if you do it at the block level with ZFS.
The issue is that Veeam GFS backups don't seem to use the ReFS API, and thus don't gain any space-saving benefits when pushed to a CSP – even though it's practically a synthetic full. That means they take up the size of an active full backup; it's probably always like that if you do a Copy job, but I haven't tested it.

For the long-term archiving via VHDX backup mentioned above, I think deduplication on a separate ZFS pool (separate from the iSCSI one), exposed via MinIO S3, is probably the best way to go. Especially since v12, releasing in Q4/2022, will support direct backup to object storage without intermediate block storage.
 
Last edited: