Optimal pool settings for large sequential writes via SMB share

Joined
Dec 3, 2018
Messages
13
Hi!

I'm currently trying to optimize my existing setup to yield better sequential write throughput via SMB from two Windows 10 machines over a 10 GbE link.
Machine specs of the NAS:
  • Supermicro CSE826 X9DRD-7LN4F-JBOD 19" 2U 12x 3.5" LFF
  • 1 x Intel Xeon E5-2640 SR0KR 6C Server Processor, 6x 2.50 GHz, 15MB Cache
  • 32GB Registered ECC DDR3 RAM
  • 10 Gigabit Dual Port Intel X520, only one port used
  • The mainboard has 2x SFF-8087 ports on an onboard LSI/Broadcom 2308 SAS HBA in IT mode
  • 10x 6TB HGST HUS726060AL5210 (7.2k SAS 512e HDD) in an 8+2 RAIDZ2 configuration (1 vdev)
  • FreeNAS 11.2-U4.1
The task is to copy large files (video camera recordings) from two Windows 10 systems to an SMB share on the NAS. Minimum file size is 40 GB (yes, 40 gigabytes), maximum file size is 240 GB. There are no small files. The source files reside on Samsung 860 Evo SSDs, and the task is to copy them to the NAS as fast as possible. Each Windows 10 machine has a USB-SATA bridge to which an SSD is attached. The USB-SATA bridges max out at 370 MB/s read each and are thus a bottleneck on the client side. The Windows 10 machines are connected to the NAS via 10 Gbit/s Ethernet.

The pool will be empty at the start, and it won't be filled beyond 60%. Compression for the zpool has been switched off since the video is not really that compressible, and I'd rather save the CPU cycles for parity calculations and SMB.
Each client will only copy one file at a time to the NAS, and there is a ~30 second delay between subsequent copies. There is no concurrent read access. There is no random write access. Reads only happen while both clients do not write anything.

Right now I'm getting a sustained total of ~400 MB/s writes (i.e. 200 MB/s instead of 370 MB/s per client) when writing via SMB to the NAS, and I'd like to improve on that. Possible tuning knobs I've been looking at:
  1. Increase ZFS record size (current status: default, 128K)
  2. Increase ashift (current status: default, 12)
  3. Enable compression (current status: off)
  4. Mess with SMB settings on the client (Windows) side
  5. Mess with SMB settings on the server (FreeNAS) side
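For knobs 1-3, this is roughly how I check and (where possible) change them from the FreeNAS shell. "tank" and "tank/videos" are placeholders for my pool and dataset, and recordsize/compression changes only affect data written after the change:

Code:
  # Check current values for the dataset behind the SMB share
  zfs get recordsize,compression tank/videos

  # Show the ashift actually used per vdev (FreeNAS keeps its pool cache file here;
  # ashift is fixed at vdev creation and cannot be changed afterwards)
  zdb -U /data/zfs/zpool.cache -C tank | grep ashift

  # Knob 1: raise recordsize to 1M for large sequential files (new writes only)
  zfs set recordsize=1M tank/videos

  # Knob 3: cheap LZ4 compression
  zfs set compression=lz4 tank/videos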
I'm sure I missed some obvious tuning knobs despite reading the various great guides. I also have no idea whether any of the tuning knobs listed above have any meaningful effect. My forum searches didn't really yield the info I'm looking for, but maybe I just used the wrong search keywords.

I'd appreciate any guidance, even if it's RTFM (which of the manuals?) and/or the correct search phrases to find useful results.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
10x 6TB HGST HUS726060AL5210 (7.2k SAS 512e HDD) in an 8+2 RAIDZ2 configuration (1 vdev)
There is the problem. Having just one vdev limits your performance to roughly the performance of a single drive. More vdevs generally mean more IOPS. If you have a need for speed, you need more vdevs, and the easy way to get there is to use mirror vdevs. It cuts your storage capacity, but it vastly increases your IO capacity.
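As a rough sketch, a ten-disk pool built from mirrors would look something like this (device names are placeholders, creating a pool destroys what is on the disks, and on FreeNAS you would normally build this through the GUI so the disks get partitioned and labelled properly):

Code:
  # Five striped 2-way mirror vdevs instead of one 10-disk RAIDZ2
  zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    mirror da8 da9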
Enable compression (current status: off)
You want to have compression on because it reduces the quantity of data that must be written to disk, giving you more bang for the buck on IO. If you use LZ4 compression, it doesn't cost you much CPU time, and you have plenty of CPU capacity. LZ4 also gives up fairly quickly when a data block is not compressible, but if something is compressible, it will compress it.
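Something like this, where "tank/videos" is just a placeholder dataset name:

Code:
  # Turn LZ4 on for the dataset behind the SMB share
  zfs set compression=lz4 tank/videos

  # After copying some data, check how much it actually saved
  zfs get compressratio tank/videos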
32GB Registered ECC DDR3 RAM
More memory normally helps with caching but with files this large, it may not be possible to cache anything.
2U 12x 3,5" LFF
With that number of drive bays, I would plan to fill them all, but I would also pick a 4U system so I could have more drives. There is nothing like having more drives when speed is your goal.
 
Joined
Dec 3, 2018
Messages
13
There is the problem. Having just one vdev limits your performance to roughly the performance of a single drive. More vdevs generally mean more IOPS. If you have a need for speed, you need more vdevs, and the easy way to get there is to use mirror vdevs. It cuts your storage capacity, but it vastly increases your IO capacity.
I'm looking for throughput, not IOPS. Write throughput with a single vdev already vastly exceeds the throughput of a single drive (tested by copying a 200 GB video file to the NAS, compression factor 1.14x with LZ4, caching effects should be negligible at that file size). iXsystems also says that throughput (streaming read, streaming write) scales with the number of disks (minus parity) in a single vdev.

AFAICS (unless iXsystems is mistaken), streaming write throughput for a single-vdev RAIDZ2 is larger than for a dual-vdev RAIDZ2 if the number of disks is kept constant. iXsystems also writes that for IOPS the opposite is true. I know that lots of people in this forum disagree with iXsystems on the first point, but I never found out why.

Admittedly my workload is unusual... no random access, total number of files is ~300, workload is streaming write only.
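For what it's worth, here is roughly how I sanity-checked local pool throughput with Samba out of the picture. Paths and device names are placeholders, and I ran the write test against a dataset with compression switched off so that writing zeroes gives honest numbers:

Code:
  # Sequential write straight to the pool (compression=off on this dataset,
  # otherwise /dev/zero compresses away and the result is meaningless)
  dd if=/dev/zero of=/mnt/tank/speedtest/bigfile bs=1M count=100000

  # Read-only baseline from a single member disk for comparison (da0 is a placeholder)
  dd if=/dev/da0 of=/dev/null bs=1M count=20000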

You want to have compression on because it reduces the quantity of data that must be written to disk, giving you more bang for the buck on IO. If you use LZ4 compression, it doesn't cost you much CPU time, and you have plenty of CPU capacity. LZ4 also gives up fairly quickly when a data block is not compressible, but if something is compressible, it will compress it.
Right. Compression is now on. I get roughly 1.14x compression from LZ4. The CPUs are mostly idle even with LZ4 on, so this may help a bit. I just worried that after compression the block sizes may be uneven, and thus RAIDZ2 may incur significant padding overhead.

With that number of drive bays, I would plan to fill them all, but I would also pick a 4U system so I could have more drives. There is nothing like having more drives when speed is your goal.
I just tested adding more drives to the vdev and indeed it also helped with write speed over SMB. Thanks.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I have never seen that Intel 10gig do more than 500mbps. And yes more vdevs will always help performance even though it's said to only help iops. Iops are almost always the bottleneck so more is better.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I know that lots of people in this forum disagree with iXsystems on the first point, but I never found out why.
I don't think it is a disagreement; I think you may have a misunderstanding. I have a lot of experience with the differences in performance between vdev/pool layouts from my work, so I know from direct observation and use, not theory, how different configurations perform. In the group of servers I manage at work, I have one server with 60 drives split into four vdevs of 15 drives each, and another server with 60 drives split into ten vdevs of six drives each. I can tell you with certainty that the system with ten vdevs is able to fully saturate a 10Gb network link, while the system with four vdevs is doing all it can manage if it is doing half that, and most of the time it performs in the 250Mb/s to 500Mb/s range.
 
Joined
Dec 3, 2018
Messages
13
Now that the event is over, I can share my results.

Video footage was recorded on Blackmagic Design HyperDeck Studio 2 and similar devices to Samsung 860 Evo SSDs. The file system on the SSDs was exFAT. Importing the SSDs directly into the FreeNAS system was not possible due to missing exFAT support, so we used two Windows clients connected over a 10 Gb/s network to the FreeNAS server. The recordings were single large files of 30-260 GB each. In total, we had to copy roughly 20 TB of such large files to the NAS. Reading the files during the event was not a concern; we just needed to archive video footage for later editing.

The disk array was a single vdev of 10x 6TB HGST HUS726060AL5210 (7.2k SAS 512e HDD) in an 8+2 RAIDZ2 configuration.

I have never seen that Intel 10gig do more than 500mbps.
I didn't have time to change the Intel X520 to another NIC, so I measured with the configuration I had. Peak real-world performance was >820 MB/s via Samba at three concurrent file copies to the FreeNAS server. Improved drivers for the Intel X520 in recent FreeNAS releases may have contributed to that result.
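To separate the network from the disks, I would measure the raw link with something like iperf3 (assuming it is installed on both ends; the NAS address and stream count are placeholders):

Code:
  # On the FreeNAS box
  iperf3 -s

  # On each Windows 10 client (NAS IP is a placeholder)
  iperf3 -c 192.168.1.10 -P 4 -t 30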

I have a lot of experience with the differences in performance between vdev/pool layouts from my work, so I know from direct observation and use, not theory, how different configurations perform. In the group of servers I manage at work, I have one server with 60 drives split into four vdevs of 15 drives each, and another server with 60 drives split into ten vdevs of six drives each. I can tell you with certainty that the system with ten vdevs is able to fully saturate a 10Gb network link, while the system with four vdevs is doing all it can manage if it is doing half that, and most of the time it performs in the 250Mb/s to 500Mb/s range.

From empirical testing and real-world usage, I can now tell you that a single vdev of 10 spinning drives can handle write-only large-file loads from two clients at a sustained total speed of 730 MB/s for a few hours.

My bottleneck was the USB-SATA adapters for reading SSDs in the clients (the best ones I could find for USB 3.0 reached 370-380 MB/s read speed with UASP). I should have planned for two adapters per client instead of one; that would have moved the bottleneck elsewhere. An SD card reader on one of the clients was able to push an additional 90 MB/s over Samba to the NAS without any visible reduction in write speed for the copy from SSD to the NAS on the same client.

Hints for anyone planning something similar:
  • USB3-SATA adapters are slow (350-380 MB/s for UASP, <300 MB/s for BOT) even if a fast SSD is plugged into the adapter
  • Parallelize writes to the NAS; use more than one client to benefit from Samba 4.9 speed improvements
  • Getting Windows 10 to use Jumbo Packets is difficult, especially with an Asus XG-C100C
  • Jumbo Packets might help, but even with a normal MTU of 1500 you get really decent speeds (the measurements above are with a MTU of 1500)
  • FreeNAS 11.2-U4.1 has really fast network drivers and a really fast Samba version; don't settle for older versions
  • Use frequent snapshots, people doing the copying will make mistakes and delete stuff by accident
  • Keep using snapshots after the event because people repairing files may "repair" the wrong files
  • The files you get may already be broken, make sure you verify their contents in some way (e.g. for videos use "mplayer -identify") before you tell people that everything is archived (we had at least one video file with 4.2 GB of zeroes at the end instead of the expected content)
  • Keep everything on two servers, use two different sets of client hardware to copy the files (one set of hardware per server) and copy each source medium twice (once per server) to take care of systematic hardware problems
  • If you plan to compare the sets of files between the two servers after the event, be aware that the sha1 utility in FreeBSD is horribly slow; running "openssl sha1" is faster, but you'll still have to parallelize hashing because the CPU is your bottleneck (see the sketch after this list)
  • Back up your config
  • Snapshots, they will save your data and the day
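Regarding the hashing hint above, this is the kind of parallelization I mean. The paths and the job count of 8 are placeholders, and both servers need to be hashed from the same relative directory so the resulting lists are comparable:

Code:
  # Hash every archived file with 8 openssl processes in parallel
  cd /mnt/tank/archive
  find . -type f -print0 | xargs -0 -n 32 -P 8 openssl sha1 | sort > /tmp/hashes-serverA.txt

  # Repeat on the second server, then compare the two lists
  diff /tmp/hashes-serverA.txt /tmp/hashes-serverB.txt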
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
I have never seen that Intel 10gig do more than 500mbps. And yes more vdevs will always help performance even though it's said to only help iops. Iops are almost always the bottleneck so more is better.
This is copying a file via SMB from the FreeNAS Server 2 in my signature, which has an X520-DA2. The Intel cards need some tuning but can work quite well.
[Attached screenshot: SMBX520-DA2.JPG]
 

Sharethevibe

Dabbler
Joined
Aug 21, 2019
Messages
21
I have set up a similar NAS, in that it only holds (fairly) large files: 50% around 50MB, 50% around 10MB.
(music collection, approx 1 million files).

Workload is
(1)
adding/writing large quantities (say runs of 200,000 files / multiple TB) to the NAS and, vice versa, reading large quantities to copy elsewhere.
(e.g. via connection by 10Gb to fast workstation).
(2)
And next to that, mass reading of (and writing to) the tag data of the files (tag data is usually around 100 kB in size).
(say the tags of 200,000 files are read in order to do comparisons etc.)

Relevant to know is that the cache (RAM/NVMe SSD) will typically not hold the files that are read (99% of reads are for other files, stored in the disk pool).

The signature below shows my current set-up (disks still empty, currently setting up/testing).

The goal is to have at least 500 MB/s when transferring files.
And a tag-data read speed that will utilize the CPUs of the NAS (and the CPUs of the workstation analysing the tag data). In a 3-disk RAID0 pre-test setup under Windows, the disks were the bottleneck (the workstation's 2x Xeon E5-2670 running at 20%).

Question is what and how to tune to achieve this.
My main observations/questions:
a)
Are IOPS and max record size limiting the transfer speed? Is throughput = IOPS x max recordsize?
With 2 vdevs and WD Reds (EZAZ/EMAZ) probably doing say 75 IOPS per disk, that's 150 x 1MB, so 150 MB/s?
(I saw what Chainsaw had as sustained speed (730 MB/s), but I also read e.g. cyberjock's input that empty pools perform way better than filled pools, because in that situation ZFS can write long contiguous runs in one go, I believe, so the 1MB max record size is then no limitation?)
(I also did a first transfer test with a 50GB bundle (exceeding the 32GB RAM) and that ran at 350 MB/s (limited by the source SATA SSD), but this may also be due to the 'empty pool effect'?)

b)
As for tagdata-reading: here I assume IOPS is the limiting factor?
Metadata of the ZFS system (block addresses etc.) is stored in ARC/RAM (and L2ARC/NVMe SSD).
Can I use the same RAM/NVMe SSD to let these handle the file tag data?
(tag data, as you may know, is part of the music file, stored in the first or last sectors).

What advice would you give me for this set-up, when e.g. setting:
- recordsize
- ashift
- size of transaction group
- ZIL use / sync setting (I intend to switch it off, i.e. set sync=disabled)
- size of cache in RAM / on NVMe SSD.

I'm a newbie on FreeNAS, but I'd like to set this up properly in order to handle the data load effectively.
So all advice welcome! And questions just shout.

Tx in advance!
 
Joined
Dec 3, 2018
Messages
13
Hi Sharethevibe,

welcome to the forum.

I have set up a similar NAS, in that it only holds (fairly) large files: 50% around 50MB, 50% around 10MB.
(music collection, approx 1 million files).
I hate to break it to you, but those files are not fairly large; they are small to medium sized at best. Please note that my scenario only allowed ignoring IOPS because my minimum file size was 40000 MB (three orders of magnitude larger) and there were no read-modify-write cycles.

Workload is
(1)
adding/writing large quantities (say runs of 200,000 files / multiple TB) to the NAS and, vice versa, reading large quantities to copy elsewhere.
(e.g. via connection by 10Gb to fast workstation).
(2)
And next to that, mass reading of (and writing to) the tag data of the files (tag data is usually around 100 kB in size).
(say the tags of 200,000 files are read in order to do comparisons etc.)
Even with two striped vdevs, you will still feel the pain of too few IOPS for your scenario. With your file sizes, probably more than half of the time the disks will be busy seeking instead of reading.
IOPS are extremely access pattern dependent, and your first test of just writing a bunch of files to the NAS is in some ways almost a best case for performance. Read/modify/write cycles with a large working set are decidedly not fun. You may have to reconfigure your setup to get more IOPS from it, and I guess that more RAM and more vdevs will be needed. The L2ARC might even be somewhat ineffective in your scenario.
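Once you start testing, it helps to watch what the pool is actually doing while the workload runs; "tank" is a placeholder pool name:

Code:
  # Per-vdev IOPS and bandwidth, refreshed every 5 seconds
  zpool iostat -v tank 5

  # Per-disk busy percentage; a seek-bound pool shows disks near 100% busy
  # while moving comparatively little data
  gstat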

I strongly suggest heeding any advice from Chris Moore and others about optimal pool configurations.

Good luck!
 

Sharethevibe

Dabbler
Joined
Aug 21, 2019
Messages
21
Thanks Chainsaw for your swift reply.
Also did some more digging into the actual IOPS-performance of the hard-disks I've bought for the NAS.

Here's my extra input/reasoning (it indeed would be good if any of the guys with experience/knowledge would shed some light on this, indicating whether my understanding is right or not):

I) Re the transfer-speed:
I.1)
The determining factor for the speed when transferring files is the IOPS performance of the disk multiplied by the maximum record size (set in FreeNAS, per dataset).

FreeNAS chooses optimal block sizes per file, but this is capped by the configured max record size.
Thus, with files that are always larger than the max recordsize, it is the max recordsize that determines how much data is read/written per IO action.

Side-note:
it does not matter much whether these are 'large or very large' files; when they are multiples of the max recordsize (default 128kB, max 1MB), the transfer process is governed by the system almost continuously assembling blocks of 128kB / 1MB and reading/writing these.

I.2)
When the pool has a lot of free space, ZFS is able to create 'strings of blocks' that it reads/writes in one IO action, thus reading/writing much more per IO action and achieving far greater transfer speeds (up to 4-5x more).
This effect is already largely gone (down to +40%) at 50% used disk space and disappears entirely at 70%.
So only when you purposely overbuy/overprovision disks (e.g. triple the expected data size) will you get very good IOPS performance because of this. Stated otherwise: with fair usage of your pool (70% used, 30% free), you should neglect this effect.
Source: (Jgreco input on this forum) https://www.ixsystems.com/community/threads/i-o-performance-planning.40002/

I.3)
On the IOPS-performance of the disk:
I have WD Reds, 8TB, NAS-types, 5400 rpm with 256MB cache (the 'EMAZ'-types, equal to the 'EFAZ'-types).
The FreeNAS forum gives a calculation for determining the IOPS of the single disk:
1 / (average seek time + average rotational latency), in seconds. This gives 1 / (0.008 + 0.006), which is approx 70 IOPS.

Yet, when looking at the measured IOPS as given by Tomshardware.com, it's:
- for read: 200-500 IOPS (queue depth 1 to 32)
- for write: 500-600 IOPS
Source: https://www.tomshardware.com/reviews/wd-red-10tb-8tb-nas-hdd,5277-2.html.
(and this is for the 128MB cache version, where mine are 256MB).

Also, when I compare it with similar tests for 3TB WD-Red disks, these have IOPS of 100-150.

So, it seems to me that in a real-world situation this 8TB WD NAS disk is vastly outperforming its earlier/smaller brothers, and the IOPS figures of 200-600 are valid for this disk type?
(WD probably making good use of the high data density, the large cache, TLER (reducing error-seek time) and whatever other tricks WD knows for reducing seek/latency times?)

And as the runs in my processes are typically large (100,000-200,000 files), I reckon that the IOPS figures for a queue depth of 32 are the most relevant for my processes?

I.4)
Calculating with say 400 read IOPS per disk and 500 write IOPS: in a 2-vdev pool, we get:
- 800 IOPS for read
- 1000 IOPS for write.
When setting the max record size to 1MB (1024 kB), the max transfer speeds (with a 50/50 10MB/50MB file mix) are similarly:
- 800 MB/s for read
- 1000 MB/s for write.

(the max bandwidth of a single disk being 250 MB/s, and with 8 disks in the pool, the aggregate bandwidth far exceeds this and so is not a bottleneck).

Having this 800/1000 as read/write-speeds is fine with me (my goal is minimum of 500 MB/s).

.
.

II)
The determining factor for the reading/writing of (100 kB) tag data is only the IOPS performance of the disk.
As the tag data (part of the music file) is only 100 kB, each IO action handles one file.

Calculating again with 400 read IOPS per disk and 500 write IOPS: in a 2-vdev pool, we get:
- 800 IOPS for read, so 800 files are read per sec
- 1000 IOPS for write, so 1000 files are written per sec.

Questions:
- tag data being part of the music files, and the music files being in a dataset with recordsize set to 1MB, does this mean anything for the size of the data read or written?
- approx. 1000 tag-data reads (of 100 kB) per second is good, but would there be a way to use the NVMe SSD for handling this tag data? (this will of course have a very high handling speed, with IOPS in the 100,000+ range)

.
.

Am I making mistakes in my reasoning, and if so, where? Or is this a fairly good estimation of the expected file-transfer / tag-data-handling performance?

For the record:
- I neglect the use of the RAM/NVMe SSD for caching (I expect little benefit from that)
- RAM is 32 or 48GB (a fast 500GB NVMe SSD is also available, but not in use yet)
- I intend to switch sync off (sync=disabled), so no lag from that.

Thanks in advance for your input!
 

Sharethevibe

Dabbler
Joined
Aug 21, 2019
Messages
21
Perhaps better to start a new thread on the specifics of my NAS?
 