iSCSI performance question...

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
I'm building out a box with a good deal of storage for use as the iSCSI backend to a VMware cluster ... I did this once before and the guidance I got from here was extremely beneficial - unfortunately, the last system I built has been outgrown.

Basic build here is big but straightforward:
  • At least 8 cores in the CPU
  • 128GB Memory
  • (2) 10Gb SFP+ uplinks
  • (2) 240GB 6Gb/s SATA SSDs for the OS
  • (2) 1.92TB PCIe M.2 NVMe SSDs for L2ARC
  • (2) 480GB 6Gb/s SATA SSDs for ZIL
  • (18) 6TB 12Gb/s SAS spindles for the ZFS pool

The disks will be configured in 3-way mirrors, yielding a final usable pool of 36TB and an iSCSI volume of 18TB (all rough numbers). The main question is: will the 6TB disk mirrors kill the performance? Last time I went with 2TB disks to improve performance, but there would be so many disks involved here that the smallest disks I could really consider are 4TB.

I also considered an SSD-backed RAIDZ scenario ... but I wasn't sure the performance would be there, and I don't think it could be extended as easily. Feel free to correct me.

Also, feel free to tell me I'm going about this all wrong and asking for trouble; I'm open to suggestions from the resident experts. Nothing is bought yet, so I can go in a completely different direction.

Thanks
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Since ZFS is copy-on-write, for iSCSI you want the largest disks possible, so you have enough free space to stay off the wrong side of the ZFS performance curve (rule of thumb: < 50% occupancy), and for IO, spread them over as many VDEVs as practical.
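For illustration, a layout along those lines might look like the sketch below (pool and device names are placeholders), with the iSCSI zvol sized to roughly half the pool:

Code:
# 6 three-way mirror vdevs from 18 disks (da0..da17 are placeholder device names)
zpool create tank \
  mirror da0 da1 da2 \
  mirror da3 da4 da5 \
  mirror da6 da7 da8 \
  mirror da9 da10 da11 \
  mirror da12 da13 da14 \
  mirror da15 da16 da17

# keep the zvol at ~50% of the pool so occupancy stays on the right side of the curve
# (sparse zvol shown; volblocksize is just an example, tune it for the VMware workload)
zfs create -s -V 18T -o volblocksize=16K tank/vmware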
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
This might prove helpful:
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Thanks, read through that and it largely reaffirmed what I learned during my last build.

I think my question, though, is more centered around the disk size impact ... I get that larger is better, so if that's the simple answer, then that's it; but the number of vdevs also plays in, and bigger disks mean fewer vdevs. Unless I'm confused.

For example, below are several disk configurations that yield roughly the same end result storage-wise. From a performance perspective, am I better off running more, smaller disks, or fewer, larger ones? What's the trade-off, or in this case, where's the sweet spot? (The quick arithmetic behind these numbers is sketched just after the table.)
Qty/Size of disks   Raw capacity   3-way mirror vdevs   Overall pool capacity   Usable for block storage
24 / 4TB            96TB           8                    32TB                    16TB
18 / 6TB            108TB          6                    36TB                    18TB
12 / 8TB            96TB           4                    32TB                    16TB
9 / 12TB            108TB          3                    36TB                    18TB
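For reference, the quick arithmetic behind those numbers (3-way mirrors, with block storage capped at ~50% of the pool):

Code:
# reproduce the table rows: raw capacity, vdev count, pool size, and ~50% usable for block storage
for cfg in "24 4" "18 6" "12 8" "9 12"; do
    set -- $cfg; qty=$1; size=$2
    vdevs=$((qty / 3)); pool=$((vdevs * size))
    echo "${qty}/${size}TB: raw=$((qty * size))TB vdevs=${vdevs} pool=${pool}TB block=$((pool / 2))TB"
done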

Fewer disks mean a smaller chassis and lower per-TB cost, but performance is the true objective ... so if spending more is the better route, does the performance gain justify the increased cost?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Larger disks only really get you faster sequential access. For your use case, you'd be better off with more VDEVs for faster random access. I think your 18/6 TB line is probably your sweet spot.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Thanks ... is there much benefit to more cores in the CPU? Or is higher clock speed more of a win?
And is FreeNAS thread-aware, or only core-aware?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
FreeNAS is thread-aware, with some related tunables:

Code:
hw.vmm.topology.cores_per_package: 4 <- number of physical cores per package
hw.vmm.topology.threads_per_core: 2  <- number of threads per core


That said, iSCSI is not CPU limited, especially if you have NICs that can perform DMA.
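If you want to sanity-check what the OS sees on a given box, something like this should report the detected topology (standard FreeBSD sysctls; exact names can vary a bit between releases):

Code:
# logical CPUs, physical cores, and SMT threads per core as detected at boot
sysctl hw.ncpu kern.smp.cpus kern.smp.cores kern.smp.threads_per_core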
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Another performance-related question ... again, looking for the balance. For the L2ARC, obviously faster is better, but how fast is fast enough?
For example, again ...

I can put in a "1.92TB Samsung PM983 Series M.2 PCIe 3.0 x4 NVMe Solid State Drive" that runs at 480K read IOPS and 42K write IOPS
or
I could use an "Intel Optane SSD DC P4800X Series 375GB PCIe 3.0 x4 NVMe Solid State Addon Card" that runs at 550K read IOPS and 500K write IOPS.

The Intel will run faster, but it gives me a quarter of the cache at over twice the price. If money were no object, I'd get a 1.5TB Optane card and call it a day - but money is always an object, so I need to be confident the boost I'm likely to see will make the added cost of one approach over the other worth it.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
And along those lines, would I be better served running the somewhat slower L2ARC and diverting the 'savings' to boost the system memory to 256GB?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Thanks for the link; it has some useful information, but it largely seems to discuss the prospect of having NVMe do double duty for logging and caching, and how that would defeat the purpose of the NVMe; unless I misunderstood the course of the thread. It ultimately does point out the huge benefit of the L2ARC, which I'm not questioning, and does so using an Optane as the reference ... but it doesn't really cover the scenario of "not quite as fast" NVMe, I don't think.

I'm far from an expert here, hence the questions, so forgive me if I'm completely wrong or off base; but logically, I figure, the process goes sort of like so ..
  1. Data requested
  2. Check L1Arc
  3. If not found, check L2Arc and load to L1Arc
  4. If still not found, load from disk in to L1Arc
  5. Data served from L1Arc (or at the same time it's being loaded there)
  6. as L1Arc overflows, write the overflowing data to L2Arc (it's also possible that as data is loaded from the disks it is simultaneously loaded into both the L1 and L2 caches, and L1 is just allowed to overflow ... which might make more sense)

Regardless of how/when the data gets to L2, the speed of the data being retrieved by the requester would theoretically be dictated by:
  1. the speed of L1
  2. the Read performance of L2 (not the Write performance, unless data was being read out faster than it could be written and buffers started getting backed up somewhere)
  3. the speed of the disk access.

If there is any truth to this, then it seems that a larger L2Arc that's close to lightning-fast for reads would be a better option than a smaller one that actually is lightning-fast ... especially if the capacity of the L1Arc were increased to reduce the calls to L2 further.
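If this reasoning is roughly right, I assume I can sanity-check it once the box is running by watching the ARC/L2ARC hit and miss counters, something like:

Code:
# ARC vs. L2ARC hit/miss counters exposed by FreeBSD's ZFS kstats
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses \
       kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses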

Please feel free to educate me here as I'm really trying to understand pros/cons so that I can spend money where it makes the most sense.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
You're on the right track if you seek maximum performance for iSCSI. To go further, and in answer to some of your questions:
  • Install the maximum RAM supported by your motherboard. The rule of thumb is to do this first before adding an L2ARC device -- because memory is always faster than flash or disk.
  • Install a ZIL SLOG device with low latency, fast write speeds, high endurance, and power loss protection (i.e., a built-in 'supercapacitor'). Provided it has the above characteristics, your NVMe device would be much better suited for this purpose than the standard 480GB SATA SSDs you specified above.
  • Install an L2ARC device that supports I/O at a higher rate than the spinning disks. Capacity doesn't need to be more than ~4 times the size of your installed RAM. The 480GB SSDs you listed above would work in this role, but ditching them and using another NVMe device instead would increase performance yet more (both devices are easy to add later; see the sketch after this list).
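Adding either device to an existing pool is a one-liner; a rough sketch with placeholder pool and device names:

Code:
# attach a SLOG and an L2ARC device to an existing pool (tank, nvd0, nvd1 are placeholders)
zpool add tank log nvd0     # SLOG: low-latency, high-endurance NVMe with power loss protection
zpool add tank cache nvd1   # L2ARC: size it at no more than ~4x installed RAM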
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
The design as it stands has changed from the above ... here's where I'm at with more specifics...

A SuperMicro 6049P-E1CR36L chassis with an X11DPH-T motherboard and:
  • (2) Xeon 3206R 8-core CPUs (1.9GHz, 11MB)
  • 256GB 2933MHz ECC Memory
  • (2) 256GB Micron SSDs (for the OS)
  • (1) 2TB Intel DC P4511 M.2 PCIe NVMe (for the L2Arc) (has Enhanced Power Loss Data Protection according to Intel's site)
  • (1) 375GB Intel Optane P4801X M.2 PCIe NVMe (for the ZIL/SLOG) (has Enhanced Power Loss Data Protection according to Intel's site)
  • (1) Intel X710-DA2 dual 10Gb/s SFP+ card
  • (20) 6TB SAS 12Gb/s 7200 disks (six 3-way mirrors plus two hot spares)
Anticipated yield is about 18TB for iSCSI storage behind a VMware cluster.

So, on topic, will it FreeNAS?

The next memory bump (512GB) adds another $1500 on top of an already expensive system, and while I'd love to "max" it out ... 2TB of memory runs around $20k, and the specs say the chassis is good for double that ... it's not my money, but it's not a bottomless pit either :).

I'm still open to thoughts to course correct, and ideas if there is a better path worth looking at.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
If I drop the 2TB P4511 for the L2 and go up to 512GB of system memory, I have a net cost change of about $900 ... but I feel like even though the L2 won't perform as well as the doubled L1, the greater capacity of the L2 may be more beneficial.
The VMDKs of a few hundred VMs will be living on this.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
kitt001 said:
The design as it stands has changed from the above ... here's where I'm at with more specifics... [full build list quoted above]
I like it!

The 256GB OS drives are vastly over-sized. Your proposed motherboard has two SATA DOM ports; you could populate these with 16GB SuperMicro devices. They also offer these in 32, 64, and 128GB capacity.

Disk I/O is usually a system's throughput bottleneck; with your proposed system, I'm wondering if the 10Gb/s NIC will be yours. With so many VMs you may actually derive benefit from using both ports in an LACP configuration.

Perhaps @jgreco will see this thread and give his insight; he has a lot of experience with high-end, high-performance systems.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Considered the DOMs, but the 128GB DOMs are $175ea (may not need that much either, but I'd rather oversize the OS disk regardless), and the 256GB SATA SSDs are $70ea. Those may be really oversized, but they're a bargain by comparison; and the chassis has two 2.5-inch bays that I have no other use for.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Using both ports: yes, and with iSCSI, MPIO please, not LAG/LACP.
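In practice that means each 10Gb port gets its own IP on its own subnet/VLAN (no lagg device at all) and the iSCSI initiators handle path selection. A rough sketch, with made-up interface names and addresses:

Code:
# one subnet per port, no link aggregation; ixl0/ixl1 and the addresses are examples only
ifconfig ixl0 inet 10.10.10.10/24 up
ifconfig ixl1 inet 10.10.20.10/24 up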
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, comments, fine.

1) You would be better off with two L2ARC devices rather than one. Loss of a sole L2ARC device will tank performance (assuming you've got need of L2ARC in the first place). Plain L2ARC does not require power loss protection.

2) The X710 is *probably* okay but you might be better off with the more commonly used X520-{SR,DA}2 or the Chelsio card, simply because this is *known* to work swimmingly well. Do NOT use LACP (LAGG) and use MPIO.

3) The low core speed CPUs might not be the best choice. On the other hand, iSCSI is super efficient and doesn't require a lot of CPU. I typically choose higher core speeds rather than core count just because it's ugly if you run into core speed as a bottleneck.

4) The SSD vs SATA DOM thing doesn't bother me. I wouldn't get a 16GB or probably even a 32GB SATA DOM though. 64GB or larger SSD or SATA DOM as it suits you.

5) If you're already spending this much money, hopefully you've browsed through https://www.ixsystems.com/community/threads/the-path-to-success-for-block-storage.81165/ and linked articles. One thing to note is that under heavy fragmentation, you might want to plan for less "usable" pool. Your current number appears to be 50%, at which point you hit steady state performance that's pretty blah. If you can afford larger disks, the extra space can help reduce fragmentation somewhat and help keep performance up. See any of my fragmentation posts which discuss this.
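An easy way to keep an eye on where you sit once the pool is in service (pool name is a placeholder):

Code:
# watch allocation, fragmentation, and occupancy creep up over time
zpool list -o name,size,allocated,free,fragmentation,capacity tank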
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
1) You would be better off with two L2ARC devices rather than one. Loss of a sole L2ARC device will tank performance (assuming you've got need of L2ARC in the first place). Plain L2ARC does not require power loss protection.

I selected the P4511 for its speed/durability; the power protection just happened to be there. If the only risk from failure is a performance tank, this isn't an enormous concern, it'd be more of an inconvenience. This backend will be hosting potentially hundreds of VM disks; however, it's a development/testing environment, so at any given time the majority of the machines are just sitting around waiting to be needed for something. If there was a hardware failure that slowed it down, shutting the box down to replace the part isn't a big problem. I could replace it with (2) 1TB NVMes and keep the pricing pretty close to where it is now if that is a better option to remove the potential disruption. You also question the need for the L2 ... if I drop the L2 entirely, I could potentially shift that savings to system memory and get it to 384GB (I think 512GB may be too far, but I'd need to re-math everything).

2) The X710 is *probably* okay but you might be better off with the more commonly used X520-{SR,DA}2 or the Chelsio card, simply because this is *known* to work swimmingly well. Do NOT use LACP (LAGG) and use MPIO.

No objection to going to the X520 .. it appears to be a bit cheaper anyway. The current intent is not to bond the interfaces; they are separate VLAN uplinks because they will go to separate Nexus switches and satisfy VMware's multipath expectation. VMware may do some sort of load balancing behind the scenes since it knows the same repository is available via multiple paths, but it's not something that's obviously configurable. Because of the nature of the system, I don't anticipate exceeding the network capacity of the ports (we haven't in the past on the existing FreeNAS backend that's being replaced because it reached its storage capacity), but it is a good point for future consideration. I may consider upgrading this to a 4-port card, or simply adding a second.

3) The low core speed CPUs might not be the best choice. On the other hand, iSCSI is super efficient and doesn't require a lot of CPU. I typically choose higher core speeds rather than core count just because it's ugly if you run into core speed as a bottleneck.

Since the system uses the Xeon Scalable CPUs, the cost of clock speed goes up pretty fast. I can move from the Bronze to Silver grade CPUs and eke out a bit more clock speed, but there is no trade-off where I could go down in cores and dramatically increase the clock, so the price just heads north. I can bump to 8-core 2.1GHz for roughly another $250 .. this seems like a good option based on your comment. The next step would be 2.5GHz for nearly +$700, then 3.2GHz for roughly +$1200 - these options would be nice, but may start to push the tentative budget I'm working with, unless I scale something back elsewhere.

Your current number appears to be 50%, at which point you hit steady state performance that's pretty blah. If you can afford larger disks, the extra space can help reduce fragmentation somewhat and help keep performance up.

The number was the max I had to work with .. my intent was to actually provision 15TB, and I still have 16 bays left in the chassis ... so I figured that I could simply add additional vdevs and expand the pool long before it reached capacity. FWIW, we currently need about 5TB, and at our rate of growth, it *should* take nearly 2 years to double that. (Now, having said that, I fully expect my user base to make a liar out of me, and I'll be adding disks in 6 months).
I settled on the 6TB disks for the balance of size and the number of vdevs they created. If I go larger, I'd need to cut back on the total to keep the cost in line, which would reduce the number of vdevs. Going back to a previous comment, if I get more or the same benefit from larger/fewer disks, I may be able to re-spec to a smaller chassis, save some cost there, and open up additional CPU options that may have higher clock rates.

If you're already spending this much money, hopefully you've browsed through https://www.ixsystems.com/community/threads/the-path-to-success-for-block-storage.81165/ and linked articles.

I did read this, and one point that you made was the benefit of larger disks over faster spindles. I opted for the higher-speed drives here figuring that because of the sheer quantity of disks involved, there would probably be a benefit if the HBA could save some time here and there managing all the disks. Perhaps that is an error in thinking on my part .. if I move away from the 12Gb/s 7200 SAS disks, I could probably look at larger/slower disks without a huge net cost differential.
 