iSCSI performance question...

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
I'm building out a box with a good deal of storage for use as the iSCSI backend to a VMware cluster ... I did this once before and the guidance I got from here was extremely beneficial - unfortunately, the last system I built has been outgrown.

Basic build here is big but straightforward:
  • At least 8 cores in the CPU
  • 128GB Memory
  • (2) 10Gb SFP+ uplinks
  • (2) 240GB 6Gb/s SATA SSDs for the OS
  • (2) 1.92TB PCIe M.2 NVMe SSDs for L2ARC
  • (2) 480GB 6Gb/s SATA SSDs for ZIL
  • (18) 6TB 12Gb/s SAS spindles for the ZFS pool

The disks will be configured in 3-way mirrors, yielding a final usable pool of 36TB and an iSCSI volume of 18TB (all rough numbers). The main question is: will the 6TB disk mirrors kill the performance? Last time I went with 2TB disks to improve performance, but there would be so many disks involved here that the smallest disks I could really consider are 4TB.

I also considered an SSD-backed RAIDZ scenario ... but I wasn't sure the performance would be there, and I don't think it could be extended as easily. Feel free to correct me.

Also, feel free to tell me I'm going about this all wrong and asking for trouble; I'm open to suggestions from the resident experts. Nothing is bought yet, so I can go in a completely different direction.

Thanks
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Since ZFS is copy-on-write, for iSCSI you want the largest disks possible, so you have enough free space to stay off the wrong side of the ZFS performance curve (rule of thumb: < 50% occupancy), and for IO, spread them over as many VDEVs as practical.
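For illustration, a layout along those lines might look like the sketch below (pool and device names are placeholders), with the iSCSI zvol sized to roughly half the pool:

Code:
# 6 three-way mirror vdevs from 18 disks (da0..da17 are placeholder device names)
zpool create tank \
  mirror da0 da1 da2 \
  mirror da3 da4 da5 \
  mirror da6 da7 da8 \
  mirror da9 da10 da11 \
  mirror da12 da13 da14 \
  mirror da15 da16 da17

# keep the zvol at ~50% of the pool so occupancy stays on the right side of the curve
# (sparse zvol shown; volblocksize is just an example, tune it for the VMware workload)
zfs create -s -V 18T -o volblocksize=16K tank/vmware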
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
This might prove helpful:
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Thanks, read through that and it largely reaffirmed what I learned during my last build.

I think my question, though, is more centered around the disk size impact ... I get that larger is better, so if that's the simple answer, then that's it; but the number of vdevs also plays in, and bigger disks mean fewer vdevs. Unless I'm confused.

For example, below are several disk configurations that yield roughly the same end result storage-wise. From a performance perspective, am I better off running more, smaller disks, or fewer, larger ones? What's the trade-off, or in this case, where's the sweet spot? (The quick arithmetic behind these numbers is sketched just after the table.)
Qty/Size of disks   Raw capacity   3-way mirror vdevs   Overall pool capacity   Usable for block storage
24 / 4TB            96TB           8                    32TB                    16TB
18 / 6TB            108TB          6                    36TB                    18TB
12 / 8TB            96TB           4                    32TB                    16TB
9 / 12TB            108TB          3                    36TB                    18TB
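For reference, the quick arithmetic behind those numbers (3-way mirrors, with block storage capped at ~50% of the pool):

Code:
# reproduce the table rows: raw capacity, vdev count, pool size, and ~50% usable for block storage
for cfg in "24 4" "18 6" "12 8" "9 12"; do
    set -- $cfg; qty=$1; size=$2
    vdevs=$((qty / 3)); pool=$((vdevs * size))
    echo "${qty}/${size}TB: raw=$((qty * size))TB vdevs=${vdevs} pool=${pool}TB block=$((pool / 2))TB"
done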

Fewer disks mean a smaller chassis and lower per-TB cost, but performance is the true objective ... so if spending more is the better route, does the performance gain justify the increased cost?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Larger disks only really get you faster sequential access. For your use case, you'd be better off with more VDEVs for faster random access. I think your 18/6 TB line is probably your sweet spot.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Thanks ... is there much benefit to more cores in the CPU? Or is higher clock speed more of a win?
And is FreeNAS thread-aware, or only core-aware?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
FreeNAS is thread-aware, with some related tunables:

Code:
hw.vmm.topology.cores_per_package: 4 <- number of physical cores per package
hw.vmm.topology.threads_per_core: 2  <- number of threads per core


That said, iSCSI is not CPU limited, especially if you have NICs that can perform DMA.
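If you want to sanity-check what the OS sees on a given box, something like this should report the detected topology (standard FreeBSD sysctls; exact names can vary a bit between releases):

Code:
# logical CPUs, physical cores, and SMT threads per core as detected at boot
sysctl hw.ncpu kern.smp.cpus kern.smp.cores kern.smp.threads_per_core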
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Another performance-related question ... again, looking for the balance. For the L2ARC, obviously faster is better, but how fast is fast enough?
For example, again ...

I can put in a "1.92TB Samsung PM983 Series M.2 PCIe 3.0 x4 NVMe Solid State Drive" that runs at 480K read IOPS and 42K write IOPS
or
I could use an "Intel Optane SSD DC P4800X Series 375GB PCIe 3.0 x4 NVMe Solid State Addon Card" that runs at 550K read IOPS and 500K write IOPS.

The Intel will run faster, but it gives me a quarter of the cache at over twice the price. If money were no object, I'd get a 1.5TB Optane card and call it a day - but money is always an object, so I need to be confident the boost I'm likely to see will make the added cost of one approach over the other worth it.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
And along those lines, would I be better served running the somewhat slower L2ARC and diverting the 'savings' to boost the system memory to 256GB?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Thanks for the link; it has some useful information, but it largely seems to discuss the prospect of having NVMe do double duty for logging and caching, and how that would defeat the purpose of the NVMe; unless I misunderstood the course of the thread. It ultimately does point out the huge benefit of the L2ARC, which I'm not questioning, and does so using an Optane as the reference ... but it doesn't really cover the scenario of "not quite as fast" NVMe, I don't think.

I'm far from an expert here, hence the questions, so forgive me if I'm completely wrong or off base; but logically, I figure, the process goes sort of like so ..
  1. Data requested
  2. Check L1Arc
  3. If not found, check L2Arc and load to L1Arc
  4. If still not found, load from disk in to L1Arc
  5. Data served from L1Arc (or at the same time it's being loaded there)
  6. as L1Arc overflows, write the overflowing data to L2Arc (it's also possible that as data is loaded from the disks it is simultaneously loaded into both the L1 and L2 caches, and L1 is just allowed to overflow ... which might make more sense)

Regardless of how/when the data gets to L2, the speed of the data being retrieved by the requester would theoretically be dictated by:
  1. the speed of L1
  2. the Read performance of L2 (not the Write performance, unless data was being read out faster than it could be written and buffers started getting backed up somewhere)
  3. the speed of the disk access.

If there is any truth to this, then it seems that a larger L2Arc that's close to lightning-fast for reads would be a better option than a smaller one that actually is lightning-fast ... especially if the capacity of the L1Arc were increased to reduce the calls to L2 further.
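If this reasoning is roughly right, I assume I can sanity-check it once the box is running by watching the ARC/L2ARC hit and miss counters, something like:

Code:
# ARC vs. L2ARC hit/miss counters exposed by FreeBSD's ZFS kstats
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses \
       kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses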

Please feel free to educate me here as I'm really trying to understand pros/cons so that I can spend money where it makes the most sense.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
You're on the right track if you seek maximum performance for iSCSI. To go further, and in answer to some of your questions:
  • Install the maximum RAM supported by your motherboard. The rule of thumb is to do this first before adding an L2ARC device -- because memory is always faster than flash or disk.
  • Install a ZIL SLOG device with low latency, fast write speeds, high endurance, and power loss protection (i.e., a built-in 'supercapacitor'). Provided it has the above characteristics, your NVMe device would be much better suited for this purpose than the standard 480GB SATA SSDs you specified above.
  • Install an L2ARC device that supports I/O at a higher rate than the spinning disks. Capacity doesn't need to be more than ~4 times the size of your installed RAM. The 480GB SSDs you listed above would work in this role, but ditching them and using another NVMe device instead would increase performance yet more (both devices are easy to add later; see the sketch after this list).
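Adding either device to an existing pool is a one-liner; a rough sketch with placeholder pool and device names:

Code:
# attach a SLOG and an L2ARC device to an existing pool (tank, nvd0, nvd1 are placeholders)
zpool add tank log nvd0     # SLOG: low-latency, high-endurance NVMe with power loss protection
zpool add tank cache nvd1   # L2ARC: size it at no more than ~4x installed RAM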
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
The design as it stands has changed from the above ... here's where I'm at with more specifics...

A SuperMicro 6049P-E1CR36L chassis with an X11DPH-T motherboard and:
  • (2) Xeon 3206R 8-core CPUs (1.9GHz, 11MB)
  • 256GB 2933MHz ECC Memory
  • (2) 256GB Micron SSDs (for the OS)
  • (1) 2TB Intel DC P4511 M.2 PCIe NVMe (for the L2Arc) (has Enhanced Power Loss Data Protection according to Intel's site)
  • (1) 375GB Intel Optane P4801X M.2 PCIe NVMe (for the ZIL/SLOG) (has Enhanced Power Loss Data Protection according to Intel's site)
  • (1) Intel X710-DA2 dual 10Gb/s SFP+ card
  • (20) 6TB SAS 12Gb/s 7200 disks (six 3-way mirrors plus two hot spares)
Anticipated yield is about 18TB for iSCSI storage behind a VMware cluster.

So, on topic, will it FreeNAS?

The next memory bump (512GB) adds another $1500 on top of an already expensive system, and while I'd love to "max" it out ... 2TB of memory runs around $20k, and the specs say the chassis is good for double that ... it's not my money, but it's not a bottomless pit either :).

I'm still open to thoughts to course correct, and ideas if there is a better path worth looking at.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
If I drop the 2TB P4511 for the L2 and go up to 512GB of system memory, I have a net cost change of about $900 ... but I feel like even though the L2 won't perform as well as the doubled L1, the greater capacity of the L2 may be more beneficial.
The VMDKs of a few hundred VMs will be living on this.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
kitt001 said:
The design as it stands has changed from the above ... here's where I'm at with more specifics... [full build list quoted above]
I like it!

The 256GB OS drives are vastly over-sized. Your proposed motherboard has two SATA DOM ports; you could populate these with 16GB SuperMicro devices. They also offer these in 32, 64, and 128GB capacity.

Disk I/O is usually a system's throughput bottleneck; with your proposed system, I'm wondering if the 10Gb/s NIC will be yours. With so many VMs you may actually derive benefit from using both ports in an LACP configuration.

Perhaps @jgreco will see this thread and give his insight; he has a lot of experience with high-end, high-performance systems.
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
Considered the DOMs, but the 128GB DOMs are $175ea (may not need that much either, but I'd rather oversize the OS disk regardless), and the 256GB SATA SSDs are $70ea. Those may be really oversized, but they're a bargain by comparison; and the chassis has two 2.5-inch bays that I have no other use for.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Using both ports: yes, and with iSCSI, MPIO please, not LAG/LACP.
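In practice that means each 10Gb port gets its own IP on its own subnet/VLAN (no lagg device at all) and the iSCSI initiators handle path selection. A rough sketch, with made-up interface names and addresses:

Code:
# one subnet per port, no link aggregation; ixl0/ixl1 and the addresses are examples only
ifconfig ixl0 inet 10.10.10.10/24 up
ifconfig ixl1 inet 10.10.20.10/24 up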
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, comments, fine.

1) You would be better off with two L2ARC devices rather than one. Loss of a sole L2ARC device will tank performance (assuming you've got need of L2ARC in the first place). Plain L2ARC does not require power loss protection.

2) The X710 is *probably* okay but you might be better off with the more commonly used X520-{SR,DA}2 or the Chelsio card, simply because this is *known* to work swimmingly well. Do NOT use LACP (LAGG) and use MPIO.

3) The low core speed CPUs might not be the best choice. On the other hand, iSCSI is super efficient and doesn't require a lot of CPU. I typically choose higher core speeds rather than core count just because it's ugly if you run into core speed as a bottleneck.

4) The SSD vs SATA DOM thing doesn't bother me. I wouldn't get a 16GB or probably even a 32GB SATA DOM though. 64GB or larger SSD or SATA DOM as it suits you.

5) If you're already spending this much money, hopefully you've browsed through https://www.ixsystems.com/community/threads/the-path-to-success-for-block-storage.81165/ and linked articles. One thing to note is that under heavy fragmentation, you might want to plan for less "usable" pool. Your current number appears to be 50%, at which point you hit steady state performance that's pretty blah. If you can afford larger disks, the extra space can help reduce fragmentation somewhat and help keep performance up. See any of my fragmentation posts which discuss this.
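An easy way to keep an eye on where you sit once the pool is in service (pool name is a placeholder):

Code:
# watch allocation, fragmentation, and occupancy creep up over time
zpool list -o name,size,allocated,free,fragmentation,capacity tank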
 

kitt001

Dabbler
Joined
Jun 2, 2017
Messages
29
1) You would be better off with two L2ARC devices rather than one. Loss of a sole L2ARC device will tank performance (assuming you've got need of L2ARC in the first place). Plain L2ARC does not require power loss protection.

I selected the P4511 for its speed/durability; the power protection just happened to be there. If the only risk from failure is a performance tank, this isn't an enormous concern, it'd be more of an inconvenience. This backend will be hosting potentially hundreds of VM disks; however, it's a development/testing environment, so at any given time the majority of the machines are just sitting around waiting to be needed for something. If there was a hardware failure that slowed it down, shutting the box down to replace the part isn't a big problem. I could replace it with (2) 1TB NVMes and keep the pricing pretty close to where it is now if that is a better option to remove the potential disruption. You also question the need for the L2 ... if I drop the L2 entirely, I could potentially shift that savings to system memory and get it to 384GB (I think 512GB may be too far, but I'd need to re-math everything).

2) The X710 is *probably* okay but you might be better off with the more commonly used X520-{SR,DA}2 or the Chelsio card, simply because this is *known* to work swimmingly well. Do NOT use LACP (LAGG) and use MPIO.

No objection to going to the X520 .. it appears to be a bit cheaper anyway. The current intent is not to bond the interfaces; they are separate VLAN uplinks because they will go to separate Nexus switches and satisfy VMware's multipath expectation. VMware may do some sort of load balancing behind the scenes since it knows the same repository is available via multiple paths, but it's not something that's obviously configurable. Because of the nature of the system, I don't anticipate exceeding the network capacity of the ports (we haven't in the past on the existing FreeNAS backend that's being replaced because it reached its storage capacity), but it is a good point for future consideration. I may consider upgrading this to a 4-port card, or simply adding a second.

3) The low core speed CPUs might not be the best choice. On the other hand, iSCSI is super efficient and doesn't require a lot of CPU. I typically choose higher core speeds rather than core count just because it's ugly if you run into core speed as a bottleneck.

Since the system uses the Xeon Scalable CPUs, the cost of clock speed goes up pretty fast. I can move from the Bronze to Silver grade CPUs and eke out a bit more clock speed, but there is no trade-off where I could go down in cores and dramatically increase the clock, so the price just heads north. I can bump to 8-core 2.1GHz for roughly another $250 .. this seems like a good option based on your comment. The next step would be 2.5GHz for nearly +$700, then 3.2GHz for roughly +$1200 - these options would be nice, but may start to push the tentative budget I'm working with, unless I scale something back elsewhere.

Your current number appears to be 50%, at which point you hit steady state performance that's pretty blah. If you can afford larger disks, the extra space can help reduce fragmentation somewhat and help keep performance up.

The number was the max I had to work with .. my intent was to actually provision 15TB, and I still have 16 bays left in the chassis ... so I figured that I could simply add additional vdevs and expand the pool long before it reached capacity. FWIW, we currently need about 5TB, and at our rate of growth, it *should* take nearly 2 years to double that. (Now, having said that, I fully expect my user base to make a liar out of me, and I'll be adding disks in 6 months).
I settled on the 6TB disks for the balance of size and the number of vdevs they created. If I go larger, I'd need to cut back on the total to keep the cost in line, which would reduce the number of vdevs. Going back to a previous comment, if I get more or the same benefit from larger/fewer disks, I may be able to re-spec to a smaller chassis, save some cost there, and open up additional CPU options that may have higher clock rates.

If you're already spending this much money, hopefully you've browsed through https://www.ixsystems.com/community/threads/the-path-to-success-for-block-storage.81165/ and linked articles.

I did read this, and one point that you made was the benefit of larger disks over faster spindles. I opted for the higher-speed drives here figuring that because of the sheer quantity of disks involved, there would probably be a benefit if the HBA could save some time here and there managing all the disks. Perhaps that is an error in thinking on my part .. if I move away from the 12Gb/s 7200 SAS disks, I could probably look at larger/slower disks without a huge net cost differential.
 