Build and performance tuning advice needed

Status
Not open for further replies.

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Please don't use autotune; turn it off and delete any tuning parameters that were auto-generated.

I actually asked him to turn it on to see if it matters. There are times when it should be on (like if a system is exhibiting bad behavior and potentially crashing). ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Forgot to mention this.

Standard ZFS policy is to never have the same device serve as both a SLOG and an L2ARC. The two workloads are opposites of each other, and the ZFS scheduler does not expect writes to the SLOG to compete with reads from the L2ARC. In fact, you've significantly diminished the purpose of both by putting them on the same device, regardless of how fast it is.

I'd definitely go with one or the other, and if necessary buy more SSDs for better performance. One device for the SLOG, or one for the L2ARC. Not both. ;)

This could be the cause of some of your problems as FreeNAS is not tested to work in the configuration you are using (we don't test for it because it's not supported via the WebGUI, so you had to have done the configuring yourself somehow).
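
For reference, splitting them onto dedicated devices is trivial from the command line (pool and device names here are hypothetical):

zpool add tank log da7       # one whole SSD dedicated to SLOG
zpool add tank cache da14    # a second SSD dedicated to L2ARC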
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
We seem to be well within the IOPS and bandwidth limits of the drives. Isn't that all that matters?
SLOG is very sensitive to latency. Sharing one SSD for SLOG and L2ARC inevitably increases latency, which will bring down your write performance. Don't forget that the actual NAND devices don't have separate read/write paths, so you can't do them in parallel. Similarly, SATA is half-duplex.

Larger SSDs are faster so using one large SSD for both should actually make for a faster SLOG device.
Bandwidth, possibly, in a somewhat convoluted scenario. Latency? Nope.

Larger SSDs last longer so using one large SSD for both saves money too!
You're also filling it more, so the effect is smaller to nonexistent, depending on the actual workload.

Maybe I need to revisit the decision but I purposely chose to use SATA SSDs. I wanted to be able to replace an SSD if necessary without taking the server out of production. There aren't any motherboards or chassis in my almost-no-money price range that can do hot swap PCIe cards.
That's what U.2 is for, though it may require chassis changes.

Also I don't think the older motherboards we're using support gen 3 PCIe which I suspect would be required?
Not required, but bandwidth is naturally halved. Not a big deal.
 

Carl Thompson

Dabbler
Joined
May 22, 2017
Messages
15
@Ericloewe , @cyberjock

Thank you both. OK, I will set up one of the servers with one SSD for SLOG and one SSD for L2ARC next week and see if there's a difference for us. I'm still a little skeptical, though. I have set up many ZFS servers this way over the last few years and never had an issue that seemed related to the SSD setup. The internet is full of examples of others who do it this way, presumably without issues, and none of my 5 servers seems to have SSD performance problems because of it (though @cyberjock may disabuse me of that notion soon). From a logic standpoint, my argument that an SSD with enough IOPS and bandwidth to do both jobs is OK seems valid to me, and my experience seems to bear that out, so it feels like we're trying to solve a problem I don't have. But maybe I'm wrong and I have a problem I just don't realize.

Carl
 

Carl Thompson

Dabbler
Joined
May 22, 2017
Messages
15
Forgot to mention this.
This could be the cause of some of your problems as FreeNAS is not tested to work in the configuration you are using (we don't test for it because it's not supported via the WebGUI, so you had to have done the configuring yourself somehow).

Normally I build the pools by hand and import them into FreeNAS. The one you looked at was the original prototype; it was actually built under Linux and later imported when we decided to switch to FreeNAS. However, all the others were built under FreeNAS (from the command line). The FreeNAS-built pools have the same server "vacation" problem, though, so I don't think the origin of the pool is a factor. For what it's worth, here's the command we use to manually create the FreeNAS-originated pools:
zpool create -o autoexpand=on -o autoreplace=on -o delegation=on \
    -O atime=off -O compression=lz4 -m /mnt/data -O dedup=on data \
    mirror da0p2 da1p2 mirror da2p2 da3p2 mirror da4p2 da5p2 \
    mirror da6p2 da8p2 mirror da9p2 da10p2 mirror da11p2 da12p2 \
    log mirror da7p1 da14p1 cache da7p2 da14p2 spare da13p2 da15p2

Thanks,
Carl
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Not required, but bandwidth is naturally halved. Not a big deal.

The nice thing about running a drive which surges from 2-4GB/s on a 2GB/s bus is that it will pretty much run flat out at 2GB/s most of the time. At least for sequential access.
 

Carl Thompson

Dabbler
Joined
May 22, 2017
Messages
15
Do none of you think that the issue could simply be that I don't have enough ARC used for metadata as I theorize below?:

I wonder if 144GB of RAM just isn't enough with dedup on and 560GB of L2ARC. From what I understand, only 1/4 of the ARC is allocated to metadata caching, and I've read that both the dedup tables and the L2ARC indices count as metadata. My total ARC size on these servers is about 120GB, but if I'm calculating correctly, the dedup table on one of the servers is about 40GB by itself. So I'd guess I either need to bump up the memory significantly or change the percentage of the ARC that can be used for metadata. What's the best way to do that? Is changing this ratio a good idea?
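
For context, here's how I've been checking those numbers (sysctl names are from FreeBSD's stock ZFS, so correct me if FreeNAS differs):

sysctl kstat.zfs.misc.arcstats.arc_meta_used   # bytes of ARC currently holding metadata
sysctl vfs.zfs.arc_meta_limit                  # ceiling on ARC metadata, in bytes

I assume raising the limit would just be a matter of setting vfs.zfs.arc_meta_limit higher via a tunable, but I don't know whether that's a good idea, hence the question.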

Or that I'm simply asking too much from NFS and iSCSI would give me more consistency?:

Another thought is that I may be beyond what NFS is capable of with FreeNAS and VMware. I've heard (here) that VMware's NFS implementation is a second-class citizen and that iSCSI is the way to go for VMware. I'm ready to switch to iSCSI if the consensus is that its performance and consistency are better for VMware (though this will involve some pain). If so, I'm also looking for best-practices advice for iSCSI (I'm an iSCSI n00b). I'm also considering the iSCSI switch for the VAAI integration benefits.

Or that I simply need to tune the transaction group settings?:

Would it be advisable to combat the server "vacations" by tuning the maximum transaction group size or timeout lower? Wouldn't that increase fragmentation? Is fragmentation an issue I really need to worry about?

Thanks,
Carl
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Very few read requests actually seem to be making it past the ARC to the L2ARC. If you look at the gstat snapshot I posted you'll see that there are tons of writes to the SSDs but not a lot of reads. So there isn't really much contention.

This is a misunderstanding of contention. Contention has nothing to do with reads vs writes. Contention is a desire by the platform to be doing two (or more) things at the same time, and one of them having to be delayed because the single resource is currently busy. L2ARC write traffic is deliberately managed carefully in order to avoid serious amounts of contention with L2ARC reads. SLOG write traffic is naturally serial in nature so there's normally no contention for SLOG. However, when you put these two workloads next to each other, they are unaware of each other, and it is totally possible for the L2ARC subsystem to be trying to dump tons of data from the ARC to the L2ARC, while simultaneously one of your clients is trying to do a sync write, and because the mass dump to L2ARC is dominating the SSD's capacity, the SLOG traffic is delayed.

THAT's what contention is.
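
You can actually watch this happen. Something like the following (device names taken from your zpool create command) will show the shared SSDs pegged while the two subsystems fight over them:

gstat -f 'da7|da14' -I 1s   # per-device %busy and write latency, refreshed every second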
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hi, @jgreco, thanks for your response.

So would you recommend that I tune down the transaction group size and/or timeout then? What's a good place to start for that? Would doing this increase fragmentation?

Look for posts discussing vfs.zfs.txg.timeout.

Fragmentation might increase, but fragmentation is always such a problem for block storage that you should already be designing your systems with the expectation that you'll hit steady-state fragmentation. Have lots of pool free space and lots of L2ARC, basically.
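
As a purely illustrative sketch (the values are not a recommendation, just the mechanics):

sysctl vfs.zfs.txg.timeout      # default is 5 (seconds)
sysctl vfs.zfs.txg.timeout=2    # flush transaction groups more frequently
# To persist across reboots, add it as a sysctl-type tunable in the WebGUI.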

Maybe I need to revisit the decision but I purposely chose to use SATA SSDs. I wanted to be able to replace an SSD if necessary without taking the server out of production. There aren't any motherboards or chassis in my almost-no-money price range that can do hot swap PCIe cards.

Yeah, there's that. Tradeoffs. Of course, if your chassis has extra SATA/SAS bays, you can still add a replacement SSD without downtime.

Also I don't think the older motherboards we're using support gen 3 PCIe which I suspect would be required?

I'm not aware of any such limitations. I believe we've used NVMe on PCIe-less-than-3 in the past but I don't know for sure. Hardware compatibility is always good to verify.

We originally tried Samsung consumer-level "Pro" SATA SSDs before settling on the S3710s. What we found is that while on paper they claim really high IOPS there's apparently a difference between IOPS and sustained IOPS. The Samsung Pro SSDs couldn't even keep up with our moderate needs and we ended up having to set sync=disabled on our VM datasets until we could replace them.

Sustained I/O numbers on *any* SSD seem to be incredibly overoptimistic. The Samsung 850 PROs are still basically consumer-grade drives and don't compare to the S3710, which is a solid data-center-class drive. I usually find that there's a little more reality if I drop an order of magnitude off the rating.

S3710s seem to be getting harder to find cheap so I will probably try some Samsung 863a SATA SSDs, though, as those are supposed to also be quite nice.

I'm impressed you ever found S3710's cheap.

True. But my ARCs currently have a hit rate between 93% and 99%, so I'm pretty sure I could saturate at least one 10GbE link even with my lowly SATA SSDs, especially since I'm striping them for the L2ARC! (That is, I could if the ixgbe driver in 10.3 weren't so horribly pathetic.) Going with PCIe or even more SSDs just doesn't seem to be required for our workload, though things may change in the future.

Just throwing things out there. I'm not real thrilled about the evolution of the SSD marketplace, and I guess for tasks like this the future is probably leaning NVMe. I personally really like the 2.5" form factor SATA/SAS because it is so versatile. I can use them with RAID, etc. But if you were to actually have to go out and buy something, the NVMe has at least some substantial benefits, and similar cost, to SATA.

I forgot to mention that for our newer servers we are over-provisioning these SSDs by 25%. We secure erase the drives before partitioning them to use only 300GB out of 400GB. It's not necessary to pull the SSDs out and use Intel's SSD Toolbox to secure erase; it can be done right from FreeNAS using camcontrol.

Yes, but I don't recommend it for beginners. If you know it can be done and how to do it, then that's great.
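
For the archives, the rough shape of the procedure (da7 is a hypothetical device name here, and the erase step destroys everything on the drive):

camcontrol security da7 -U user -l high -s temppass   # set a throwaway ATA password
camcontrol security da7 -U user -e temppass           # issue the ATA SECURITY ERASE
gpart create -s gpt da7                               # then partition only part of the drive,
gpart add -t freebsd-zfs -s 300G da7                  # leaving the remainder unallocated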

These enterprise-level drives already have a good amount of over-provisioning built in, but it's reasonably well established that over-provisioning by 20% or so more can gain you a bit of performance down the road after the drives have been used for a while. What's not well established is that over-provisioning by massive amounts gains you more than over-provisioning by 25%. I suspect that the law of diminishing returns kicks in and any gains from massive over-provisioning vs. normal over-provisioning are negligible. So my opinion is to buy a bigger SSD for SLOG because bigger SSDs are faster and last longer. Over-provision it normally but don't completely waste the space; you bought it, you might as well use it!

Well, that *sounds* nice but is meaningless in practice. The only meaningful data ever stored on a SLOG device are the last two transaction groups. These are the things that will be recovered from SLOG upon pool import. You can choose to make the SLOG an exabyte large, if you want, but that'll all be wasted space except for the amount of storage required to store the last two transaction groups.
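
Back-of-the-envelope, assuming a single 10GbE ingest path and the default 5-second txg interval: 1.25GB/sec x 5 sec x 2 txgs = ~12.5GB. Even a ~30GB SLOG partition is generous.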

If you look at how an SSD operates under the sheets, what you really want is for the SSD to have a massive pile of already-erased blocks.

https://arstechnica.com/information...volution-how-solid-state-disks-really-work/3/

What slows an SSD down is when it has to go and do page shuffling to create blocks that can be erased. You already recognize that having free space on the disk leads to faster performance. If you have 300GB of that 400GB SSD filled with dirty sectors, that means you have 100GB of the advertised capacity (and probably ~150GB of actual underlying flash) that may be erased and ready-to-go. So if your clients start pounding writes, you can soak up ~100GB before you're shuffling pages.

My point is that if you instead only use 30GB of that 400GB SSD, you can soak up ~370GB before you're shuffling pages.

It's totally possible your SSD is fast enough to erase blocks at a rate sufficient to make sure that this doesn't turn into a significant impact. That's great if so. However, because there's no VALUE in consuming more than ~30GB, it seems to me that the smart move is to limit it to that size, let the controller have a huge pool of erased blocks, and then be assured of optimal behaviour. A larger SSD mostly gets you greater endurance, and reducing page shuffling is also a good thing for the longevity of an SSD.


Do none of you think that the issue could simply be that I don't have enough ARC used for metadata as I theorize below?

Or that I'm simply asking too much from NFS and iSCSI would give me more consistency?

Or that I simply need to tune the transaction group settings?

Thanks,
Carl

Anything's possible. There are large numbers of moving parts in these systems, and it is entirely possible for there to be half a dozen with knives poised to stab you in the back, even if five of the six aren't your current pain point. Try to avoid discounting the advice you're getting. I haven't seen anything that's bad or wrong so far. Even if these suggestions do not fix your immediate issue, they're good knowledge.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I missed the dedup part. That tends to throw a wrench into any setup that should otherwise be working.

Are you actually getting any meaningful dedup ratios?
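
If you're not sure, it's easy to check (pool name taken from your zpool create command):

zpool get dedupratio data   # overall dedup ratio for the pool
zdb -DD data                # DDT histogram, including on-disk and in-core table sizes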
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Roughly 5GB of RAM per TB for dedup...
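
By that math, the ~40GB dedup table mentioned earlier corresponds to roughly 8TB of deduped data (8TB x 5GB/TB = 40GB).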

With the cost of RAM vs. storage, you definitely want to be getting good dedup ratios to justify the RAM usage.

The good news is that improvements to ZFS dedup are planned... hopefully they'll work well.
 

wreedps

Patron
Joined
Jul 22, 2015
Messages
225
All good info
 