Maximizing Utility of SSD / Optane

RIPPEM

Dabbler
Joined
Feb 20, 2019
Messages
11
Folks,

I've been studying the trials and tribulations of ZFS for a solid week and think I have my gameplan worked out for my new FreeNAS box. The one thing I did want to run by folks with a lot more experience than me is the notion of using common physical SSDs to cover L2ARC for multiple pools and partitioning an Optane card to do the same for multiple pools' SLOGs. My idea on the L2ARC is based on a wider stripe of IOPS and bandwidth being available to service bursts if I took, for example, a 50GB partition on each of 6 SSDs to create my L2ARC instead of just dedicating a single 400GB SSD. This would allow the same group of 6 SSDs to service multiple pools, with each pool having access to a much more robust L2ARC, albeit sharing IOPS across ZFS pools. It would also help me keep my L2ARC sizes controlled and proportional to my system DRAM (96GB).

Along similar lines, the capacity and IOPS of the Optane card are way more than what a single pool in my environment would consume for SLOG, so I was thinking about overprovisioning it to get improved lifespan and then carving up the usable space into multiple partitions so it could handle SLOG for multiple pools. The FreeNAS box will be 4x8Gb FC (target) and 2x10Gb Ethernet attached.

At no point will I be mixing L2ARC and SLOG on common devices. The back end spinning rust varies by pool and is a hodgepodge of vdev / pool strategies, some mirrored/striped for performance, some raidz3 for data archival.

Does this seem like a reasonable approach, or can someone see where I'm missing a critical piece of design logic?
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
The first step with L2ARC is maxing out RAM in the box. What are the hardware specs you will be using?
SLOG is for sync=always workloads; iSCSI, for example. What is your planned usage?
 


RIPPEM

Dabbler
Joined
Feb 20, 2019
Messages
11
96 GB of RAM in this box. Workloads will vary, but ESXi will be accessing at least 2 of the pools on the box via iSCSI or FC. Definitely sync=always for servers, but maybe not so much for some dev/test VMs that can afford to die and be restored from VEEAM if something should blow up with the pool that houses them.

Figured I'd edit to add more detail. This box is a dual hex-core E5 5650 on a Supermicro board. The FC controller is QLogic (I don't have the exact chipset handy, but FC target mode works fine with it). The HBA controller is an LSI 9211-8i in IT mode feeding a 48-slot SAS2 expander. The 2x10Gb is a dual-port Chelsio. SSDs are Pliant LB406S. Spinning rust is a mishmash of Hitachi Ultrastar SATA. The Pliant SSDs aren't super fast, so using them for SLOG is pointless, but they would make a decent L2ARC. SLOG goes on Optane because I don't want it on the same controller and lanes as all those expander slots.

Replacing an old 42-bay NexSAN SATABeast that just can't keep up anymore. Anemic cache, severely limited RAID, and no meaningful way to improve write speed on it. vFlash read cache has been propping it up for the last 2 years to improve read performance, but it's time to retire it. I tried vSAN as an option, and VMware can stuff that product somewhere dark and smelly where it'll feel right at home.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I've been studying the trials and tribulations of ZFS for a solid week and think I have my gameplan worked out for my new FreeNAS box.

Always happy to see someone who's done their research, and it looks like you've really gotten into the meat of things here. Welcome aboard; just be forewarned, I haven't found the bottom of the rabbit hole yet.

The one thing I did want to run by folks with a lot more experience than me is the notion of using common physical SSDs to cover L2ARC for multiple pools and partitioning an Optane card to do the same for multiple pools' SLOGs. My idea on the L2ARC is based on a wider stripe of IOPS and bandwidth being available to service bursts if I took, for example, a 50GB partition on each of 6 SSDs to create my L2ARC instead of just dedicating a single 400GB SSD. This would allow the same group of 6 SSDs to service multiple pools, with each pool having access to a much more robust L2ARC, albeit sharing IOPS across ZFS pools. It would also help me keep my L2ARC sizes controlled and proportional to my system DRAM (96GB).

Interesting concept. In this case, you're potentially sacrificing L2ARC performance consistency for burst read benefits. I don't know that real-world usage would bear this out as a win versus just dedicating the entirety of a drive or two to each pool, but I'm interested in the results. It's certainly technically doable, although it might necessitate manual/command-line setup. I'd also advise not using the entirety of each drive, in order to preserve endurance and write performance (since you'll have to fill the drives before they're useful).
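
For reference, a rough sketch of what the manual setup could look like from the shell, assuming hypothetical pool names (pool1, pool2) and the SSDs showing up as da1 through da6 - as noted above, this isn't something the GUI will do for you, so it would be gpart plus zpool add by hand:
Code:
# Partition one SSD with a 50G slice per pool (repeat for da2 through da6)
gpart create -s gpt da1
gpart add -t freebsd-zfs -a 1m -s 50G -l l2arc-pool1-da1 da1
gpart add -t freebsd-zfs -a 1m -s 50G -l l2arc-pool2-da1 da1

# Stripe the matching partitions from all six SSDs into each pool's L2ARC
zpool add pool1 cache gpt/l2arc-pool1-da1 gpt/l2arc-pool1-da2 gpt/l2arc-pool1-da3 \
    gpt/l2arc-pool1-da4 gpt/l2arc-pool1-da5 gpt/l2arc-pool1-da6
zpool add pool2 cache gpt/l2arc-pool2-da1 gpt/l2arc-pool2-da2 gpt/l2arc-pool2-da3 \
    gpt/l2arc-pool2-da4 gpt/l2arc-pool2-da5 gpt/l2arc-pool2-da6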

As far as L2ARC size impacting memory goes, your use of iSCSI means you do need to keep it in mind. Assuming a worst-case scenario of 4KB blocks from Windows VMs translating directly into ZFS records, and each record costing about 40 bytes of RAM to index (the raw header is around 80 bytes, but ARC is compressed), keeping 1TB of data in L2ARC would cost you about 10GB of your RAM.
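
A quick back-of-envelope version of that math, under the same assumptions (4K records, ~40 bytes of header each):
Code:
# 1 TiB / 4 KiB = 268,435,456 records; at ~40 bytes of header each that's roughly 10 GiB of RAM
echo $(( 1099511627776 / 4096 * 40 / 1073741824 ))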

Along similar lines, the capacity and IOPS of the Optane card are way more than what a single pool in my environment would consume for SLOG, so I was thinking about overprovisioning it to get improved lifespan and then carving up the usable space into multiple partitions so it could handle SLOG for multiple pools.

Optane isn't NAND, which means you don't have to do the traditional LBA-limit overprovisioning (in fact it won't let you) - since the 3D XPoint cells are able to overwrite in place, it will wear-level very nicely on its own. Leaving some space unallocated is still a good idea.
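
As a sketch of the carve-up, assuming the Optane shows up as nvd0 and using made-up pool names - a modest slice per pool leaves the rest of the card untouched as headroom:
Code:
# Per-pool log size is a judgment call; a few seconds' worth of incoming sync writes is the usual yardstick
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -a 1m -s 16G -l slog-pool1 nvd0
gpart add -t freebsd-zfs -a 1m -s 16G -l slog-pool2 nvd0

zpool add pool1 log gpt/slog-pool1
zpool add pool2 log gpt/slog-pool2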

Since this is your potential bottleneck for writes, though, sharing it does expose you to a potential "noisy neighbor" denial of service: if one of your zvols/datasets gets slammed with enough writes to overwhelm the Optane (which, based on the 4x8Gb FC + 2x10GbE, is in the realm of possibility), it could choke out the others. So far I haven't found any consumer-available NVMe devices that can have QoS applied to their separate namespaces; if there were one, you could set something like "minimum guaranteed bandwidth" for each namespace (NVMe partition) to avoid this. It's a very unlikely situation, but for the sake of completeness, here it is.

At no point will I be mixing L2ARC and SLOG on common devices.

Good. The two workloads are very different, and mixing them on one device often makes it do a poor job of both - although I could see Optane potentially being up to the task.

The back end spinning rust varies by pool and is a hodgepodge of vdev / pool strategies, some mirrored/striped for performance, some raidz3 for data archival. Does this seem like a reasonable approach, or can someone see where I'm missing a critical piece of design logic?

There are certain things that ZFS does/configures by default on a per-pool basis, so the number of pools could impact your memory/performance/stability. If you can elaborate on the pools a little, that would be great.
 

RIPPEM

Dabbler
Joined
Feb 20, 2019
Messages
11
I completely understand the lack-of-QoS issue involved and that there are no mechanics that would allow me to either guarantee some kind of minimum IOPS to each or cap the IOPS from a chatterbox. Fortunately, I'm not running anything that is hyper-sensitive to nominal disk IO delays, and anything I do from here is nearly guaranteed to be an improvement on my current storage situation.

I'm envisioning 4 pools to start out:

1. will be a 4 vdev pool of 3 way 2TB mirrors shared via iSCSI/FC + 2 hot spares (Production Servers)
2. will be a single vdev of 6x8TB drives as RAIDZ2 as a VEEAM Repo + 1 hot spare (Online Backups, gets streamed out to tape regularly)
3. will be a single vdev of 6x900GB 10K SAS as RAIDZ2 + 1 hot spare (File Shares via SMB/AFP/NFS)
4. will be a 4 vdev pool of individual disks shared via ? (ESXi playground, completely disposable space, pool can fail at any time without repercussions).

The first 2 will be sync=always, the third I'm not decided on yet, and the 4th will depend on what I'm playing with at the time. SLOG/L2ARC is TBD for each.

If I can get good performance with the Optane SLOG, I'd consider switching pool 1 to 2 vdevs of 6x2TB RAIDZ2 to get better capacity efficiency.
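
For what it's worth, pool 1 as I'm picturing it would look something like this from the command line (disk names made up; I'd actually build it through the GUI):
Code:
# 4 vdevs of 3-way 2TB mirrors plus 2 hot spares
zpool create prod mirror da2 da3 da4 mirror da5 da6 da7 \
    mirror da8 da9 da10 mirror da11 da12 da13 spare da14 da15

# Production servers get sync=always
zfs set sync=always prod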
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I see a potential point of contention on the SLOG during VEEAM backups; the recordsize on that one should be at least the 128K default, if not larger, and at that size Optane will be able to break 1GB/s. With no QoS, it has the chance to choke out your active VMs. Setting logbias=throughput on the backup dataset should help mitigate this.
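
Assuming the VEEAM repo lives on its own dataset (the name below is just a placeholder), that would be something along these lines:
Code:
# Larger records and throughput-biased logging for the backup repo
zfs set recordsize=1M backups/veeam    # 1M assumes the large_blocks feature; otherwise stick with 128K
zfs set logbias=throughput backups/veeam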

My suggestion for the first pool would be that if you're doing regular backups, you could likely get away with running it in 2-way mirrors (6 vdevs of 2x2TB each) - this would give you significantly better raw pool performance versus the RAIDZ2 setup.

The other pools look fine, but having four pools on one machine could be an issue under a heavy simultaneous read/write load. The ZFS defaults will assign the lower of "10% of your RAM" or "4GB" as the maximum amount of dirty data allowed in RAM per pool. With the speed of your network connections (32Gb of FC and 20Gb of Ethernet), if all four pools receive a big batch of incoming writes at once, ZFS could suddenly be in a situation where it needs to find a whole pile of RAM to hold the new data. In order to do that, it'll dump records from ARC, which means your hit rate goes down and your disks get busier. Which means it takes longer to spool the dirty data to disk. And as more comes in, that means more RAM consumed, which means less for ARC, and around we go.

I imagine you'll bottleneck at your SLOG before that point hits (even the almighty Optane has its limits), but it's just something to be aware of when running a lot of pools on one machine. If multiple pools are sync=standard, the risk is higher, since you're then limited only by network speed.
 

RIPPEM

Dabbler
Joined
Feb 20, 2019
Messages
11
I really appreciate the input. Luckily, the disk usage patterns shouldn't introduce opportunities for all of the pools to get beat up at once. VEEAM nightlies will be the most significant steady state loading the system sees. With VEEAM in place, I did consider the possibility of using a stripe of two disk mirrors in the primary VM pool, but the downside is that IF that pool takes a dive, the instant restore from VEEAM by having it NFS host the VMs from its pool (6 disk RAIDZ2) is going to be murderously slow while it restores. 2TB drives aren't all that expensive these days, so if I need to get more write IOPS, I'll waste a little more space and a few more expander slots. I was mostly wondering if the combination of Optane SLOG with decent L2ARC and transaction aggregation prior to the periodic syncs might let me cheat a bit on the backend and still have decent overall performance.

You answered a large question I had with your statement about behavior when a massive ingress burst combining sync and async hits. Is there a limit to how much RAM can be reaped away from ARC by large incoming writes before the system starts to push back on the senders by delaying acknowledgements and such?

I could push the system up to 192GB of RAM, but I was really hoping to avoid getting into the much more spendy 16GB DIMMs.
 

RIPPEM

Dabbler
Joined
Feb 20, 2019
Messages
11
Guess I need to go get the ZFS books at some point. I won't be happy until I really understand the innards of this thing. :)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I really appreciate the input. Luckily, the disk usage patterns shouldn't introduce opportunities for all of the pools to get beat up at once. VEEAM nightlies will be the most significant steady state loading the system sees. With VEEAM in place, I did consider the possibility of using a stripe of two disk mirrors in the primary VM pool, but the downside is that IF that pool takes a dive, the instant restore from VEEAM by having it NFS host the VMs from its pool (6 disk RAIDZ2) is going to be murderously slow while it restores. 2TB drives aren't all that expensive these days, so if I need to get more write IOPS, I'll waste a little more space and a few more expander slots. I was mostly wondering if the combination of Optane SLOG with decent L2ARC and transaction aggregation prior to the periodic syncs might let me cheat a bit on the backend and still have decent overall performance.
Can't really argue against the extra insurance of the 3-way mirrors. If you've got the LFF bays and you're willing to use them, by all means carry on. You can always convert to 2-way mirrors later via the command line by zpool detaching a drive from each vdev.
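
In rough terms, that conversion is just the following, run once per vdev (pool/disk names are placeholders):
Code:
# Drop the third disk from each 3-way mirror to make it a 2-way mirror
zpool detach prod da4
zpool detach prod da7
# ...and so on for the remaining vdevs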

As far as cheating on the back-end vdev performance goes, I wouldn't. You still need high-performance vdevs in order to service cache misses and drain incoming txgs in a timely manner, and going down to two parity-RAIDZ vdevs is likely not going to give you what you need.

You answered a large question I had with your statement about behavior when a massive ingress burst combining sync and async hits. Is there a limit to how much RAM can be reaped away from ARC by large incoming writes before the system starts to push back on the senders by delaying acknowledgements and such?

I could push the system up to 192GB of RAM, but I was really hoping to avoid getting into the much more spendy 16GB DIMMs.

Yes, there are a number of tunables you can adjust for the write throttle behavior.

The main tunables are:

vfs.zfs.delay_min_dirty_percent: The limit of outstanding dirty data before transactions are delayed (default 60% of dirty_data_max)
vfs.zfs.dirty_data_sync: Force a txg if the number of dirty buffer bytes exceeds this value (default 64M)
vfs.zfs.dirty_data_max_percent: The percent of physical memory used to auto calculate dirty_data_max (default 10% of system RAM)
vfs.zfs.dirty_data_max_max: The absolute cap on dirty_data_max when auto calculating (default 4G)
vfs.zfs.dirty_data_max: The maximum amount of dirty data in bytes after which new writes are halted until space becomes available (the lower of the percentage of system RAM or the absolute cap in dirty_data_max_max)

Unfortunately there isn't currently an option to tune these per-pool.

Defaults for your system would be 4G for each pool, with the throttle starting to apply at 60%, or about 2457M. If you changed vfs.zfs.dirty_data_max_max to 2G and left the 60% value alone, your throttle would kick in at about 1228M instead, which might be too soon. I'd recommend leaving the defaults in place and using the dirty data script from Adam Leventhal's site (in spoiler below) to see what typical operation looks like, and then adjusting downward accordingly.

Paste this code into a file in vi/nano and launch via dtrace -s dirty.d PoolNameGoesHere
Code:
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}
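
If you just want to eyeball the current throttle values alongside the script, the sysctls listed above can be read directly; persistent changes would normally go in through the FreeNAS Tunables screen rather than ad-hoc sysctl calls:
Code:
# Read the current write-throttle settings
sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_max \
    vfs.zfs.dirty_data_max_percent vfs.zfs.delay_min_dirty_percent vfs.zfs.dirty_data_sync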

Guess I need to go get the ZFS books at some point. I won't be happy until I really understand the innards of this thing. :)

I might figure out ZFS sometime before the heat death of the universe, but I'm not going to hold my breath. It's complicated, and every time I think I'm starting to get a good grip on it, I find more things to learn, more knobs to turn, and some suggestions to write up (like per-pool write throttle tunables!)
 

RIPPEM

Dabbler
Joined
Feb 20, 2019
Messages
11
Makes sense. I lucked out on a sweet deal for a brand new P3600 1.6TB NVMe card to handle L2ARC, so at least I'll be keeping both SLOG and L2ARC off of the SAS2 lanes and the latency of each nice and low. If all that doesn't quite get me where I want to be, I'll bite the bullet and go to 16GB modules to push RAM up to 192GB. Now I'm anxious to get the last few bits and put this thing together.

I'll bet it does a LOT better than the mixed-workload average of 3K IOPS I'm able to squeeze out of the NexSAN in its current configuration. :)

1GB of shared read/write cache and all spindles, no SSD... This is gonna be quite the step up. I bet monthly full backups don't take 13 hours to complete anymore. :)
 

RIPPEM

Dabbler
Joined
Feb 20, 2019
Messages
11
I made it through Adam Leventhal's post that you mentioned above, and it's a fantastic read on how OpenZFS handles dirty data and write throttling. Just wanted to thank you for replying with that. The dtrace tools are going to be a huge help for tuning, troubleshooting, and making sure that I have the components scaled appropriately for the workloads involved.
 