
A bit about SSD performance and Optane SSDs, for when you're planning your next SSD....

NOTE:

I'll be referring in this page to a type of SSD developed by Intel and Micron, called 3D XPoint (pronounced "crosspoint"). It's most widely sold as Intel's Optane. The Optane devices I mean are things like the 900p, 905p, and P48xx. First-generation Optane and some of the smaller devices aren't much use.

To see why I single out that kind of SSD for performance, you need to know something about SSD sustained, random, and mixed I/O. That's what this page is about. A lot of reviews don't cover this stuff. But if you're building a file server, you ought to be aware.....

I'll come back to Optane in a bit. Then I'll move on to discuss which brands and models of SSD to consider (or not!), which form factors/connections to use, and special vdev SSD sizing and structure, as those are all closely connected to this topic.


"MY SSD SPECS ARE GREAT! AREN'T THEY...??"

As is widely known, 4k random reads and (even more so) writes are probably the single worst load for spinning disks: they have to physically move the read-write heads and wait for spinning metal to do it. But I've got SSDs, so problem solved, right?

Wrong.

What's less well known is that, with the current exception of Optane and battery-backed DRAM cards ONLY, sustained mixed reads and writes are absolutely capable of trashing top enthusiast and datacentre SSD performance. Not all reviews publish that information, but try to find ones that do when you choose an SSD for anything except pure SLOG (ZIL).

With almost every SSD, you get a classic "bathtub" curve: very good for pure reads and pure writes, but dreadful for sustained mixed RW. That Intel datacentre SSD that does 200k IOPS? Reckon on perhaps as little as 10-30k IOPS when it's used on mixed 4k RW loads. As of 2020, only second-generation-onwards Optane breaks that pattern. Not your Samsung Pro SSDs, not your Intel 750 or P3700 write-oriented NVMe datacentre SSDs. Optane and pure battery-backed RAM cards only.

I should clarify: that's nothing to do with SSDs having too-small DRAM or SLC caches. It's inherent in the SSD NVRAM (NAND flash) chips themselves. Because it's nothing to do with the device's cache type or size, a "better" SSD, or one with a "better" cache or no cache at all, won't help much.

The problem is that the basic NVRAM chips SSDs are made from are awkward to write and erase. That means they run slowly, need caching, and need a clever controller to get the most out of them and paper over the slow bits where possible.
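
If you want to check this for yourself rather than rely on reviews, a load generator like fio can produce a sustained mixed 4k random workload. A minimal sketch, where the target file, size, run time, queue depth and 70/30 read/write mix are all illustrative assumptions (point it at a scratch file on the SSD under test, not at live data):

    # Sustained mixed 4k random read/write test (all settings here are illustrative)
    fio --name=mixed4k --filename=/mnt/scratch/fio-testfile --size=20G \
        --rw=randrw --rwmixread=70 --bs=4k \
        --iodepth=32 --numjobs=4 --time_based --runtime=600 \
        --ioengine=posixaio --group_reporting

Run it at different read/write mixes (the --rwmixread value) and for long enough that any cache fills; comparing 100% read, 100% write, and mixed runs is what shows the "bathtub" shape described above.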

If I refer specifically to Optane now and then in resources I write, that's why. Not because I'm a fan-boi, but because Optane uses a different technology that's largely immune to a fundamental NVRAM problem which severely affects every other SSD on the market.

Just look at these graphs..... and remember: Intel is usually reckoned to make the best datacentre SSDs. In these graphs, Intel and Tom's Hardware are directly comparing Optane to Intel's top-performing non-Optane drive, the P3700 NVMe. Also remember that current Optane is fast enough at raw I/O not to need caching, and to be almost power-loss safe without specialised capacitors and circuitry.


[Images: optane1.png, optane2.png - Intel / Tom's Hardware graphs comparing Optane against the P3700 under sustained and mixed I/O]

Now another graph. This is the output of zpool iostat -pw: the I/O latency and responsiveness of my pool during a really heavy workload. Low latency is important generally, but it's absolutely crucial for sync writes/ZIL, as well as for dedup metadata.

[Image: Optane.png - zpool iostat -pw latency histograms for the pool under heavy load]

Spot the Optane? At least, I *think* that's the Optane; I can't imagine the sync write latency reflects anything except ZIL/SLOG time. Look how beautifully tight the clustering is on the sync write queue! Not just extremely low, but uniformly extremely low. In fact, as we can see, it never gets to be anything other than extremely low!
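
If you want to see the same thing for your own pool, the command is the one named above. A quick sketch, where "tank" and the 30-second interval are placeholders:

    # Latency histograms for the pool, refreshed every 30 seconds
    zpool iostat -w tank 30
    # Adding -p (as used for the graph above) prints exact values rather than scaled units
    zpool iostat -pw tank 30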

The rule of thumb in ZFS is: don't have one physical device doing two workload roles (even if it's mirrored or redundant). But as we can see, this doesn't apply to Optane. The workload above is with mirrored Optane 905p's, partitioned to host both ZIL/SLOG and also a special metadata vdev. End result - it's still awesome, even running both workloads. To remind you, that means the same SSDs delivering 255ns to 4usec latency for ZIL/SLOG are *also* handling metadata and dedup table I/O at full speed, during an intensive session.

(If you're interested, that dump is from a 35 TB local zfs send | recv replication, and the average pool replication rate is around 400 MB/sec on a quad-core Xeon. I could get it faster with more exact tuning, but didn't see the point.)
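
For context, that kind of local replication is, in outline, just a recursive send piped into a receive. A hedged sketch, with the pool and snapshot names made up for illustration:

    # Replicate a whole pool locally, preserving datasets, snapshots and properties
    # ("oldpool", "newpool" and the snapshot name are placeholders)
    zfs snapshot -r oldpool@migrate
    zfs send -R oldpool@migrate | zfs recv -dF newpool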

One last latency output. This one is 28 hours (24 TB) into the 35 TB replication:

[Image: latency3.png - per-device slow I/O counts, 28 hours into the replication]

Notice the hugely different latencies. I've configured ZFS to treat any disk I/O that takes longer than 200 milliseconds as "slow" (sysctl vfs.zfs.zio.slow_io_ms=200). You can then see the stats using zpool status -s. It's intended to detect gradually failing drives, but it's also a good way to find drives that are being queued up so much that they can't respond quickly, and how often that's happening.
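
For reference, those are the exact knobs involved (the 200 ms threshold is just the value I chose, and "tank" is a placeholder pool name):

    # Flag any single disk I/O taking longer than 200 ms as "slow"
    sysctl vfs.zfs.zio.slow_io_ms=200
    # Show the cumulative slow I/O count for each device in the pool
    zpool status -s tank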

  • The pool being *read* was created under 11.3. Its hard drives have typically reported around 0.8 - 2.4 million slow access events.
  • The HDDs in the newly created 12.0 pool are exactly the same model of HDD. They are also in 2-way mirrors, although in 4 rather than 3 HDD vdevs. But they are writing, which is typically quite a bit slower than reading. They also have full use of the OpenZFS 2.0 tunables, features, and special vdevs for metadata and dedup (both pools are fully deduped). We can allow that they see fewer accesses, because data can be written in large chunks rather than piecemeal laid down over time. Even after allowing for all of that, the OpenZFS 2.0 pool has a typical "slow access" event count of about 13-18 thousand per device.
  • The Optane special vdevs in the 12.0 pool are reporting just 900 slow events, or ~6% of the HDD slow event rate in that pool. That's despite handling all file and pool metadata, and all dedup data (although a lot of that is cached in ARC, so it doesn't need reading back as often).

Short version: the 12-BETA HDDs were "slow" (>200 ms) for disk I/O at about 1% of the rate seen on the 11.3 pool. Because these are total event counts, it also confirms they are running at consistently low latency, not just low latency on average.

I'm still fine-tuning, but here's one other gain. A mass recursive delete of 957 snapshots across 17 datasets (16,269 snaps in total) just took 48 minutes. That's a deduped snapshot destroy rate down from tens of seconds per snap before special vdev + Optane, to about 1/6 of a second each after moving to 12-BETA and adding them.
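
For reference, that kind of mass delete is just the usual recursive snapshot destroy; a minimal sketch with made-up dataset and snapshot names:

    # Destroy one named snapshot recursively across a dataset and all its children
    # ("tank/data" and "auto-2020-08-01" are placeholders)
    zfs destroy -rv tank/data@auto-2020-08-01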

That gain is almost entirely down to one reason: the metadata and dedup tables have been offloaded to a fast SSD vdev in 12-BETA, which can't be done in 11.3 or earlier.

Big win!

A LITTLE EYE CANDY:

One statistic in the meantime....

[Image: IMG_20200813_181408_648.jpg - screenshot showing roughly a quarter of a million 4k random write IOPS on the special vdev SSDs]


That 1/4 million 4k random IOPS is why special vdevs with good SSDs are essential for dedup.

But it's also why that "bathtub curve" really matters. Your SSDs really might need to pull that kind of IOPS under load... and it could easily be mixed IOPS (although that screenshot is write-only).

As a test, I just ran 2 dedup pools: one with an Optane special vdev mirror, one with a Samsung 970 Evo Plus special vdev mirror. The Samsungs top out at about 2,300-2,450 write IOPS. The other pool, with the twin Optanes, is doing 40,000-100,000 write IOPS on them, and they aren't even being pushed, as we can see from their real capability above. The Samsungs aren't bad SSDs, and if you're on a budget I'd say you can use them. They'll work well enough to make dedup viable. But be aware that even good consumer SSDs can't deliver the same class of performance. They are still slowing my pool down - a lot.
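
If you want to watch what your own special vdev SSDs are doing during a load like this, per-vdev operations and bandwidth are visible from zpool iostat. A sketch, where "tank" and the 10-second interval are placeholders:

    # Per-vdev read/write operations and bandwidth, refreshed every 10 seconds
    zpool iostat -v tank 10
    # Add -l to also show average latencies per vdev
    zpool iostat -vl tank 10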


WHAT DOES THIS MEAN IN REALITY?

In no particular order.....
  1. Choose your SSDs carefully. Good read and write specs are only part of the story. Try to find a review that also covers sustained mixed RW before you decide.
  2. Be cautious of SSDs that have great stats while their cache lasts, but tank under ongoing use once the cache is full. SSD designs that use caching will speed up the first part of a transfer. After the cache fills, it's often a different story. Whatever speed you get then, well..... that's what you get. And it's not always great under load.
  3. SSD (or Optane) cache on a hybrid drive is not the same as an SSD or Optane device. If you've decided you need SSD/Optane for a special vdev or a demanding pool, don't assume a hybrid will save money. It'll probably slow to HDD speeds once the cache is full, which happens easily. These disks are NOT good for heavy-duty random 4k I/O. If caching could help, ZFS ARC or L2ARC would already be doing it; if they can't, the onboard cache on a hybrid HDD will do literally zero.
  4. Consider your server's workload, and whether fast mixed I/O matters for it. If you need (or want) ultra-high speed and low latency, the options as of 2020 are literally: 1) first tier, the Optane 900p/905p/P48xx series, or battery-backed DRAM cards if you can find them; 2) very much second tier, absolutely everything else on the market.
    If you also need good sustained mixed RW at smaller block sizes, Optane is on a different planet of consistent high performance. There are very good SSDs that will work well in a datacentre or a home server. Anything Intel, and anything Samsung (ideally PRO, or at least EVO, never QVO!), are pretty much everyone's "most reliable bet". But no matter how awesome your SSD's graphs look at low queue depth, ask yourself whether mixed I/O applies to you, and check reviews for the SSD's sustained mixed 4k RW. Some SSDs will be good enough; many will let you down.

    Optane specific:
  5. Use Optane for ultra-low latency roles - ZIL/SLOG and metadata - if you need to. As a side effect of Optane's technology, not having awkward read-erase-write cycles means it has very, very low latency. Use Optane without question for ZIL/SLOG (which needs superb latency) and dedup special vdevs (which need very efficient mixed 4k I/O), and also possibly for metadata and L2ARC.
  6. If you plan to use Optane, consider buying 2 Optanes and partitioning them (using part of each for mirrored roles). They're fast and reliable enough to handle it, and it will save you money on a pretty expensive SSD.
  7. More specifically, Optane devices are probably the only good exception right now to the ZFS rule of thumb about never using a single device for multiple ZFS roles, because of their exceptional capability under those tricky ultra-low latency mixed random I/Os.
    You can validly use partitioned, mirrored Optanes for any combination of ZIL/SLOG, L2ARC, and special metadata/dedup table vdev - especially since ZIL/SLOG usually only needs to be small. No need to burn extra cash on 2 more Optanes. (There's a hedged sketch of the partitioning and pool commands after this list.)
    But don't use two partitions of the same SSD as the two sides of one mirror - it won't be redundant, because if you lose the SSD you lose both halves, not one.
    Pro tip: if you have an Optane special vdev, there isn't much point putting L2ARC on it as well. The server will pull metadata off the pool's own special vdev fast enough that you won't see a benefit from also caching it in L2ARC. If you use tunables to reserve L2ARC for pool file data, then there could be an argument for Optane L2ARC, since it then isn't duplicating the Optane special vdev.
  8. Consumer Optane isn't officially "power loss protected", but it's extremely close - probably close enough to count as safe for most of us. Because of the lack of caching and the incredibly fast write times, many home server users and small businesses can treat it as such. I gather these devices were once listed as power-loss protected by Intel, and that the designation was later withdrawn. But they are expensive, and you may not need them.
  9. Optane-cached NVRAM isn't the same as an Optane SSD. Some cheap SSDs use a small amount of Optane as a cache in front of a larger amount of traditional MLC/TLC NVRAM storage. Those may not get you what you need. See the warning above about SSD caching under sustained loads.
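
As promised in item 7, here's a hedged sketch of the partitioned-Optane layout on a FreeBSD-based system such as TrueNAS CORE. The device names (nvd0/nvd1), the GPT labels and the 16G SLOG size are assumptions for illustration only; adjust for your own hardware, and bear in mind the webUI normally expects to manage the pool itself.

    # Partition each of two Optanes: a small SLOG slice plus the rest for the special vdev
    # (nvd0/nvd1, the labels and the 16G size are illustrative assumptions)
    gpart create -s gpt nvd0
    gpart add -t freebsd-zfs -l slog0 -s 16G nvd0
    gpart add -t freebsd-zfs -l special0 nvd0
    gpart create -s gpt nvd1
    gpart add -t freebsd-zfs -l slog1 -s 16G nvd1
    gpart add -t freebsd-zfs -l special1 nvd1

    # Mirror each role across the two physical drives (never across two partitions of one drive)
    zpool add tank log mirror gpt/slog0 gpt/slog1
    zpool add tank special mirror gpt/special0 gpt/special1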

SSD BRANDS AND MODELS TO CONSIDER:

As of August 2020, if you can afford it, Optane 900p or 905p range for special vdevs. Expensive but by now you'll know why.

If not, then anything Intel NVMe/datacentre, or Samsung Pro SSDs (ideally not EVO, and NEVER QVO or PM-series and other OEM models, as these have great read rates but very poor mixed/write rates, or use cache to mask serious QLC NVRAM drawbacks). Both are very good, and widely regarded as the most dependable SSDs out there for solid, serious use. Second-hand eBay is fine for those.

Aim for SSD models that self-describe as NVMe, U.2, or M.2 "M" key (only!) - it doesn't matter if they are tiny SSD cards, PCIe devices or 2.5 inch form factors; as long as they are one of those, they will work at NVMe speeds. Avoid SATA, M.2 "B" key, and M.2 SATA if possible, but that said, at a pinch a good SATA SSD from the above model ranges will still make a hell of a difference over HDD and might be enough. I didn't bother trying those, however.

PCIe and M.2 SSDs and adapters come in both NVMe and SATA variants. To be sure which you have, check whether it says NVMe/"M" key/U.2 (it's NVMe) or SATA/"B" key (it's SATA). Also, if it can push more than about 125k IOPS or 800 MB/sec, it's probably NVMe, as those figures are beyond what SATA can do.

U.2, M.2 "M" key, and anything that explicitly says it's NVMe all use PCIe x4 for their connectivity, so an adapter is literally just connecting differently laid-out electrical contacts - it's all the same, electrically. M.2 "B" key and SATA disks use SATA signalling, over the M.2 slot and a standalone SATA bus respectively. They aren't the same.
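
If you're unsure what's actually in a machine, a FreeBSD-based system such as TrueNAS CORE can tell you from the shell with the standard base tools (nothing exotic here):

    # NVMe devices are listed here (they appear as nvme*/nvd* device nodes)
    nvmecontrol devlist
    # SATA/SAS devices are listed here instead (ada*/da* device nodes)
    camcontrol devlist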

As at August 2020, I wouldn't consider anything else, and you shouldn't need to. These SSDs will have to handle long-term sustained mixed 4k RW loads and heavy levels of data writes, and most SSDs either won't deliver their headline figures on that kind of activity for long periods, or have less dependable controllers. The SSDs you use need to be fast enough and low enough latency that they almost don't need internal DRAM caching, or at least don't suffer too badly when it's fully used; otherwise, when the SSD cache fills, performance can plummet (look for reviews with sustained read/write/mixed I/O charts).

Which basically means: Optane 900p/905p first by a dozen miles, then Intel datacentre (P37xx, or at a pinch 750/P36xx, all cheap second-hand), then Samsung 850-970 Pro, only because they are built like tanks for reliability and efficiency among enthusiast SSDs, then, if desperate, Samsung 850-970 EVO as a last-ditch resort. Among all those: NVMe, M.2 "M" key/PCIe if at all possible. And nothing else.

(OK, one exception: Mayyyyyybe Intel P48* Optane if you need datacentre guarantees, have thousands to spend, and 900p/905p isn't good enough!)

Last, for performance on a demanding pool, do not be tempted to shortcut with hybrid drives (an HDD with a few tens of GB of SLC or Optane cache to speed it up). The sheer random 4k IOPS demand on these devices means the cache will be hopelessly undersized, and the drive will mostly end up working at HDD speeds (the speed of the backing storage it's caching). You need the entire storage at SSD speeds.
As for power loss, honestly, unless you're mission-critical the worst that will happen is that the pool rolls back a little - the last small amount of writes. But that's what HDDs traditionally do on power loss anyway. Graceful handling (what's already been written staying good and readable on reboot, even in bad power-off situations) matters a lot more, and most modern controllers do that. Optane is so fast it almost doesn't need any power-loss mitigation anyway, so that's another plus.

SPECIAL VDEV SIZING:

As at August 2020, assume you may need them bigger than you think. I'm still trying to understand why. My pool has 190M DDT entries at ~750 bytes each according to ZFS (zpool status -Dv), and regular metadata is typically said to be 0.1-0.2% of pool size. At 40 TB, I expected the special vdev to need maybe 200 GB or so, but it's well over 300 GB - I don't yet know by how much, or why. And that's just right now, newly rewritten and unfragmented, not allowing for future data and fragmentation. So err on the side of a lot of caution with size - though you can always add a second special vdev later. The giveaway will be a special vdev sitting at about 76% or 86% full in the zpool list -v capacity column. That means ZFS has stopped trying to fill it - there's a tunable for how full a vdev will be filled if the others aren't that full, usually 75% I think; I increased mine to 85%. After that it will begin dumping metadata back onto HDD instead.
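
To make the arithmetic behind that expectation explicit (using the figures above; your numbers will differ), plus the commands for checking your own pool ("tank" is a placeholder):

    # Rough sizing estimate from the figures quoted above:
    #   DDT:       190,000,000 entries x ~750 bytes  = ~142 GB
    #   Metadata:  0.1-0.2% of a 40 TB pool          = ~40-80 GB
    #   Expected:  ~180-220 GB  (observed usage was actually 300+ GB, so leave generous headroom)

    # DDT entry counts and per-entry sizes
    zpool status -Dv tank
    # Per-vdev capacity, including how full the special vdev is
    zpool list -v tank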

The problem then is that even if you add another SSD special vdev, the metadata/DDT that has *already* spilled onto HDD won't be moved back until it's rewritten - i.e., effectively never, or not until you SSD-ise or replicate/rewrite your entire HDD pool (zpool remove won't work, as the spilled metadata is striped across ALL the HDDs once it no longer fits on the special vdevs). So you really don't want to get close to special vdev capacity if you can avoid it, because you won't know it's happening until it's already happening.


SPECIAL VDEV STRUCTURE:

As this is crucial data: mirrors, always, for redundancy and efficiency (and resilver speed if needed). And only mirrors. You do *not* want a special vdev on parity RAIDZ. If the webUI says it wants them as RAIDZ, override it.