
A bit about SSD performance and Optane SSDs, when you're planning your next SSD....

NOTE:

In this page I'll be referring to a type of SSD developed by Intel and Micron, called 3D XPoint (pronounced "crosspoint"). It's most widely sold as Intel's Optane. The Optane devices I mean are things like the 900p, 905p, and P48xx. First-generation Optane, and some of the smaller devices, aren't much use for this.

To see why I single out that kind of SSD for performance, you need to know something about SSD sustained, random, and mixed I/O. That's what this page is about. A lot of reviews don't cover this stuff. But if you're building a file server, you ought to be aware.....

I'll come back to Optane at the end.


"MY SSD SPECS ARE GREAT! AREN'T THEY...??"

As is widely known, 4k random reads and (even more so) writes are probably the single worst load for spinning disks: they have to physically move the read-write heads and wait for spinning metal to do it. But I've got SSDs, so problem solved, right?

Wrong.

What's less well known is that, with the current exception of Optane and battery-backed DRAM cards ONLY, sustained mixed reads and writes are absolutely capable of trashing top enthusiast and datacentre SSD performance. Not all reviews publish that information, but try to find ones that do when you choose an SSD for anything except pure SLOG (ZIL).

With almost every SSD, you get a classic "bathtub" curve: very good for pure reads and pure writes, but dreadful for sustained mixed RW. That Intel datacentre SSD that does 200k IOPS? Reckon on perhaps as little as 10-30k IOPS when it's used on mixed 4k RW loads. As of 2020, only 2nd-generation-onwards Optane breaks that pattern. Not your Samsung Pro SSDs, not your Intel 750 or P3700 write-oriented datacentre NVMe SSDs. Optane and pure battery-backed RAM cards only.
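(If you'd rather measure this yourself than hunt for reviews, the sketch below is one way to run a sustained mixed 4k random read/write test with fio. It's only a sketch: the target /dev/da5 is a hypothetical scratch SSD, the test destroys whatever is on it, and you may need a different ioengine or direct setting on your platform.)

    # Sustained mixed 4k random RW: ~70% reads / 30% writes, queue depth 32,
    # run for 10 minutes so any DRAM/SLC cache fills up and steady state shows.
    # WARNING: destroys data on the target device. /dev/da5 is hypothetical.
    fio --name=mixed4k --filename=/dev/da5 \
        --rw=randrw --rwmixread=70 --bs=4k \
        --ioengine=posixaio --iodepth=32 --numjobs=4 \
        --direct=1 --time_based --runtime=600 --group_reporting

Watch the IOPS fio reports over the whole run, not just the first minute - that's where the bathtub shows up.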

I should clarify: that collapse is nothing to do with SSDs having too-small DRAM or SLC caches. It's inherent in the NAND flash chips themselves. Because it's nothing to do with the device's cache type or size, a "better" SSD, or one with a "better" cache or no cache at all, won't help much.

The problem is that the NAND flash chips SSDs are made from are awkward to write and erase. That means they're slow at it, need caching, and need a clever controller to get the most out of them and paper over the slow parts where possible.

If I refer specifically to Optane now and then in resources I write, that's why. Not because I'm a fan-boi, but because Optane uses a different technology that's largely immune to a fundamental flash problem which severely affects every other SSD on the market.

Just look at these graphs..... and remember: Intel are usually reckoned to make the best datacentre SSDs. In these graphs, Intel and Tom's Hardware are directly comparing Optane to Intel's top-performing non-Optane drive, the P3700 NVMe. Also remember that current Optane is fast enough at raw I/O not to need caching, and to be almost power-loss safe without specialised capacitors and circuitry.



Now another graph. This is the output of zpool iostat -pw: the I/O latency and responsiveness of my pool during a really heavy workload. Low latency is important generally, but it's really crucial for sync writes/ZIL, as well as for dedup metadata.
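(If you want to watch the same thing on your own system, this is all it takes - the pool name "tank" and the 10-second interval are just examples:)

    # Per-vdev latency histograms, refreshed every 10 seconds.
    # -w = latency histograms, -p = exact (parsable) values in nanoseconds.
    zpool iostat -pw tank 10

The sync write queue wait columns are the ones that matter most for ZIL/SLOG.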

[Image: Optane.png - zpool iostat -pw latency histograms for the pool]

Spot the Optane? At least, I *think* that's the Optane, I can't imagine the sync write latency reflects anything except ZIL/SLOG time. Look how beautifully tight the clustering is on the sync write queue! Not just extremely low, but uniformly extremely low. In fact, as we can see, it never gets to be anything other than extremely low!

The rule of thumb in ZFS is: don't have one physical device doing two workload roles (even if it's mirrored or redundant). But as we can see, this doesn't apply to Optane. The workload above is with mirrored Optane 905p's, partitioned to host both ZIL/SLOG and a special metadata vdev. End result - it's still awesome, even running both workloads. To remind you, that means the same SSDs doing 255 ns to 4 µs latency for ZIL/SLOG are *also* handling metadata and dedup table I/O at full speed, during an intensive session.

(If you're interested, that dump is from a 35 TB local zfs send | recv replication, and the average pool replication rate is around 400 MB/sec on a quad core Xeon. I could get it faster with more exact tuning, but didn't see a point.)
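(For reference, a local replication like that is essentially a recursive snapshot piped from zfs send into zfs recv. A minimal sketch - "oldpool" and "newpool" are hypothetical, and the exact flags you want depend on your layout:)

    # Recursive snapshot of the source, then stream the whole tree into
    # the new pool. -R sends all descendant datasets, snapshots and
    # properties; -u keeps the received datasets unmounted for now.
    zfs snapshot -r oldpool@migrate
    zfs send -R oldpool@migrate | zfs recv -duF newpool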

One last latency output. This one is 28 hours (24 TB) into the 35 TB replication:

[Image: latency3.png - latency/slow-I/O output 28 hours into the replication]

Notice the hugely different latencies. I've configured the system to treat any disk I/O that takes longer than 200 milliseconds as "slow" (sysctl vfs.zfs.zio.slow_io_ms=200). You can then see the stats using zpool status -s. It's intended to detect gradually failing drives, but it's also a good way to find drives that are being queued up so heavily that they can't respond quickly, and to see how often that's happening.
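(To do the same on TrueNAS 12 / OpenZFS 2.0 - the pool name "tank" is an example, and the sysctl won't survive a reboot unless you also add it as a tunable:)

    # Count any disk I/O slower than 200 ms as a "slow" event...
    sysctl vfs.zfs.zio.slow_io_ms=200

    # ...then show per-device slow I/O counts alongside the usual status.
    zpool status -s tank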

  • The pool being *read* was created under 11.3. Its hard drives have typically reported around 0.8 - 2.4 million slow access events.
  • The HDDs in the newly created 12.0 pool are exactly the same model of HDD. They are also 2-way mirrors, although in 4 rather than 3 HDD vdevs. They are writing, which is typically quite a bit slower than reading, but they also have full use of the OpenZFS 2.0 tunables and features, and of special vdevs for metadata and dedup (both pools are fully deduped). We can also allow that they see fewer accesses, because the data can be written in large chunks rather than piecemeal laid down over time. After allowing for all of that, the OpenZFS 2.0 pool has a typical "slow access" event rate of about 13-18 thousand per device.
  • The Optane special vdevs in the 12.0 pool are reporting just ~900 slow events, or about 6% of the HDD slow event rate in that pool. That's despite handling all file and pool metadata, and all dedup data (although a lot of that is cached in ARC, so it doesn't need reading back as often).

Short version: the 12-BETA HDDs were "slow" (>200 ms) for disk I/O at about 1% of the rate the 11.3 pool is showing. And because these are total event counts, it also confirms they aren't just running at low latency, they're running at consistently low latency.

I'm still fine-tuning, but here's one other gain. A mass recursive delete of 957 snapshots across 17 datasets (16,269 snaps total) just took 48 minutes. That's a dedup snap destroy rate down from tens of seconds per snap before special vdev + Optane, to about 1/6 of a second each after moving to 12-BETA and adding them.

That gain is almost entirely down to one reason: the metadata and dedup tables have been offloaded to a fast SSD vdev in 12-BETA, which can't be done in 11.3 or earlier.

Big win!
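(If you're wondering what a mass recursive destroy like that looks like in practice, it's essentially a loop over snapshot names, with each name destroyed recursively across the child datasets. A rough sketch - "tank/data" and the "auto-" snapshot prefix are hypothetical:)

    # For each auto- snapshot on the parent dataset, destroy that snapshot
    # recursively in all child datasets. Names here are hypothetical.
    for snap in $(zfs list -H -t snapshot -o name tank/data | grep '@auto-' | cut -d@ -f2); do
        zfs destroy -r "tank/data@${snap}"
    done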

WHAT DOES THIS MEAN IN REALITY?

In no specific order.....
  1. Choose your SSDs carefully. Your choice of SSD needs careful research. Good read and write specs are only part of the story. Try to find a review that also covers sustained mixed RW, and factor that into your decision.
  2. Be cautious with SSDs that have great stats while caching, but tank under ongoing use once the cache is full. SSD designs that use caching will speed up the first part of a transfer. After the cache fills, it's often a different story. Whatever speed you get then, well..... that's what you get. And it's not always great under load.
  3. Consider your server's workload, and whether fast mixed I/O is something you need or something that will hold you back. If you need (or want) ultra high speed and low latency, the options as of 2020 are literally: 1) first tier, the Optane 900p/905p/P48xx series, or battery-backed DRAM cards if you can find them; 2) very much second tier, absolutely everything else on the market.
    If you also need good sustained mixed RW at smaller block sizes, Optane can be on a different planet of consistent high performance. There are very good SSDs that will work well in a datacentre or a home server. Anything Intel, and anything Samsung (especially PRO, or at least EVO, not QVO!), are pretty much everyone's "most reliable bet". But no matter how the graphs show that some SSD will be awesome at low queue depth, ask yourself whether mixed I/O applies to you, and if it does, check out reviews of that SSD's sustained mixed 4k RW. Some SSDs will be good enough; many will let you down.

    Optane specific:
  4. Use Optane for ultra-low-latency roles - ZIL/SLOG and metadata - if you need them. A side effect of Optane's technology is that, with no awkward read-erase-write cycles, it has very, very low latency. Use Optane without question for ZIL/SLOG (needs superb latency) and dedup/special vdevs (need very efficient mixed 4k I/O), and possibly also for metadata and L2ARC.
  5. If you plan to use Optane, consider buying two Optanes and splitting them (using part of each as mirrors for each role - there's a sketch of one way to do it after this list). They're fast and reliable enough to handle it, and it will save you cost on a pretty expensive SSD.
  6. More specifically, Optane devices are probably the only good exception right now to the ZFS rule of thumb about never using a single device for multiple ZFS roles, because of their exceptional capability under those tricky ultra-low-latency mixed random I/Os.
    You can validly use partitioned, mirrored Optanes for any combination of ZIL/SLOG, L2ARC, and special metadata/dedup table vdev. Especially since ZIL/SLOG often only needs to be small, there's no need to burn extra cash on two more Optanes.
    Pro tip: if you have an Optane special vdev, there isn't much point putting L2ARC on it as well. The server will pull that data off the pool's own metadata vdev fast enough that you won't see a benefit from also having it in L2ARC. If you use tunables to reserve L2ARC for pool file data, then there could be an argument for Optane L2ARC, as it wouldn't be duplicating the Optane special vdev.
  7. Consumer Optane isn't officially "power loss protected", but it's extremely close - probably close enough to count as safe for most of us. Because it doesn't cache and its write times are incredibly fast, many home server users and small businesses can treat it as if it were protected. I gather these devices were once listed as power-loss protected by Intel, and the designation was later withdrawn. But they are expensive, and you may not need them.
  8. An Optane-cached drive isn't the same as an Optane SSD. Some cheap SSDs use a small amount of Optane as a cache in front of a larger amount of traditional MLC/TLC NAND storage. Those may not get you what you need - see the warning above about SSD caching under sustained loads.
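For points 5 and 6, here's a rough sketch of what splitting a pair of Optanes between a mirrored SLOG and a mirrored special vdev can look like on FreeBSD/TrueNAS. Every device name, label, and size below is hypothetical - adjust for your hardware:

    # Partition each Optane (nvd0 and nvd1 here) into a small SLOG slice
    # and a larger slice for the special (metadata/dedup) vdev.
    gpart create -s gpt nvd0
    gpart add -t freebsd-zfs -l slog0 -s 16G nvd0
    gpart add -t freebsd-zfs -l special0 nvd0

    gpart create -s gpt nvd1
    gpart add -t freebsd-zfs -l slog1 -s 16G nvd1
    gpart add -t freebsd-zfs -l special1 nvd1

    # Add the mirrored SLOG and the mirrored special vdev to the pool "tank".
    zpool add tank log mirror gpt/slog0 gpt/slog1
    zpool add tank special mirror gpt/special0 gpt/special1

Bear in mind that losing the special vdev loses the pool (hence the mirror), and a special vdev can generally only be removed again if the whole pool is made of mirrors, so plan the partition sizes up front.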