Sanity check on test SLOG performance

Joined: Oct 18, 2018 | Messages: 969
So I'm looking into whether a SLOG device would be worth the money for me or not. I did some tests, which I describe below, and I am somewhat surprised by the results. Specifically, with my test SLOG devices it seems I was not able to get close to saturating my 1G network. If folks have suggestions for better ways to run these tests, or settings I can change to improve performance, please let me know.

My system

FreeNAS Release: FreeNAS-11.2-RELEASE-U2
Board: SuperMicro X11SSM-F
Processor: Intel Core i3-7100 @ 3.90GHz, 2 Cores
Memory: 32 GB (2x 16GB)
HBA: LSI/Broadcom SAS9207-8i, 6Gbps SAS PCI-E 3.0 HBA, Firmware 20.00.07.00
Storage Pool 1: 1 vdev
vdev-0 RAIDZ2:
2x 7200RPM 3TB Seagate Constellation ST3000NM0033
4x 5900RPM 3TB Seagate IronWolf NAS ST3000VN007
Storage Pool 2 (encrypted): 1 vdev
vdev-0 RAIDZ2:
3x 7200RPM 2TB mixed desktop drives
3x 5900RPM 2TB Seagate IronWolf NAS ST2000VN007
Test SLOG devices: 2x 240GB Samsung 970 EVOs plugged into a Supermicro AOC-SLG3-2M2 PCIe3.0x8 -> 2x M.2 @ PCIe3.0x4. No PLP, so obviously poor long-term SLOG devices, but I had them lying around to test with.

The Tests

I did not write or use a script to process the data automatically. For the network speed tests I took samples and averaged them, and for both I/O and CPU idle % I watched the values and recorded the stable range. (A rough sketch of how the sampling could be scripted is included after the footnotes below.) All tests were done while both machines were connected to a router over wired 1G connections.

Network performance*: netstat -w 1 -I igb0. 2-3 sets of 20-30 values were taken, averaged, and converted to Mb/s.
Disk performance: zpool iostat -v <pool> 1
CPU performance**: top -P
Large file transfer: Using an NFS share I transferred a directory to the pool containing 10 1.024GB files created with for n in {1..10}; do dd if=/dev/urandom of=${n}.txt bs=64000000 count=16; done. Time to complete was recorded by simple timing with a digital watch.
Small file transfer: Using an NFS share I transferred a directory to the pool containing 10000 1.024MB files created with for n in {1..10000}; do dd if=/dev/urandom of=${n}.txt bs=1024 count=1000; done. Time to complete was recorded by simple timing with a digital watch.


* Network speed for the first benchmark was taken directly from iperf.
** I only performed CPU benchmarks where listed. Given the high idle percentage and the similar network speeds of both pools with the SLOG, I don't think the CPU is the bottleneck.
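To avoid the manual averaging next time, something like the following could do the network sampling automatically. This is only a sketch: it assumes column 4 of netstat -w's output is the input-bytes column for igb0 and averages 30 one-second samples.

Code:
$ netstat -w 1 -I igb0 | awk '$4 ~ /^[0-9]+$/ { sum += $4; n++ } n == 30 { printf "%.0f Mb/s\n", sum * 8 / n / 1e6; exit }'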

Initial Benchmarks

Code:
[client] $ iperf3 -c <ip> -F <1.024G file> -f k
[server] $ iperf3 -s

Result: 935Mb/s

Code:
$ mdconfig -a -t swap -s 6g -u 1
$ zpool add <pool> log md1
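(To undo this after testing, something along these lines should detach the ramdisk log again; the names match the commands above.)

Code:
$ zpool remove <pool> md1    # detach the ramdisk log vdev from the pool
$ mdconfig -d -u 1           # destroy the memory disk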

Large file results
Network: 840Mb/s
Disk: pool 170-240MB/s; vdev 70-130MB/s; SLOG 100-110MB/s
CPU: 80-85% idle
Time: 1:41

Small file results
Network: 670Mb/s
Disk: pool 130-210MB/s avg 150MB/s; vdev 65-120MB/s avg 70MB/s; SLOG 75-85MB/s
CPU: 80-85% idle
Time: 2:06

These values looked pretty reasonable to me. For large files I was nearly able to saturate my 1Gb/s network. Considering encoding overhead, the router, etc., I would guess that with the ramdisk the network was the bottleneck.

Actual Tests

Without SLOG

Encrypted Pool Large file results
Network: 323Mb/s
Disk: pool 75-110MB/s
CPU: -
Time: 4:48

Encrypted Pool Small file results
Network: 123Mb/s
Disk: pool 20-50MB/s
CPU: -
Time: 11:23

Regular Pool Large file results
Network: 335Mb/s
Disk: pool 50-100MB/s avg 75MB/s
CPU: -
Time: 4:26

Regular Pool Small file results
Network: 135Mb/s
Disk: pool 20-60MB/s
CPU: -
Time: 11:16

With SLOG

Encrypted Pool Large file results
Network: 460Mb/s
Disk: pool 90-120MB/s; vdev 40-65MB/s; SLOG 50-55MB/s
CPU: -
Time: 3:10

Encrypted Pool Small file results
Network: 490Mb/s
Disk: pool 120MB/s; vdev 60-65MB/s; SLOG 50-60MB/s
CPU: -
Time: 2:57

Regular Pool Large file results
Network: 466Mb/s
Disk: pool 80-120MB/s avg 115MB/s; vdev 30-65MB/s avg 60MB/s; SLOG 50-60MB/s
CPU: 70-90% idle
Time: 3:06

Regular Pool Small file results
Network: 445Mb/s
Disk: pool 80-120MB/s avg 116MB/s; vdev 40-65MB/s avg 60MB/s; SLOG 50-55MB/s
CPU: 85-90% idle
Time: 3:14

Because the ramdisk performed so well, I assume the bottleneck isn't my pool, even though it is RAIDZ2 and I know that isn't as performant as several striped mirror vdevs. It appears that with the ramdisk I am bottlenecked by my network, but with the Samsung 970 EVOs I am bottlenecked by the SLOG devices themselves, despite them being listed as capable of much greater write speeds. I have not yet overprovisioned the devices or changed any other settings, and I would love advice on better ways to test my system and on specific settings/tweaks I should be making.
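For reference, if I do end up overprovisioning a device, my understanding is it would look roughly like this on FreeBSD. This is a sketch only; the device name, label, and partition size are placeholders I haven't actually run.

Code:
$ gpart create -s gpt nvd0                       # assumes the NVMe drive shows up as nvd0
$ gpart add -t freebsd-zfs -s 16G -l slog0 nvd0  # small partition; the rest of the flash stays unused
$ zpool add <pool> log gpt/slog0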

Edit: If I decide to get SLOG devices, I've looked around and figure I should get one of the following:
2 x 120GB Samsung SM863 SATA SSDs
2 x 100GB Intel Optane SSD DC P4801X M.2 PCIe x4 SSDs
I would happily consider other devices, keeping in mind that I have two pools to add SLOG devices to, plenty of available SATA ports (hence the SM863s), two M.2 slots thanks to that adapter, and no remaining PCIe slots, since I have plans for my final x4 slot.
 

HoneyBadger (actually does care)
Administrator | Moderator | iXsystems
Joined: Feb 6, 2014 | Messages: 5,112
Test SLOG devices: 2x 240GB Samsung 970 EVOs plugged into a Supermicro AOC-SLG3-2M2 PCIe3.0x8 -> 2x M.2 @ PCIe3.0x4. No PLP, so obviously poor long-term SLOG devices, but I had them lying around to test with.

To simplify, the lack of PLP on these devices directly results in their inability to operate as a high-performance SLOG.

When ZFS pushes data to the ZIL, it requests that the devices backing it flush their volatile write cache. A device with true PLP for in-flight writes will utilize a supercapacitor, capacitor bank, or other means of stored energy to assure that the contents of the drive's cache can be committed to NAND in case of a power loss, so it can respond with "data is safe" right away and then asynchronously flush its own cache.

Since the 970 EVO lacks power-loss-protection for in-flight writes, the only way it can satisfy this request is to immediately commit the data to NAND. While SSDs are fast, they still require some time to shift the necessary electrons around and reply "Okay, it's done."

In essence, proper PLP is a requirement for any kind of decent SLOG performance. Some very early SSDs were reported to simply ignore cache flushes and always respond as if they had PLP, but any modern SSD from a manufacturer that cares about its reputation doesn't do that anymore.
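If you want to see this effect directly on your own hardware, something like the following makes the flush penalty fairly obvious; the dataset name is just an example. Force sync on a test dataset, copy a large file in, and watch how slowly the log vdev takes writes compared to your ramdisk run.

Code:
$ zfs set sync=always <pool>/test   # every write has to be acknowledged by the ZIL/SLOG
$ zpool iostat -v <pool> 1          # the "logs" section shows the SLOG write rate
$ zfs inherit sync <pool>/test      # restore the default afterwards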

Edit: If I decide to get SLOG devices, I've looked around and figure I should get one of the following:
2 x 120GB Samsung SM863 SATA SSDs
2 x 100GB Intel Optane SSD DC P4801X M.2 PCIe x4 SSDs
I would happily consider other devices, keeping in mind that I have two pools to add SLOG devices to, plenty of available SATA ports (hence the SM863s), two M.2 slots thanks to that adapter, and no remaining PCIe slots, since I have plans for my final x4 slot.

The Optane P4801X drives will absolutely crush those poor Samsung units; not only are they NVMe vs. SATA, but their 3D XPoint media makes them some of the fastest SLOG candidates out there. In order to beat them, you have to enter the weird and wonderful world of non-volatile RAM cards or NVDIMMs. Several users have benchmarked the Optane cards in the thread in my signature; here's a direct link to the results you could expect from the Optane P4801X 200G:

https://www.ixsystems.com/community...nding-the-best-slog.63521/page-10#post-524347
 
Joined: Oct 18, 2018 | Messages: 969
@HoneyBadger thanks for the reply.

Since the 970 EVO lacks power-loss-protection for in-flight writes, the only way it can satisfy this request is to immediately commit the data to NAND. While SSDs are fast, they still require some time to shift the necessary electrons around and reply "Okay, it's done."
What does this imply about PLP and how these 970s are vulnerable to loss of power? If these drives wait for data to be committed to NAND before reporting "okay, it's done," at what point in the process could data exist only in memory and not yet be in non-volatile storage on the 970s after they have reported "I've got it," thus exposing sync writes to power loss? I'm sure my confusion comes from my naive understanding of the risk in general, which is basically that, for sync writes on non-PLP devices, it is possible for the system to flush data from RAM to the SLOG but for the SLOG device to report "I've got it" when in fact it hasn't yet committed it to storage that is resilient to power loss; hence the risk of data loss for sync writes when using a SLOG without PLP.

not only are they NVMe vs. SATA
I believe the 970s I tested with are also NVMe.
Samsung Specification Page said:
INTERFACE:
PCIe Gen 3.0 x4, NVMe 1.3



The Optane P4801X drives will absolutely crush those poor Samsung units
This is certainly what I want to hear. I have seen the benchmarks you listed above but wanted to see the actual performance from my machine. I'd like to make it so that my bottleneck is my network.


Did you decide upon your solution?
Halfway. I think I will go with the P4801X 200G devices for one of my pools; for the other I'll need to find a SATA solution, as I'm out of PCIe and M.2 slots. I'm not sure which SATA option looks best, though. I'm having difficulty finding good SATA devices that have PLP; the best I've found so far are refurbished 120GB Samsung SM863 SSDs. I haven't done as much research into the SATA drives yet, so I'm happy to look at suggestions.
 

HoneyBadger (actually does care)
Administrator | Moderator | iXsystems
Joined: Feb 6, 2014 | Messages: 5,112
What does this imply about PLP and how these 970s are vulnerable to loss of power? If these drives wait for data to be committed to NAND before reporting "okay, it's done," at what point in the process could data exist only in memory and not yet be in non-volatile storage on the 970s after they have reported "I've got it," thus exposing sync writes to power loss? I'm sure my confusion comes from my naive understanding of the risk in general, which is basically that, for sync writes on non-PLP devices, it is possible for the system to flush data from RAM to the SLOG but for the SLOG device to report "I've got it" when in fact it hasn't yet committed it to storage that is resilient to power loss; hence the risk of data loss for sync writes when using a SLOG without PLP.

Once sync writes are enabled, there's no longer a risk to data in flight, unless you have one of the very early SSDs that lied in response to cache flush requests. The issue is that, without PLP (as on the Samsung 970s), those sync writes will be painfully slow due to the requirement to push data to NAND for every write.

(Optane performance) is certainly what I want to hear. I have seen the benchmarks you listed above but wanted to see the actual performance from my machine. I'd like to make it so that my bottleneck is my network.

Actual performance will depend on the record size being written. Since you're only talking about 1Gbps networking, that should be doable with the P4801X, as even the smallest possible recordsize of 4K shows a result of about 170MB/s.
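If you want to sanity-check a candidate device on your own hardware before committing to a second one, the approach used in that benchmarking thread is diskinfo's synchronous write test, roughly as below. Note that the test is destructive, so only run it against a drive with nothing on it, and the -S flag needs a recent enough FreeBSD/FreeNAS build; the device name is just an example.

Code:
$ diskinfo -wS /dev/nvd0    # destructive sync-write benchmark of the bare device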

Halfway. I think I will go with the P4801X 200G devices for one of my pools; for the other I'll need to find a SATA solution, as I'm out of PCIe and M.2 slots. I'm not sure which SATA option looks best, though. I'm having difficulty finding good SATA devices that have PLP; the best I've found so far are refurbished 120GB Samsung SM863 SSDs. I haven't done as much research into the SATA drives yet, so I'm happy to look at suggestions.

Look at the Intel DC S3700; it's one of the fastest and most consistent SATA drives. It will still be significantly slower than an NVMe device like the P3700, though.
 
Joined: Oct 18, 2018 | Messages: 969
Once sync writes are enabled, there's no longer a risk to data in flight, unless you have one of the very early SSDs that lied in response to cache flush requests. The issue is that, without PLP (as on the Samsung 970s), those sync writes will be painfully slow due to the requirement to push data to NAND for every write.
Ah, this is what I was expecting based on your earlier explanation. So would you say it isn't entirely true that using the Samsung 970s as a SLOG entirely defeats the purpose of having one? There are plenty of posts warning that if you're going to use a device without PLP, you shouldn't use it at all and should just turn off sync writes, though perhaps those all relate to devices which "lie" about the state of the data. I'm asking more for personal clarification than out of any intention to use the 970s; I wasn't impressed enough with their performance, and I'd rather spend the money on better devices.

Look at the Intel DC S3700, it's one of the fastest and most consistent drives. It will still be significantly slower than an NVMe device like the P3700 though.
Great, thanks for the recommendation. That one is also discontinued, like the Samsung device I listed earlier. Are manufacturers simply discontinuing SATA SSDs with PLP?
 

HoneyBadger (actually does care)
Administrator | Moderator | iXsystems
Joined: Feb 6, 2014 | Messages: 5,112
So would you say it isn't entirely true that using the Samsung 970s as a SLOG entirely defeats the purpose of having one?
A lot of people conflate the concept of "sync writes" in general with "presence of an SLOG." The way you get your data to be safe is to use sync writes; then, in order to regain some of that performance, you attach an SLOG device. Since the 970 EVO isn't particularly fast at sync writes, it defeats the purpose of the SLOG, but it doesn't invalidate the sync writes being requested (and fulfilled).
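To make the distinction concrete, the two knobs are independent: sync behavior is a dataset property, while the SLOG is a vdev you add to the pool. Roughly, with placeholder pool/dataset/device names:

Code:
$ zfs set sync=standard tank/nfs   # honor client sync requests (the safe default)
$ zfs set sync=disabled tank/nfs   # fast but unsafe: clients are told "done" before data is on stable storage
$ zpool add tank log nvd0          # a dedicated log vdev speeds up honest sync writes instead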

Great, thanks for the recommendation. That one again is discontinued similar to the Samsung device I listed earlier. Are manufacturers simply discontinuing SATA type SSDs with PLP?
There are still a few - the Samsung 883 DCT, some Intel D3-S series drives - and some SAS offerings from Toshiba and HGST, but a large number of devices designed for high write endurance or "enterprise" workloads have gone to NVMe for the higher performance ceiling.
 