Sharing SLOG devices (striped SSDs partitioned for two pools)

Status
Not open for further replies.

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
Experts,

I'm looking for the proper commands for adding a LOG partition to a pool by partition rather than by whole device.

This post (https://forums.freenas.org/index.php?threads/bad-performance-with-mirrored-ssds-as-slog.18262/) shows the proper way to create a single partition and add an underprovisioned LOG device, but I want to be sure I'm doing it right when adding two partitions to the devices and then adding them to the pools correctly.

What do I want to do?
I want to take two S3700s, create two partitions on each, and add them as striped log devices for two pools, essentially allowing each pool to spread its writes across both devices instead of having a single dedicated device per pool. Wait until further down to start arguing this :)

Code:
gpart create -s GPT daXX
gpart add -t freebsd -a 4k -s 8G daXX (for 8G SLOG)
zpool add pool log daXXp1


I still can't figure out how to do this correctly for two same-size partitions on the same device.

After I create the two partitions (the part I can't figure out), I think I need to do:
Code:
zpool add pool1 log daX1p1 daX2p1  (striped, to the first pool)
zpool add pool2 log daX1p2 daX2p2  (striped, to the second pool)



Wait, But why?
I have read a good amount of posts where people try to share a single SSD for cache and log. That's not what I'm trying to do here, the devices are still going to be dedicated to logs. I'm going to have two devices dedicated to LOG, just for two pools.

The arguments against (from my always growing knowledge base):

The whole point of having faster and faster SLOG devices (in sync-write scenarios) is to get that acknowledgement back to the OS quickly while the system gets the transaction groups written down to the pool, and the device has to be reliable in case of power loss (this thread covers not using inappropriate SSDs: https://forums.freenas.org/index.php?threads/correct-way-to-remove-slog.43401/), right? So, overall, we want as little latency as possible.

Sharing a device for any purpose other than being a log device would, or at least could, slow that down, with competing IO potentially delaying that acknowledgement, right? If that were the case, we'd see the result of the competing IO on the SLOG device as higher latency at the initiator, correct? (I'm betting so)

The arguments for:
If given the opportunity to spread LOG IO across multiple devices, we can (hopefully/potentially) (1) lower our acknowledgement latency, and (2) double our throughput in those rare heavy sequential-write scenarios. (I say rare because my system is supporting virtualization, not copying large media files.)

So, in an instance where one of the pools is not very busy, the other pool gets the benefit of two devices absorbing LOG IO if I stripe them; whereas if each pool has its own dedicated device, one pool's idle SLOG can't benefit the other.

Overall fun part:
The rub is when both pools are busy and both are throwing writes at the shared SLOG. This is ultimately what I'm interested in. I want to know/quantify whether two pools sharing a set of striped SSDs introduces more latency than a single dedicated SSD each, and if so (which I'd bet it does, by the way), by how much? What percent increase in latency is there, and for what percent of the time? Does the benefit of having striped LOG devices with more throughput outweigh the latency introduced when/if both pools are busy? How often is that? All of this is workload/system/configuration specific, and that's why I want to test it in our lab scenario under actual loads, not synthetic tests. Exciting!

(Please answer my questions about getting the commands right before you post opinions on the overall concept, thanks!!)
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Interesting point.

One word of caution: if you lose EITHER of the SLOG devices during a recovery reboot, you could lose the data that was in flight. Depending on the use case, that could be bad enough to require a restore from backups (for example, zvols used for VM storage).

That's why ZFS now supports mirrored SLOGs. Remember, the WHOLE point of a SLOG is to make a reliable copy of the ZIL entries, fast. If you don't want reliable, you don't need a SLOG.
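
For reference, adding a mirrored SLOG is a one-liner (a sketch; the pool name and the GPT labels are placeholders):
Code:
zpool add tank log mirror gpt/sloga gpt/slogb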

This makes me want to make a feature request for a global SLOG, which could be added to any pool.
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
I hadn't thought of a global resource, but your point is valid. That's the difference between mirror, stripe, and striped mirror if you want to throw more devices at it. A striped mirror would be pretty robust.

Overall, we worry about losing a drive during that unexpected power-loss scenario. If I have an unexpected shutdown/power loss and manage to blow out an SSD that is part of a striped pair, that's exactly when I restore from all those backups I religiously do, and buy a lottery ticket.

How about the drive partitioning? At the moment I am more interested in getting the drives partitioned correctly than in the reliability debate.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd personally avoid the extra complexity. Your gains are dubious and the risks hard to really quantify.
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
I'd personally avoid the extra complexity. Your gains are dubious and the risks hard to really quantify.
Right, but does anyone know how to use gpart to create the two partitions correctly, so I at least get that part done right? :)

Do I need to do anything different other than:
Code:
gpart create -s GPT daXX
gpart add -t freebsd -a 4k -s 8G daXX (for 8G SLOG)
gpart add -t freebsd -a 4k -s 8G daXX (for 8G SLOG)


I'm not going to be talked out of spending time doing something I consider fun...I guess I should have just asked the question about using gpart and said the man page wasn't clear enough for me :)
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Yes, it should be as simple as creating 2 partitions, then assigning them to your pools.

The GUI may not show the change, (like some other command line work), but a simple reboot should take care of that.
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Code:
gpart create -s GPT daXX
gpart add -t freebsd -a 4k -s 8G daXX (for 8G SLOG)
gpart add -t freebsd -a 4k -s 8G daXX (for 8G SLOG)

Although this is not recommended, I am curious to see your results. Please benchmark thoroughly before adding these SLOG devices, and include benchmarks of the recommended single-pool SLOG configuration so the comparison is meaningful.

That said, suggestions:
ZFS will currently use 16G of a SLOG device, so make the partitions that large.
Also, the partition type should be freebsd-zfs.
For alignment and possibly block erase efficiency, I would align to 1M, not just 4K.
And always, always set a GPT label. It costs nothing and can help identification later.

So (again, not a recommendation to try shared SLOGs, but just to try this test fairly):
Code:
gpart create -s GPT daXX
gpart add -t freebsd-zfs -a 1m -l sloga -s 16G daXX (for 16G SLOG)
gpart add -t freebsd-zfs -a 1m -l slogb -s 16G daXX (for 16G SLOG)
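
Attaching the partitions afterwards would then look something like this (again, a sketch of the test rather than a recommendation; pool1/pool2 are placeholder pool names, and it assumes the second SSD was partitioned the same way with labels slogc and slogd):
Code:
zpool add pool1 log gpt/sloga gpt/slogc  (striped across both SSDs)
zpool add pool2 log gpt/slogb gpt/slogd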
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
Thanks for the code, that finally works perfectly: two partitions created.

Yeah, I want to do a couple of different tests, some more geared to a workload and some synthetic.

1. dd, just to post raw numbers. It wouldn't reflect a real workload, but it could be used to identify whether performance/latency changes between a shared log, a dedicated log, and no log (with sync forced); see the sketch after this list.

2. VMware IO Analyzer VM with a 20GB data drive and its SQL load (16K, 66% read, 100% random). It gets cached after running for a little while, but I can replicate the test taking everything into account, and it is more similar to a real workload than anything else.
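
For the dd pass, something like the following forces every write through the ZIL/SLOG (a sketch; the pool name, mount path, and sizes are placeholders):
Code:
zfs set sync=always pool1                                  # force sync writes through the SLOG
dd if=/dev/zero of=/mnt/pool1/ddtest bs=128k count=80000   # ~10GB of sequential writes
zfs inherit sync pool1                                     # restore the previous sync setting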

We'll see what happens.
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
Although this is not recommended, I am curious to see your results.

Since you were interested: preliminary data shows a benefit to striping/sharing the SLOG, right up until both pools writing to it hit the SLOG device's maximum throughput. Staying under the throughput limit, the write latency on each partition hangs out at 0.1 ms (normal for a dedicated SLOG on my system). Reaching or pushing past the throughput limit sends the device latency up to anywhere from 3-6 ms according to gstat. I have to test that response when dedicated and see whether it matches or is indeed higher.
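
For anyone repeating this: those per-partition latency numbers come from the ms/w column in gstat, and a name filter narrows the view to just the SLOG partitions (a sketch; the device names are placeholders):
Code:
gstat -f 'da1p|da2p'   # watch ms/w on the SLOG partitions under load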

It's very difficult to test this realistically because of caching. I'm still working on it. It's hard to separate throughput from response time, and harder still to show a meaningful increase or decrease in response times for the applications/VMs that rely on the underlying SLOG device for acknowledgement. I hate those 'feeling'-type test conclusions people like to cite.
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
For anyone interested, I posted the results of my shared-SLOG testing. I'm sure, given the collection of expertise and knowledge among this forum's users, there will be some opinions on better ways to have shown or tested this, but I hope it simply shows one possible, albeit unlikely, solution for increasing performance where suitable.
 

Attachments

  • Shared-SLOG-Testing.pdf (159.5 KB)

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
For anyone interested, I posted the results of my shared SLOG testing. I'm sure with the collection of expertise and knowledge of this forum's users, there will be some opinions of better ways to have shown or tested this, but I hope it simply shows one possible, albeit unlikely solution for increasing performance when suitable.
I am interested in this, but it looked to me as though all the tests were done with a configuration other than what was discussed in this thread. In the thread you discussed Intel S3700s, but you state S3500s in your PDF. I was also expecting to see test results with no SLOG, with a single dedicated SLOG, with the striped SLOG, and then with the striped SLOG shared with another pool. Seeing all scenarios would provide a clear picture of the change in functionality. The results you provided make it look like you do lose performance when sharing the SLOG between two pools, unless I am misinterpreting these numbers.
I hate those 'feeling' type test conclusions people like to cite.
What is your opinion of the results?
 

sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
Well, you're right, I didn't have the two S3700s until recently, so I'd have to say all the testing was done with S3500s; you'd expect relatively similar results when redoing the testing with S3700s. Sometimes it's all relative, sometimes not.

As far as the testing selected, I didn't set out to compare which setup was better between a striped SLOG and a single device. I'm fairly confident I can predict which would be better in that scenario :)

I set out to test my system (which had two pools, each with an S3500 SLOG and a generic SSD cache) and see if I could get better results (for my VMs) by combining the SLOGs and cache drives into striped configs shared between both pools. The results, as I interpreted them, showed less write latency for both pools when simultaneously running heavy test loads (the VMware IO Analyzer SQL 16K and Exchange 8K tests).

The way I see it, if sharing the two devices between two pools had caused increased latency because each pool had to compete with the other for writes to the SLOGs, then the VMs would have shown increased write latency. They did not; they showed less. So apparently (my interpretation) the pools did not have to compete with each other enough to cause an increase in latency, and the striping of the SLOGs allowed each pool to get the job done (slightly) faster. One could imagine the performance increase would be even better with only one pool writing to the striped SLOG, as mentioned above.

Have you thought about doing some similar testing? I'd be interested in your results, to compare the relative reaction on another system.
 


sfcredfox

Patron
Joined
Aug 26, 2014
Messages
340
What is your opinion of the results?
I didn't 'feel' any difference at all. It's pretty hard to perceive a change in write latency as small as what was seen.

I believe the data quantifies a reduction in write latency for the VMs, and thus shows that "sharing SLOG devices between pools" doesn't guarantee a negative impact on every system; on some systems, it could actually improve performance. Make sense?
 
Joined
Dec 29, 2014
Messages
1,135
I know this thread has sat idle for a bit, but I used the info in it for my backup FreeNAS. I added a 280G Optane 900P to it, and I wanted to share that as the SLOG for the two pools I have in external arrays. Here is what I did.

Code:
# 3 way split of Intel Optane 280G

gpart destroy -F nvd0 # just making sure

gpart create -s GPT nvd0
gpart add -t freebsd-zfs -a 1m -l sloga -s 88G nvd0
gpart add -t freebsd-zfs -a 1m -l slogb -s 88G nvd0
gpart add -t freebsd-zfs -a 1m -l slogc nvd0 # takes the rest, which is ~85G

zpool add TEST2 log nvd0p1
zpool add TEST3 log nvd0p2


slogc/nvd0p3 is in reserve right now in case I add another pool in the internal drive bays. The partition sizes are a lazy attempt at splitting the device three ways. I seem to recall some mention that FreeNAS/ZFS would only use a certain amount of SLOG space, so having one that was really big was overkill/wasted. Did I hallucinate that?

Edit: I didn't hallucinate it. I just didn't bother to scroll back to where @wblock mentioned a 16GB limit for a SLOG device. Is that likely to increase at any point?

Edit 2: I wasn't able to add the partitioned SLOG from the GUI. That is why I did it from the CLI.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Edit: I didn't hallucinate it. I just didn't bother to scroll back to where @wblock mentioned a 16GB limit for an SLOG device. Is that likely to increase at any point?

You're quick on the edit button today; I was about to pull-quote wblock's post above, and when I went to quote yours you'd thrown that in there.

Re: the 16GB limit, I was under the impression that the maximum amount of "dirty" data that ZFS will allow before blocking writes entirely (i.e., your maximum useful SLOG size) is limited by default to 4GB or 1/10th of your system RAM, whichever is lower. You can look at vfs.zfs.dirty_data_max to see what yours is.
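
Checking it is a one-liner from a shell (the value shown is just a hypothetical example):
Code:
sysctl vfs.zfs.dirty_data_max
vfs.zfs.dirty_data_max: 4294967296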

The old tunables (write_limit_*) are deprecated.

Edit: Which system is this going in? The layout/pool count makes me think system #2; is it just not updated to show the Optane SLOG yet? I recall you saying you just bought another one on sale.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I know this thread has sat idle for a bit, but I used the info in it for my backup FreeNAS.
Just to throw some examples/numbers out there for the gratification of anyone else who comes along wanting to do this.
I used four SAS SSDs like this:
Code:
Vendor:  HITACHI
Product: HUSSL4010BSS600
With them I created both SLOG and L2ARC partitions on each using these commands:
Code:
gpart create -s gpt da0
gpart add -i 1 -b 128 -t freebsd-swap -s 8g da0
gpart add -i 2 -t freebsd-zfs -s 16g da0
gpart add -i 3 -t freebsd-zfs da0

gpart create -s gpt da1
gpart add -i 1 -b 128 -t freebsd-swap -s 8g da1
gpart add -i 2 -t freebsd-zfs -s 16g da1
gpart add -i 3 -t freebsd-zfs da1

gpart create -s gpt da2
gpart add -i 1 -b 128 -t freebsd-swap -s 8g da2
gpart add -i 2 -t freebsd-zfs -s 16g da2
gpart add -i 3 -t freebsd-zfs da2

gpart create -s gpt da3
gpart add -i 1 -b 128 -t freebsd-swap -s 8g da3
gpart add -i 2 -t freebsd-zfs -s 16g da3
gpart add -i 3 -t freebsd-zfs da3
Ignore the swap partition because it isn't really relevant. I just included exactly what I did for completeness.

I then grepped for the gptids with the following commands:
Code:
glabel status | grep da0

glabel status | grep da1

glabel status | grep da2

glabel status | grep da3
With the gptid for each of the partitions, I used the following code to add both SLOG and L2ARC to my pool:
Code:
zpool add Emily log mirror gptid/9b563f1a-beb9-11e8-b1c8-0cc47a9cd5a4 gptid/9b798237-beb9-11e8-b1c8-0cc47a9cd5a4
zpool add Emily log mirror gptid/9b9ad00a-beb9-11e8-b1c8-0cc47a9cd5a4 gptid/9bbe5e8b-beb9-11e8-b1c8-0cc47a9cd5a4
zpool add Emily cache gptid/9bc78e88-beb9-11e8-b1c8-0cc47a9cd5a4
zpool add Emily cache gptid/9ba58419-beb9-11e8-b1c8-0cc47a9cd5a4
zpool add Emily cache gptid/9b844484-beb9-11e8-b1c8-0cc47a9cd5a4
zpool add Emily cache gptid/9b602cb7-beb9-11e8-b1c8-0cc47a9cd5a4
I tested my results by copying files, so it's not very scientific, but these are the results. I think part of the maximum speed I am seeing is a limitation of my home-brew network switch, or just the maximum capability of my pool, but I have not done enough testing to isolate the issue, and I am fairly well satisfied with the results.
With sync set to standard, when I would copy files to the NAS, the speed of the copy would ramp up like this:
[Image: RAM Disk to FreeNAS sync standard.PNG]
That isn't too bad, but I wanted to see the results with the SLOG enabled, so I set the pool to sync=always. The speed copying files was better because it started fast and stayed pretty decently fast:
[Image: RAM Disk to FreeNAS sync always.PNG]
I also wanted to test how much, if at all, the L2ARC affected copying files from the FreeNAS to my system. I know that the data needs to be loaded into L2ARC before it has any effect on speed, so I copied the same data repeatedly from the FreeNAS to my system, and the results never increased beyond this level:
[Image: Speed to RAM Disk.PNG]
The "from Linux Laptop" is the name of a directory on my NAS, because the files were copied onto the NAS from my Linux laptop. My filing system needs work, but the speed was good. I don't know that the L2ARC made much difference, though, because before I added it I was getting speeds like this:
[Image: FreeNAS to RAM Disk.PNG]
With what I know about how the L2ARC works, I didn't really expect it to be a big deal; I just wanted to document this for anyone else who might want to know.

Then I removed all the SLOG partitions and L2ARC partitions and did it all over using my NVMe drive:
Code:
Model Number: INTEL SSDPEDMD400G4

Mostly this was to get the 'numbers' out there for others to see, but I was also curious what difference there would be between the two options.
So I formatted the NVMe drive much like I had the SAS SSDs, using the following commands:
Code:
gpart create -s gpt nvd0
gpart add -i 1 -b 128 -t freebsd-swap -s 16g nvd0
gpart add -i 2 -t freebsd-zfs -s 20g nvd0
gpart add -i 3 -t freebsd-zfs nvd0
Then I grepped for the glabel:
Code:
glabel status | grep nvd0
Then I used that to add the SLOG and L2ARC to the pool:
Code:
zpool add Emily log gptid/ae487c50-bec3-11e8-b1c8-0cc47a9cd5a4

zpool add Emily cache gptid/ae52d59d-bec3-11e8-b1c8-0cc47a9cd5a4
In testing, I already knew about performance without a SLOG, so I first tested copying files from my computer to the NAS with sync set to always, using the NVMe drive:
[Image: Desktop to FreeNAS sync always NVMe.PNG]
This started out significantly faster than the SAS SSDs did, and even though it tapered off during the transfer, it finished at a higher speed than the SAS drives ever achieved. I like that.

My testing may not be comprehensive or scientific, but I didn't see any real change when copying files from the NAS to my desktop, so I feel the limitation is somewhere other than the one device I was trying to test (the NVMe drive), and I didn't have any more time to fiddle with it this weekend.

I hope this helps enlighten someone about the joys of SLOG and L2ARC.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
This sort of concurs with my results, where I found I hit a bottleneck even with a RAM SLOG.

I’d like to determine what the bottleneck was, but there are always more important things to do :)

FWIW, I think it may be memory bandwidth or clock speed.
 
Joined
Dec 29, 2014
Messages
1,135
Edit: Which system is this going in? The layout/pool count makes me think system #2, is it just not updated to show the Optane SLOG yet? I recall you saying you just bought another one on sale.

You are correct, sir! The first system already has an Optane in it. I just haven't updated my sig yet.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You are correct, sir! The first system already has an Optane in it. I just haven't updated my sig yet.
I'd suggest removing the partitions and doing a secure-erase on the Optane drive (for its internal wear-leveling purposes), then slicing it into three partitions of 8GB each. Optionally, you could set the MaximumLBA setting on the drive itself to 24GB after the secure-erase (3 partitions of 8GB); then you'd guarantee a lot of clean, shiny blocks for wear-leveling over time.
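
If memory serves, Intel's isdct utility handles the MaximumLBA change; something along these lines (syntax from memory, so treat it as a sketch and check the tool's docs; the drive index and percentage are placeholders):
Code:
isdct set -intelssd 0 MaximumLBA=10%   # roughly 28GB of a 280GB drive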

As far as "are you using more than X GB of your SLOG" - let's get our hands dirty.

Once you've attached your new SLOG partitions (or using the ones you have now) create a file dirty.d using vi or nano and dump this code in there:
Code:
/* fires on every transaction group sync */
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

/* only report for the pool named in the first script argument */
txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}


Then, from a shell, run dtrace -s dirty.d YourPool and wait. You'll see a bunch of lines that look like:

Code:
dtrace: script 'dirty.d' matched 2 probes
CPU	 ID					FUNCTION:NAME
  4  56342				 none:txg-syncing   62MB of 4096MB used
  4  56342				 none:txg-syncing   64MB of 4096MB used
  5  56342				 none:txg-syncing   64MB of 4096MB used


Hammer your pool with some write load and watch the first number: that's how much data your SLOG is holding. You could increase that amount via tunables, but if you're routinely holding a lot of dirty data, it would indicate that your vdevs aren't fast enough to keep up with the incoming txgs.
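
If you do want to raise it, it's a regular sysctl (a sketch; the 8GiB value is only an example, and you'd set it as a tunable to survive reboots):
Code:
sysctl vfs.zfs.dirty_data_max=8589934592   # 8GiB, runtime change only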
 