SLOG not changing disk performance of volume during testing

Status
Not open for further replies.

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Adding a SLOG won't speed up your writes... Here are some interesting articles about the SLOG, key quote from the first one:


http://nex7.blogspot.com/2013/04/zfs-intent-log.html
https://forums.freenas.org/index.php?threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/
Ok thanks, in which case I won't bother testing too much more with the SLOG for this particular query.

So how does write caching actually flow? I assume it gets stored in RAM, then flushed to disk? Does the flush happen sync or async?
Should I not be getting more than 450MB/s if the system RAM is the write cache? Again the fact this figure doesn't change if I add more disks confuses me!
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Ok thanks, in which case I won't bother testing too much more with the SLOG for this particular query.

So how does write caching actually flow? I assume it gets stored in RAM, then flushed to disk? Does the flush happen sync or async?
Should I not be getting more than 450MB/s if the system RAM is the write cache? Again the fact this figure doesn't change if I add more disks confuses me!
Read the two articles for answers to most of your questions... Basically, the SLOG is only useful for synchronous writes: ZFS logs sync writes to the SLOG device as they arrive, while the pending data sits in RAM waiting to be flushed to the pool as part of a transaction group. Logging to the SLOG puts a copy of the data onto a non-volatile device, so it can't disappear in a power outage before the transaction group is written out. The SLOG device itself also needs power protection, again, for the reasons described very well in those two links (and elsewhere here on the forum).

There are really only a few situations where a SLOG is useful: heavy databases, storage for virtual machines, etc. In any case, you will always get better write performance without a SLOG, in my experience. I only use one on my VM datastore, and I don't really need it, as I'm just running a home lab with no critical/enterprise applications or data at stake.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
So an excerpt from one of the articles on these forums:
A ZFS system has a potentially much larger write cache, but it is in system RAM, so it is volatile. The ZIL provides a mechanism to implement nonvolatile storage. However, it is not a write cache! It is an intent log. The ZIL is not in the data path, and instead sits alongside the data path. ZFS can then safely commit sync writes to the ZIL while simultaneously disregarding sync and aggregating the writes in the normal ZFS transaction group write process.

So my RAM is my write cache. The system has 32GB RAM, so a 1GB test file should fit in this cache and should fit in one transaction group? It says that the sync write can be committed to the ZIL, then aggregated into the normal transaction group process, which to me sounds like the write is cached in RAM, committed to the ZIL (at which point the client should report the write complete), then the ZIL gets flushed to disk during the next transaction group processing?

So I would expect my RAM to be able to store a 1GB file quicker than 450MB/s, with or without the SLOG. If sync=disabled, then once the data is in RAM the client should report the write is complete, and if I have a SLOG and sync=always, the client should report the write is complete once it is written to the SLOG.
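
To put rough numbers on that expectation, here is a quick back-of-envelope sketch in Python (nominal figures only; the 5-second value is the default ZFS transaction group interval discussed later in the thread, and 1250MB/s is the raw 10GbE line rate):

Code:
# Back-of-envelope timing for a 1 GB sequential write test.
# Nominal figures only; protocol overhead is ignored.

file_size_mb = 1024          # 1 GB test file
observed_mb_s = 450          # write speed the benchmark reports
wire_speed_mb_s = 1250       # raw 10 GbE line rate (~1.25 GB/s)

print(f"at the observed 450 MB/s: {file_size_mb / observed_mb_s:.2f} s")
print(f"at 10 GbE line rate     : {file_size_mb / wire_speed_mb_s:.2f} s")
# ~2.3 s vs ~0.8 s; either way the whole file fits comfortably inside
# a single default 5-second transaction group, which is why a RAM-backed
# write cache "should" be able to absorb it faster than 450 MB/s.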
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
I have moved the pool to a 6 disk stripe with sync disabled and no SLOG... exact same performance results.

What else do I need to look at, as these results just don't seem to make any sense!
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I have moved the pool to a 6 disk stripe with sync disabled and no SLOG... exact same performance results.

What else do I need to look at, as these results just don't seem to make any sense!
Again, a SLOG device is not going to improve performance; it's strictly there for safety.

Fastest write performance will be with no SLOG device and asynchronous writes, at the cost of possible data loss in the event of device failure or power loss.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Yes, but my question now is: why is my performance not increasing when moving from a 4 disk mirrored vdev layout to a 6 disk stripe with sync disabled?
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Yes, but my question now is: why is my performance not increasing when moving from a 4 disk mirrored vdev layout to a 6 disk stripe with sync disabled?
What is your network speed? Unless you're running at something faster than Gb Ethernet, ZFS can easily saturate your network with either of the pool configurations you're testing.
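
As a rough sketch of the ceilings involved (nominal figures in Python, ignoring iSCSI/TCP overhead):

Code:
# Approximate throughput ceilings for the two common link speeds.
# Nominal figures; real iSCSI/TCP overhead will shave some off.

gbe_mb_s = 1_000_000_000 / 8 / 1_000_000       # ~125 MB/s for 1 GbE
ten_gbe_mb_s = 10_000_000_000 / 8 / 1_000_000  # ~1250 MB/s for 10 GbE

print(f"1 GbE ceiling : ~{gbe_mb_s:.0f} MB/s")
print(f"10 GbE ceiling: ~{ten_gbe_mb_s:.0f} MB/s")
# Even a small pool of spinning disks can saturate 1 GbE, so pool layout
# only starts to matter for sequential numbers once the link is 10 GbE.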
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
The FreeNAS install is itself a VM with an LSI HBA passed through to it. As such, it is hosting a datastore for ESXi over an internal vswitch with 10GbE adapters.
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
I am not using NFS, I am using iSCSI. As such the service isn't running, but I checked anyway and it has 4 as the number.


Sorry, when I read the thread I thought I saw that you were testing both.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Sorry, when I read the thread I thought I saw that you were testing both.
No, my fault; I should have been a bit clearer about my setup before asking questions.
I'm just using iSCSI, as I knew it would also require me to test out the sync options, which I needed to familiarise myself with.

Now if I can just figure out my write speed query I'll be golden.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Are you in need of high write THROUGHPUT or IOPS? You can't have both unless you're going all flash. With sync=always there is NO write cache. Writes come in and get written to the ZIL (ZFS Intent Log); by default the ZIL is on the same zpool as your zvol, so your sync writes happen twice: once to the ZIL in the order they are received, and once to the zvol in an optimised way as a transaction group (TXG). A SLOG is a "Separate LOG". This means you can write to the ZIL and the vdevs at the same time. In a "normal" VM environment these writes will be very random and slow as heck on spinners. This is why we use SSDs for the SLOG (among other reasons). SSDs perform extremely well with small random IO, so they can "absorb" the randomness of the incoming IO while the optimised TXGs get written out to the spinners.

Sync IO Path:
VM does some stuff.
Sync writes go to RAM and to the ZIL (either on the SLOG or on your zpool, wherever your ZIL is).
Only once the write to the ZIL is confirmed can the next IO be processed.
Once 5 seconds (the default) or a set size is reached, the transaction group is closed, reordered for optimised writes, and written to disk FROM RAM. This means your SLOG is only ever read if there is a crash; it also means you only use at most a couple of GB of your SSD.
As you can see, this is still limited to the speed of your spinning disks, except we get to write under ideal conditions: large, (mostly) sequential writes.

Async IO Path:
VM does some stuff.
Async writes go to a transaction group in RAM (still a ZIL of sorts).
The write is "confirmed" even though it's not on disk, and the next IO can be processed.
Once 5 seconds (the default) or a set size is reached, the transaction group is closed, reordered for optimised writes, and written to disk from RAM. The rough sketch below illustrates when the client gets its acknowledgement under the two paths.
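
Here is a toy model of that difference in Python. It is not ZFS code and the latency figures are made-up placeholders; it only illustrates that the sync path waits on the ZIL/SLOG for every IO while the async path acknowledges from RAM:

Code:
# Toy model of when the client gets its write acknowledged.
# Not ZFS code; the latencies below are made-up placeholders.

RAM_ACK_MS = 0.05    # async: acknowledged as soon as the write lands in RAM
SLOG_ACK_MS = 0.2    # sync: acknowledged only after the ZIL/SLOG write completes

def total_ack_wait_ms(num_ios: int, sync: bool) -> float:
    """Time spent waiting on acknowledgements if IOs are issued serially."""
    per_io_ms = SLOG_ACK_MS if sync else RAM_ACK_MS
    return num_ios * per_io_ms

ios = 32_768  # a 1 GiB file issued as 32k writes
print(f"async ack wait: {total_ack_wait_ms(ios, sync=False):.0f} ms")
print(f"sync  ack wait: {total_ack_wait_ms(ios, sync=True):.0f} ms")
# Either way the data still reaches the pool later, when the transaction
# group closes (every ~5 seconds by default) and is flushed from RAM.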

Yes, but my question now is: why is my performance not increasing when moving from a 4 disk mirrored vdev layout to a 6 disk stripe with sync disabled?
What are your numbers on this? When you say striped, do you mean raidz1/2/3 or plain-jane striping for performance benchmarking? Also, are you using VMXNET3 for your VM NICs? What does your CPU usage look like during the test? Is one core (on the ESXi host) pegged?
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
On some vdevs I need throughput, and on some I need IOPS. I am planning a mirrored set of vdevs with a SLOG and sync=always for VMs/IOPS, and then RAIDZ with sync=disabled for where throughput is the concern.

So with sync writes, if we are talking about a 1GB test file being written sequentially, and being written to RAM first, surely this should happen in under 5 seconds, at which point the VM or client OS etc. confirms the file as written. So as far as the OS is concerned, job done, but then FreeNAS writes from RAM to disk, which is invisible to the client OS?

Same principle for async, where the write to RAM of a 1GB file should happen pretty damn quick, the client OS reports job done, but the actual write to disk then happens on the backend in FreeNAS?

As for my test: my speed for a sequential read of a 1GB test file is about 900MB/s, with writes at 450MB/s. This is the same when using 4 disks as mirrored vdevs, 6 disks as mirrored vdevs, or my current test of a plain old 6 disk stripe (not RAIDZ).
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
What kind of impact would people expect the block size and sparse option to have on throughput?
The default options are non-sparse and 32k. This is where I see my above mentioned numbers.

I have then tried 32k sparse, 64k sparse and 64k non-sparse, and all give me more like what I would expect, with about 900MB/s reads and 1GB/s writes.

Screenshot uploaded of these results. All zvols on the same physical 6 disk stripe.
 

Attachments

  • Untitled.png (241.1 KB)

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
What kind of impact would people expect the block size and sparse option to have on throughput?
The default options are non-sparse and 32k. This is where I see my above mentioned numbers.

I have then tried 32k sparse, 64k sparse and 64k non-sparse, and all give me more like what I would expect, with about 900MB/s reads and 1GB/s writes.

Screenshot uploaded of these results. All zvols on the same physical 6 disk stripe.
Honestly, I think your (enviable!) results are about as good as you're going to get here in the 'Real World'. :D
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Are these results really that enviable? I am certainly very happy with the 1GB/s-both-ways results, but I would love to understand why the 32k sparse, 64k and 64k sparse options have write speeds 2.5x that of plain old 32k :P
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
So with sync writes, if we are talking about a 1GB test file being written sequentially, and being written to RAM first, surely this should happen in under 5 seconds, at which point the VM or client OS etc. confirms the file as written. So as far as the OS is concerned, job done, but then FreeNAS writes from RAM to disk, which is invisible to the client OS?

Same principle for async, where the write to RAM of a 1GB file should happen pretty damn quick, the client OS reports job done, but the actual write to disk then happens on the backend in FreeNAS?

As for my test: my speed for a sequential read of a 1GB test file is about 900MB/s, with writes at 450MB/s. This is the same when using 4 disks as mirrored vdevs, 6 disks as mirrored vdevs, or my current test of a plain old 6 disk stripe (not RAIDZ).
This is what's tripping you up. When the VM issues a write, sync or not, it will not continue until the write request is fulfilled. In the case of async, this happens in the write cache (typically RAM). In the case of a VM-level async write with underlying ZFS sync=always, the host will not get the green light until that write hits the ZIL; yes, that write stays in RAM and that's your working copy, but in sync mode it always waits for the ZIL. No matter what, you will not sustain writes faster than your spinning disks can do sequentially, or your network link, whichever is slower. There are other bottlenecks, but with what you're doing they're not relevant.

EDIT: Compared to 64k, 32k has double the overhead at every step of the way. Try doing the same test with 16k; I would bet it's roughly half the speed. This is where you need to look at IOPS. It may be processing the same number of IOPS, but the packets are smaller. And yes, I said packets. Storage and networking performance have WAY more in common than most people realise: PPS and IOPS are basically the same thing, just like throughput and throughput, haha der. In networking you have buffers measured in bits; in storage you have IO queues based on the number of IO operations instead of IO size.
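
Rough numbers to make that concrete (simple arithmetic in Python, not a measurement; the IOPS ceiling is a hypothetical figure picked so that 32k works out to roughly the 450MB/s you are seeing):

Code:
# If the pool/host path tops out at a fixed number of IOPS, throughput
# scales with block size. Simple arithmetic, not a measurement.

iops_ceiling = 14_400  # hypothetical ceiling, chosen so 32k ~= 450 MB/s

for block_kib in (16, 32, 64):
    mb_s = iops_ceiling * block_kib * 1024 / 1_000_000
    print(f"{block_kib:>2}k blocks at {iops_ceiling} IOPS -> ~{mb_s:.0f} MB/s")
# 16k -> ~236 MB/s, 32k -> ~472 MB/s, 64k -> ~944 MB/s: halving the block
# size roughly halves throughput at the same IOPS, which is the "double
# the overhead" point above.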

You can tune your TXG size/time to accommodate larger bursts of traffic, but that gets fairly deep into the weeds: you MUST have a total understanding of your workload's exact characteristics to even think about it, otherwise you will destroy your performance and potentially leave yourself open to lots of data loss.
I have then tried 32k sparse, 64k sparse and 64k non sparse, and all give me more like what I would expect with about 900MB/s read and 1GB/s writes.
This is one case where VAAI shows some real benefit. When writing sparse blocks, ESXi is telling FreeNAS "write a block of all 0s" instead of "write a block '0000000000000000.......'". All the work is being done on FreeNAS and there is no host/network IO overhead. Again, you're not writing out the entire file, just saying "here's the pattern, do the work for me."
Edit 2: This may even extend into the disk controller and the disks themselves, though that may be SAS and not SATA. I usually don't go that far down the storage stack.
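
A toy comparison in Python of what has to cross the wire in the two cases (the 512-byte command size is a made-up placeholder, not the real VAAI/SCSI format):

Code:
# Toy comparison of what crosses the network to zero a 1 GiB region.
# Not the real VAAI/SCSI command format; the command size is a placeholder.

region_bytes = 1 * 1024**3          # 1 GiB region to be zeroed

explicit_payload = region_bytes     # ship every zero byte over the wire
offload_payload = 512               # hypothetical "zero this range" command

print(f"explicit zero write : {explicit_payload:,} bytes on the wire")
print(f"offloaded zero write: {offload_payload:,} bytes on the wire")
# With the offload, the host only describes the pattern and the storage
# side does the work, so host/network throughput stops being the limit.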
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
If you want some baller benchmarks, try ATTO and turn off Direct I/O. THAT will enable VM or OS level caching and give you mega big awesome tremendous numbers. Also make sure your benchmark disk in your VM is paravirtualized and not just a SCSI disk. This can greatly reduce latency and help a bit with IOPS too.


EDIT: Also, the benchmark numbers you're looking at, the 1GB/s, will almost never happen.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
My head hurts
Yeah, there are lots of layers to storage, especially when using virtualized ZFS to back virtual machines.

Just take your time and learn one section at a time. Once you have a good grasp (and can explain it to others), move on to the next and see how they affect each other. That post from STUX, while not super technical, is a great starting point for SLOG performance expectations.

100MB/s sustained with a low queue depth is not bad by any means for a homelab. The SANs we have at work have 2-3 24-bay shelves of SSDs for caching and still only have two 8Gbps fiber uplinks per shelf. It's all about IOPS and keeping latency down under load.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
In the case of async, this happens in the write cache (typically RAM). In the case of a VM-level async write with underlying ZFS sync=always, the host will not get the green light until that write hits the ZIL; yes, that write stays in RAM and that's your working copy, but in sync mode it always waits for the ZIL. No matter what, you will not sustain writes faster than your spinning disks can do sequentially, or your network link, whichever is slower. There are other bottlenecks, but with what you're doing they're not relevant.

Ok, is the ZIL only used for sync writes? If so, let's ignore it for the moment, as I am now doing my testing without a SLOG to try and determine the max performance of the underlying disks/write cache before adding a SLOG, and as such have sync=disabled set on the zvol. With that being the case, you are saying that the write happens in the write cache, which should be my RAM. If I am writing a sequential file which is relatively small (1GB) and fits into this cache, it should happen quickly as RAM is fast. If it can do it quickly enough, the speed of the underlying disks shouldn't matter, as that happens after the VM has been told the write is complete? If 1GB can fit in my cache, then surely the write speed reported should be higher than 450MB/s, as it should be written to RAM quicker than that?

So example test:
File sent from VM/ESXi to FreeNAS over 10GbE, which should really be limited to 1GB/s in terms of network
FreeNAS should be able to write this into RAM quickly (under 5 seconds?)
Once in RAM, the VM should consider the write complete
The data is then written to physical disks from RAM

EDIT: Compared to 64k, 32k has double the overhead at every step of the way. Try doing the same test with 16k; I would bet it's roughly half the speed.

Ok, so if 32k maxes writes out at 450MB/s, the only reason that writes are faster when still at 32k but sparse is that ESXi has less to do? If my 32k sparse and 64k sparse/non-sparse options all produce the same results, is it better to stick to 32k for better IOPS?

I want to make sure I can get the physical disks/pool configured correctly for my workloads, so that performance is not something I have to tinker with down the line while I potentially have the server in live usage. Toying with settings to find my optimums is easier to do now while I can destroy and rebuild the volumes as many times as I like.
 