
Write caching

Joined
Aug 22, 2019
Messages
8
Thanks
0
#1
I have a freshly built FreeNAS server with a 12x10TB array of WD Gold drives, running on an EPYC CPU with 128GB RAM and a 40Gb/s network link for storage access.

The RAID configuration is RAIDZ3. Read performance is where it is expected to be, and writes are where they are expected - pretty low, around 280MB/s.

I want to boost write performance as much as possible, and the typical solution is to add a non-volatile fast caching drive to hold the data until it is written to disk.
I have a quad NVMe PCIe card with 4x 512GB Samsung 970 Pro NVMe drives; I can create a RAID10 array and use it as a write cache.

I had this running well on a QNAP; how complex will it be with FreeNAS?
 
Joined
Oct 18, 2018
Messages
433
Thanks
206
#2
Hi @propersystems welcome to the forums.

I want to boost write performance as much as possible, and the typical solution is to add a non-volatile fast caching drive to hold the data until it is written to disk.
I assume you're talking about adding a SLOG device to your pool. If you are, it is worth noting that a SLOG device is not a write cache. ZFS has what is called the ZIL, or ZFS Intent Log. It stores transaction groups that have not yet been fully committed to the pool. In normal operation the data on the ZIL is ignored, because the system eventually writes the data to the pool and the ZIL's data is not required. In the event of an untimely power outage or other interruption, ZFS will check the ZIL for any outstanding data and write it to your pool to prevent data loss.

For async writes your system will report back to the client that the write has been successful when the data is in memory, meaning a power loss could cause it to be lost since it has not yet been written to any permanent storage. For sync writes however your system will not report that a write has been completed until it has made it to permanent storage, in this case the ZIL. You can see then that writes will be very slow for sync writes to a pool composed of spinning rust where the ZIL lives in the pool. To help alleviate this issue you can add a SLOG device, which is a dedicated ZIL device for a pool. If your SLOG device is super fast and has power loss protection it can serve the data integrity purpose of the ZIL and help improve your sync write performance. It will not improve your async write performance for the reasons mentioned previously.
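To make that concrete, here is a minimal sketch of adding a mirrored SLOG to an existing pool (the pool and device names are hypothetical):

```shell
# Add a mirrored SLOG (dedicated ZIL device) to the pool "tank".
# nvd0/nvd1 are hypothetical NVMe device names on FreeBSD.
zpool add tank log mirror nvd0 nvd1

# The log vdev should now appear in the pool layout.
zpool status tank
```

Mirroring the SLOG is optional, but it avoids falling back to the slower in-pool ZIL if the log device dies.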

There are posts abound about the ZIL and SLOG, what they are, ideal devices, etc that are more in depth and technically correct than my short summary above.

I have a quad NVMe PCIe card with 4x 512GB Samsung 970 Pro NVMe drives
These would not be good SLOG devices. They do not have Power Loss Protection and so they defeat the whole purpose of the ZIL/SLOG which is to protect your data in the event of a power loss. If you're considering using these as your ZIL you could get better performance with the same data integrity by forcing all writes to be async writes.
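If you'd rather trade sync-write safety for speed than buy PLP hardware, the per-dataset `sync` property is the knob; a sketch with a hypothetical dataset name:

```shell
# Treat all writes to this dataset as async, skipping the ZIL entirely.
# Same data-loss window on power failure as a non-PLP SLOG, minus the hardware.
zfs set sync=disabled tank/mydata

# Inspect the current setting (standard / always / disabled).
zfs get sync tank/mydata
```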

I can create a RAID10 array and use it as a write cache.
Just a bit of a nit, but FreeNAS uses ZFS, which doesn't have RAID10. ZFS has pools, which are composed of vdevs, which are composed of disks. vdevs are where the redundancy lives: they can be single-disk, mirrored, RAIDZ1, RAIDZ2, or RAIDZ3 vdevs, which tolerate zero, n-1, one, two, and three drive failures respectively before the vdev is lost. If you lose a vdev in a pool, you lose the pool, so redundancy at the vdev level really matters. The RAID10 analogue in ZFS would be a pool composed of two vdevs, each of which is a mirror of two drives. The terminology can matter a lot, since folks often say "RAID1" when they really mean a mirrored vdev.
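That striped-mirrors layout can be sketched as follows (disk names are hypothetical, and `zpool create` wipes the named disks):

```shell
# Pool of two 2-disk mirror vdevs - ZFS's closest analogue to RAID10.
# Writes stripe across both vdevs; each vdev survives one disk failure.
zpool create tank mirror da0 da1 mirror da2 da3
```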

I had this running well on a QNAP; how complex will it be with FreeNAS?
Adding SLOG devices in FreeNAS is very simple once you pick appropriate devices. The User Guide should be helpful in that regard.

Anyway, I hope this helps. I am at work so I couldn't dig up links to the various helpful threads on these topics but if you search for the zfs terminology primer and the zil and slog explained you'll find some great resources out there. Happy to answer any questions or to be corrected if I've made a mistake above.
 
Joined
Aug 22, 2019
Messages
8
Thanks
0
#3
Thank you for quick reply.

I will have to review the info you posted to get a better understanding of the FreeNAS specifics in this area.

It's puzzling to me why data cannot be written to NVMe storage first and then moved to the disk array at a later time.
The NVMe array can take it in at 3GB/s, validate it, and then move it to the disk pool in the background.

I guess this is more of a buffer drive than a cache drive, although files could remain on the NVMe for some time and be requested if needed.
Mirror-protected NVMe should be safe from failure.


I did not know ZFS does not support RAID10.
 
Joined
Oct 18, 2018
Messages
433
Thanks
206
#4
It's puzzling to me why data cannot be written to NVMe storage first and then moved to the disk array at a later time.
The NVMe array can take it in at 3GB/s, validate it, and then move it to the disk pool in the background.
Sorry, I may have done a poor job of explaining. When you use an NVMe device as a SLOG, the data is first written to the NVMe device prior to being written to the pool. However, if your NVMe device does not have power loss protection (PLP), you're leaving yourself vulnerable to data loss and defeating the purpose of the SLOG and ZIL. So, I'm not saying don't use an NVMe device; I'm recommending you use one with PLP.

I guess this is more of a buffer drive than a cache drive, although files could remain on the NVMe for some time and be requested if needed.
Mirror-protected NVMe should be safe from failure.
It isn't really just a buffer either. In fact, as far as I know, the ZIL is not read from in normal circumstances. For sync writes, the system just writes transactions to the ZIL first while it prepares to write the data to the pool. Only if power is lost does the system read the ZIL, ask "what data didn't get committed to the pool before power was lost?", and then write that data to the pool. A SLOG just means a separate ZFS log device - for example, a fast NVMe drive instead of the spinning rust in your pool.

I did not know ZFS does not support RAID10.
ZFS does support a RAID10 analogue. In ZFS, if you wanted something like RAID10 you would create a pool made of two vdevs, and both of those vdevs would be mirrored vdevs composed of two disks. It really is important to keep in mind that pools are made of one or more vdevs striped together, and vdevs are made of one or more disks. A pool is lost if a single vdev within the pool is lost. A vdev is lost if more drives within that vdev are lost than the vdev layout can handle. The vdev layouts are single-disk, mirror vdevs of n disks, RAIDZ1, RAIDZ2, and RAIDZ3; these vdev types can tolerate 0, n-1, 1, 2, or 3 disks lost respectively.

Check out this primer, this description of vdevs, ZILs, SLOG, etc, and this nice terminology primer.

Of course, everything I'm saying is only relevant if what you're referring to originally was in fact a SLOG device. :)
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,067
Thanks
1,040
#5
I want to boost write performance as much as possible
Your pool configuration is not conducive to meeting this goal. A better understanding of your workload might help others in determining a better pool layout.

and the typical solution is to add a non-volatile fast caching drive to hold the data until it is written to disk
In non-ZFS systems, yes, it can be accomplished this way. ZFS uses RAM as its cache and holds data there until it is written.
 
Joined
Aug 22, 2019
Messages
8
Thanks
0
#6
Let me clarify: from what I understand, ZFS has two ways it can write data.
The first method involves receiving data from the application that wants it written, then sending the request to write it and, on completion, validating that the data has been written; only then is the application told that the write has completed.
The second method is a one-way dump without an integrity check. This speeds up the process, because validation is not performed (at least not immediately), but there is a higher chance of data corruption.

In both situations, when write commands are executed they are logged in a file, which allows the engine to "recreate" the sequence of events at a later time and reconstruct any data it finds missing or corrupt.

My understanding is that "buffering" in this case means providing high-speed, battery-protected storage for writing this log file, which can store write commands much faster than the main disk pool. These commands are then read from the log to write the data. Obviously, validation on the pool cannot happen until the data is written to the drives, so there is a higher chance of corruption in this case, and relying on this method absolutely requires battery-protected storage. That makes sense, as a power loss could lead to commands in RAM not being written to the log file while the application remains unaware.

It seems to me that battery-protected log storage is in general a good idea.


The implementation I was looking for is different. My view of a buffer/cache drive is to take in incoming data rapidly and store it, then move it to the main pool later.
This would mean the application makes a write request, the data is written and validated on the buffer drive, and the application is told that the write is complete. We would now have the file on the buffer drive as well as log data to reconstruct it.
Until a copy is made to the main pool, the buffer hosts the file; once the copy to the main pool is made, space on the buffer can be reclaimed.
This would give NVMe write speeds, given that the buffer is sized properly and does not get saturated, while also maintaining data integrity.
This configuration is obviously my view of how I would love for a buffer in FreeNAS to work.

I think I will need to run a separate NVMe pool for things that need speed.
 
Joined
Aug 22, 2019
Messages
8
Thanks
0
#7
Your pool configuration is not conducive to meeting this goal. A better understanding of your workload might help others in determining a better pool layout.


In non-ZFS systems, yes, it can be accomplished this way. ZFS uses RAM as its cache and holds data there until it is written.
Yes, I agree, the drive configuration is not optimal. However, write buffering was my primary goal when I asked for advice on this.

I just started testing this server, and the main pool does not need to be incredibly fast, but I do want a bit higher write speed on the pool itself.
What would your recommendation be? Segmenting it down?

I also read that ZFS uses RAM for caching; to me that would mean "apparent" write speeds should be high, as data is dumped into RAM. This server has 128 gigs, and I tried putting in 512 gigs and it made no difference in write speed.

The workload is mixed: it will host storage for systems that produce a lot of small files for AI training, backups for some systems and VMs, and run a few low-impact VMs.
Reads will involve pulling data for AI training pipelines.
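As a rough sanity check of raw pool write speed with minimal caching effects, a simple dd run works (the path is hypothetical; note that with lz4 compression enabled, /dev/zero compresses away and inflates the numbers, so use an incompressible source or a test dataset with compression off):

```shell
# Write 10GB sequentially to the pool and report throughput.
# A size well above RAM reduces the influence of write buffering.
dd if=/dev/zero of=/mnt/tank/ddtest bs=1M count=10240
rm /mnt/tank/ddtest
```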
 

kdragon75

FreeNAS Expert
Joined
Aug 7, 2016
Messages
2,383
Thanks
554
#8
You're all making good points, but ZFS will always try to slow your effective write speed down to your max sustainable streaming write speed. This is to prevent thrashing, a condition that can cause erratic and unpredictable performance.
 

kdragon75

FreeNAS Expert
Joined
Aug 7, 2016
Messages
2,383
Thanks
554
#9
The best thing you can do for performance is to reconfigure the pool as two raidz2 vdevs of 6 disks each. You will lose one disk of capacity but double your write speed.
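A sketch of that layout for a fresh pool (disk names are hypothetical, and this destroys any existing data on the named disks):

```shell
# Two 6-disk RAIDZ2 vdevs striped together.
# Each vdev tolerates two disk failures; writes stripe across both vdevs,
# roughly doubling streaming write throughput vs a single 12-disk vdev.
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 \
    raidz2 da6 da7 da8 da9 da10 da11
```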
 

sretalla

FreeNAS Expert
Joined
Jan 1, 2016
Messages
1,246
Thanks
321
#10
I also think it's probably worth saying here that a SLOG will max out at about 30GB due to timeouts, at which point you can only write at the speed it can be purged off to disk anyway, so throwing 4 NVMe drives at it seems insane.
 
Joined
Aug 22, 2019
Messages
8
Thanks
0
#11
The best thing you can do for performance is to reconfigure the pool as two raidz2 vdevs of 6 disks each. You will lose one disk of capacity but double your write speed.
What are the downsides to having multiple vdevs? I understand the core ideas - each vdev has separate redundancy - but is there anything that should be considered outside of the redundancy segmentation implications?
Does a single z2 vdev pool have something two z2 vdevs do not?

Also, looking at benchmarks on multiple sites, I did not see a significant difference between a single 12-drive z2 vdev and 12 drives split into two 6-drive z2 vdevs.
 

sretalla

FreeNAS Expert
Joined
Jan 1, 2016
Messages
1,246
Thanks
321
#12
Does a single z2 vdev pool have something two z2 vdevs do not?
Fewer drives lost to parity/greater total capacity.

Also, fewer IOPS, since random IO scales roughly with the number of vdevs.

Also, looking at benchmarks on multiple sites, I did not see a significant difference between a single 12-drive z2 vdev and 12 drives split into two 6-drive z2 vdevs.
Depends on the measurement used, since not all measurement tools take caching (ARC) and multiple threads of IO into account.
 
Joined
Aug 22, 2019
Messages
8
Thanks
0
#13
I set up two RAIDZ2 vdevs and did some tests; write speed is between 400-600MB/s, which is fine for my applications.

I am looking for a log SSD, and it seems like this one here has capacitor protection:
https://www.amazon.com/gp/product/B013GH0JCC/ref=ppx_yo_dt_b_asin_title_o03_s00?ie=UTF8&psc=1

Is this drive good for the log application?


Also, I cannot find bridge network configuration in the GUI. Do I need to run my own commands to bridge a few interfaces, or is there a way to build a simple bridge from the interface?
 

kdragon75

FreeNAS Expert
Joined
Aug 7, 2016
Messages
2,383
Thanks
554
#14
That SSD may work, but you may want two: if it fails, performance will tank for iSCSI and NFS. Also, what do you need bridging for? Bridging makes the joined interfaces act like a switch.
 
Joined
Aug 22, 2019
Messages
8
Thanks
0
#15
That SSD may work, but you may want two: if it fails, performance will tank for iSCSI and NFS. Also, what do you need bridging for? Bridging makes the joined interfaces act like a switch.
I have multiple direct 40Gb links from compute nodes to FreeNAS; I will be merging them together into one peer-to-peer storage exchange network.
Bridging these interfaces together does exactly that, and I can use one IP range for the storage network.

From what I see, this has to be done via the command line; there are no interface options. I will post the commands I used when I get this set up.
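For reference, a typical FreeBSD-style bridge setup from the shell looks roughly like this (interface names are hypothetical, and on FreeNAS these commands need to be persisted via the GUI or tunables to survive a reboot):

```shell
# Create a software bridge and attach two 40Gb interfaces to it.
# mlxen0/mlxen1 are hypothetical Mellanox interface names on FreeBSD.
ifconfig bridge0 create
ifconfig bridge0 addm mlxen0 addm mlxen1 up

# Give the bridge itself the storage-network address.
ifconfig bridge0 inet 10.0.0.1/24
```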
 

kdragon75

FreeNAS Expert
Joined
Aug 7, 2016
Messages
2,383
Thanks
554
#16
Nope. You need to read about storage best practices on VMware. Also, time to do some research on network terminology: a bridge is not the same as a LAG, EtherChannel, or LACP. You have a lot of research to do before you let anyone run anything on your storage. Please review ALL of the FreeNAS manual and do a deep dive into VMware storage requirements, concepts, and best practices. Then come back and ask more questions.

Correction: I see you understand what a bridge is. This is a BAD idea for performance reasons. All "switching" will be done in the CPU, and this is SLOW. You need a switch, as your CPU will likely struggle to keep up with 4x10Gb links, let alone 40Gb. It will also add latency to your storage network. What type of 40Gb cards are you using? You may be able to find a compatible switch second hand at reasonable prices.
 
Joined
Aug 22, 2019
Messages
8
Thanks
0
#17
Nope. You need to read about storage best practices on VMware. Also, time to do some research on network terminology: a bridge is not the same as a LAG, EtherChannel, or LACP. You have a lot of research to do before you let anyone run anything on your storage. Please review ALL of the FreeNAS manual and do a deep dive into VMware storage requirements, concepts, and best practices. Then come back and ask more questions.

Correction: I see you understand what a bridge is. This is a BAD idea for performance reasons. All "switching" will be done in the CPU, and this is SLOW. You need a switch, as your CPU will likely struggle to keep up with 4x10Gb links, let alone 40Gb. It will also add latency to your storage network. What type of 40Gb cards are you using? You may be able to find a compatible switch second hand at reasonable prices.
I am happy to look at any material you think will be useful.
I am using Proxmox for virtualization, not VMware, but I understand what you mean. In many cases shares are mounted directly inside the VM and not added as a drive in the VM host. I do not host VMs on FreeNAS, and won't until I test NVMe storage and performance.

LACP, in my view, is a way to use multiple interfaces for fault tolerance, balancing, and link aggregation - in other words, making multiple interfaces behave as one for ease of use and letting the underlying management protocols sort out how traffic is managed and routed once it reaches this virtual interface. I would call this a bond, and it is not what I am looking to do.

Bridging is creating a virtual "switch" and connecting multiple physical or virtual interfaces to it, so that devices on these interfaces can communicate.
If I have four 40Gb network interfaces on the FreeNAS box and each one goes to a VM node with multiple VMs, I can bridge them together on FreeNAS, and this allows all nodes to exchange data on the storage network using a single subnet on this high-speed link.

Yes, the FreeNAS server now becomes a "switch" fully managing significant traffic on this bridge, but this is why I have a 32-thread EPYC CPU.
Performance headroom on this server is in place to allow expansion of the network and the addition of extra capacity, as well as an NVMe array for VM hosting. And even with that, I have doubts this CPU will ever struggle.
I have done fairly large bridged configurations this way before on Linux and they worked solidly; this is just my first time working with FreeNAS.

I am using Mellanox ConnectX-3 cards; they cost about $50-80 used.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
11,781
Thanks
3,036
#18
Let me clarify: from what I understand, ZFS has two ways it can write data.
The first method involves receiving data from the application that wants it written, then sending the request to write it and, on completion, validating that the data has been written; only then is the application told that the write has completed.
The second method is a one-way dump without an integrity check. This speeds up the process, because validation is not performed (at least not immediately), but there is a higher chance of data corruption.

In both situations, when write commands are executed they are logged in a file, which allows the engine to "recreate" the sequence of events at a later time and reconstruct any data it finds missing or corrupt.

My understanding is that "buffering" in this case means providing high-speed, battery-protected storage for writing this log file, which can store write commands much faster than the main disk pool. These commands are then read from the log to write the data. Obviously, validation on the pool cannot happen until the data is written to the drives, so there is a higher chance of corruption in this case, and relying on this method absolutely requires battery-protected storage. That makes sense, as a power loss could lead to commands in RAM not being written to the log file while the application remains unaware.

It seems to me that battery-protected log storage is in general a good idea.
Your understanding is basically completely incorrect. You should flush almost every word above out of your memory and start over and you'll be better for it, sorry.

ZFS has exactly one way it writes data: it creates a transaction group every few seconds and flushes it to the pool. Full stop. Only option.

If you ask for a sync write, it will do a synchronous log commit to the ZIL to conform with POSIX requirements. This is always a slowdown. It is not a cache. It does not speed things up. If your ZIL is too slow to tolerate, you can put it on a separate device called a SLOG and it will be somewhat faster, but ALWAYS still slower than a non-sync write.

ZFS integrity checking is built in to the filesystem and is a basic feature. There is no chance of data corruption normally. If you go and tinker with settings, or devices fail and the data cannot be recovered, there are possibilities where you lose some data. That's it.

I'm disappointed no one pointed you at the SLOG primer.

https://www.ixsystems.com/community/threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/
 

kdragon75

FreeNAS Expert
Joined
Aug 7, 2016
Messages
2,383
Thanks
554
#20
Also, bridge interfaces do not work the same on FreeBSD as on Linux. I would have to go digging for the details, but it's a lot slower.
You should still be using proper switching, but you will likely hit other performance stumbling blocks first...
 