Is FreeNAS (ZFS) the right tool for the job?


rkelleyrtp

Cadet
Joined
Nov 26, 2012
Messages
8
Greetings everyone,

I am looking to build a couple of very high performance NFS servers for the data center to host 14 ESX servers totaling about 500 VMs. Each server will have 8x Samsung 860 2TB SSDs, 32G RAM, and 2x Intel 520 10G NICs. My intent is to create a single RAID-10 device (8TB usable) for maximum performance (IOPS). I don't need sub-volumes, snapshotting, remote replication, etc., as I already have a backup solution in place.

While I am sure FreeNAS is up to the task, I am wondering if ZFS is the right tool for the job. Based on what I have read, VMFS datastores should not be put on COW systems; instead they should live on XFS or EXT4 filesystems. That said, what features will I gain with ZFS (without COW) over standard mdadm RAID-10?

Thanks.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
ZFS (without COW)
There is no ZFS without COW. What will you gain with it? Data integrity, for one--everything is checksummed on write. It's trivially easy to expand the pool with another mirrored pair, for another.
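
For reference, growing a pool of striped mirrors is a one-liner; a rough sketch, assuming a pool named "tank" and two new drives showing up as da8/da9 (placeholder names):

  # Add another mirror vdev; ZFS stripes new writes across all vdevs
  zpool add tank mirror /dev/da8 /dev/da9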
 

rkelleyrtp

Cadet
Joined
Nov 26, 2012
Messages
8
Thanks for the reply. The servers have a max of 8 drive bays, so expanding the pool later isn't really an option anyway. As for data integrity, we keep lots of backups, and, if a drive starts to fail, it will get replaced immediately. I understand the case for bit-rot, but from what I can see thus far, moving from XFS to ZFS won't provide much (if any) gain.

That said, will ZFS give us more performance than mdadm if we use RAID-5 (or RAID-6) instead of RAID-10?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
if a drive starts to fail, it will get replaced immediately.
...once you can tell that it's starting to fail. Bit-rot really is the point, and btrfs is the only other FS that even tries (if poorly) to address it. Without checksums (hashes, actually), there's no way to know that what you read from the disk is the same as what you last wrote there.
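
If you ever want to see that checking in action, a scrub re-reads every block and verifies it against its checksum. A minimal sketch, assuming a pool named "tank":

  # Re-read and verify every block in the pool
  zpool scrub tank
  # Checksum errors (and the affected files) show up here
  zpool status -v tank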

I can't directly address speed comparisons, other than to say that, generally, ZFS doesn't place top priority on performance.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Each server will have 8x Samsung 860 2TB SSDs, 32G RAM, and 2x Intel 520 10G NICs.

If you're hoping to put fourteen hosts and 500 VMs on these servers, you'll want more than 32GB of RAM each. How many storage servers are you planning to build overall, and how come you're thinking "several independent ones" rather than "one beefy monster box"?

Regarding your drives, are these 860 Pro or 860 Evo? The Pro is 3D MLC and has a higher write endurance rating than the 3D TLC Evo, which you'll want if these are going to be hammered with writes. (I assume that high-endurance "enterprise" drives like the PM863a/SM863a aren't in the budget.)

Only other thing I might make room for is an Optane SLOG device. Yes, even with all-SSD vdevs, you can still benefit from SLOG, and Optane is fast enough to keep up with 10GbE.
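
For what it's worth, a SLOG can be attached to an existing pool at any time; a rough sketch, assuming a pool named "tank" and the Optane card showing up as nvd0 (placeholder name):

  # Dedicated log vdev to absorb the sync writes that ESXi-over-NFS generates
  zpool add tank log /dev/nvd0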

Based on what I have read, VMFS datastores should not be put on COW systems

Have you got a link to these articles? I'd like to do some reading.

With regards to backups/data integrity/etc - remember that a backup of bad data is still bad data, and that speed is irrelevant if you're just returning the wrong answer really fast. ZFS does have a high cost of entry to "doing it right" but the results are pretty excellent in my experience.
 

rkelleyrtp

Cadet
Joined
Nov 26, 2012
Messages
8
If you're hoping to put fourteen hosts and 500 VMs on these servers, you'll want more than 32GB of RAM each. How many storage servers are you planning to build overall, and how come you're thinking "several independent ones" rather than "one beefy monster box"?

I already have the servers on hand, thus I don't want to purchase "one beefy monster box". Also, it is better to have a few smaller servers than one large server (upgrades/maintenance, unexpected outages, etc.).

Regarding your drives, are these 860 Pro or 860 Evo? The former is 3D MLC and has a higher write endurance rating than the latter which is 3D TLC, which you'll want if these are going to be hammered with writes. (I assume that high-endurance "enterprise" drives like the PM863a/SM863a aren't in the budget.)

Only other thing I might make room for is an Optane SLOG device. Yes, even with all-SSD vdevs, you can still benefit from SLOG, and Optane is fast enough to keep up with 10GbE.
Thanks, I will keep this in mind.

Have you got a link to these articles? I'd like to do some reading.
Here are a few I have been reading:
https://www.reddit.com/r/linuxadmin/comments/35aomm/kvm_with_zfs
https://www.reddit.com/r/zfs/comments/5n2hrq/zfs_cow_ok_for_databasesvms/

To be fair, these articles are a little dated and generally refer to running VM storage on COW filesystems like BTRFS and ZFS. Maybe things have changed recently - especially on an all-SSD system, where fragmentation is less of an issue?

Also, while I think ZFS is cool, past experience using ZFS on Linux (0.6.5 days) was painful (16-bay Supermicro server with 16x 2TB Seagate enterprise drives, 32G RAM):
* Required too much RAM to operate properly
* Read/write performance tended to degrade over time (from hundreds-MB/sec to MB/sec) for no apparent reason
* Took too much time to properly debug performance issues

After spending way too much time trying to tune the box, I reformatted the system with mdadm/XFS and moved on. No more performance issues.

At this point, I am (re)evaluating ZFS to see if it will perform as well as XFS on these high-spec'd systems.

With regards to backups/data integrity/etc - remember that a backup of bad data is still bad data, and that speed is irrelevant if you're just returning the wrong answer really fast. ZFS does have a high cost of entry to "doing it right" but the results are pretty excellent in my experience.
Agreed and understood.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
TL;DR: You have several systems. Install FreeNAS on one, Linux/mdadm+xfs on another, maybe a third wildcard system, and then THUNDERDOME!

I already have the servers on hand, thus I don't want to purchase "one beefy monster box". Also, it is better to have a few smaller servers than one large server (upgrades/maintenance, unexpected outages, etc.).

Gotcha. Not having a single point of failure is good, separating workloads is also good, but having too many "islands of storage" isn't. Are you thinking "two or three" or "six or seven"?

For RAM, is there an option to upgrade beyond 32GB of RAM or are they something like early E3 Xeons that max out at that amount? Even with all-flash back-end, RAM is still a whole lot faster than SSD, and I'd say shoot for 64GB or more.

Here are a few I have been reading:
https://www.reddit.com/r/linuxadmin/comments/35aomm/kvm_with_zfs
https://www.reddit.com/r/zfs/comments/5n2hrq/zfs_cow_ok_for_databasesvms/

To be fair, these articles are a little dated and generally refer to running VM storage on COW filesystems like BTRFS and ZFS. Maybe things have changed recently - especially on an all-SSD system, where fragmentation is less of an issue?

I was hoping more for "case study" than "Reddit post" ... those are indeed old news, but even in them I see people reporting that their KVM on ZFS setup works fine, which lines up with my personal experience (as well as VMFS on ZFS) - you just need to make sure you set up your pools/vdevs/datasets and their tunables correctly. (See bottom of post.) Fragmentation on all-SSD is less of an issue because "seek time" is effectively 0ms.

Also, while I think ZFS is cool, past experience using ZFS on Linux (0.6.5 days) was painful (16-bay Supermicro server with 16x 2TB Seagate enterprise drives, 32G RAM):
* Required too much RAM to operate properly
* Read/write performance tended to degrade over time (from hundreds-MB/sec to MB/sec) for no apparent reason
* Took too much time to properly debug performance issues

After spending way too much time trying to tune the box, I reformatted the system with mdadm/XFS and moved on. No more performance issues.

At this point, I am (re)evaluating ZFS to see if it will perform as well as XFS on these high-spec'd systems.

ZoL has improved quite a bit since then; I can't recall the specifics, but I believe there were quite a few bugs in the code back then that could cause weird performance issues. In regards to overall performance, FreeNAS does a good job with its default tunables for most scenarios, but assuming you go ahead with this one you'd probably want to change a few things (a rough command-line sketch follows the list below).

1. I'm reasonably sure that all Samsung 3D NAND drives use an 8KB internal page size. You'll want to ensure that your vdevs and pool are created with ashift=13. Otherwise, your drives will all be doing a read-modify-write for every 4K block.

2. Set recordsize=16K on the NFS export datasets. Otherwise, when you build a VM there, it will create the vdisk with big chunky 128K records. 8K would match the ashift size directly, but your sequential throughput will suffer.

3. atime=off - you don't need to update this every time you read from a .VMDK

4. Set the tunable vfs.zfs.metaslab.lba_weighting_enabled: 0 - by default the ZFS metaslab allocator treats all drives like spinning platters, where the outer tracks move more data per rotation and give better performance. Since you have all-flash, you can turn off the LBA weighting.
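
Pulling the four items above together, here's a rough command-line sketch. The pool name "tank", the dataset "tank/vm-nfs", and the da0-da7 device names are placeholders, and the FreeNAS GUI (plus its Tunables page) covers the same ground; on FreeBSD/FreeNAS the ashift of new vdevs is steered by the min_auto_ashift sysctl rather than a pool flag:

  # 1. Force 8K sectors (ashift=13) for vdevs created from here on
  sysctl vfs.zfs.min_auto_ashift=13

  # Striped mirrors across the eight SSDs
  zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7

  # 2 & 3. Dataset for the NFS export: 16K records, no atime updates
  zfs create tank/vm-nfs
  zfs set recordsize=16K tank/vm-nfs
  zfs set atime=off tank/vm-nfs

  # 4. Disable rotational LBA weighting on an all-flash pool
  #    (add it as a sysctl tunable in the GUI so it persists across reboots)
  sysctl vfs.zfs.metaslab.lba_weighting_enabled=0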

Again, back to my TL;DR at the top though; you have several systems and presumably no one breathing down your neck to "implement this right now!"

Have yourself a Storage War!
 

rkelleyrtp

Cadet
Joined
Nov 26, 2012
Messages
8
TL;DR: You have several systems. Install FreeNAS on one, Linux/mdadm+xfs on another, maybe a third wildcard system, and then THUNDERDOME!

For RAM, is there an option to upgrade beyond 32GB of RAM or are they something like early E3 Xeons that max out at that amount? Even with all-flash back-end, RAM is still a whole lot faster than SSD, and I'd say shoot for 64GB or more.


Thanks for the great information! I am looking at deploying 3 servers based on the Supermicro X9DRW-7/iTPF motherboard - each with a single Intel E5-2609 v2 (4-core 2.5GHz) CPU. The RAM can be upgraded to 1TB if necessary, and I already have a box full of spare RAM if I need to go to 64G.

Do you have any experience with running VMFS datastores on recent ZFS installs? Again, my main concern is performance down the road. Removing a server from production is not an easy chore - regardless of the testing we do up front.

Targeting the VMFS use case, what ZFS config would you recommend (mirrored vdevs, RAIDZ2, RAIDZ3, etc.)? I need a good balance of IOPS vs. storage space - leaning more toward IOPS since I have 3 servers. The servers will be using an LSI pass-through controller (no HW RAID) on a PCIe x16 riser card.

Finally, should I use ZFS's built-in NFS sharing or the OS NFS server? I am hoping to leverage NFS v4.1 for maximum performance.

Thanks again for the great input. I hope we can leverage ZFS if it performs well.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks for the great information! I am looking at deploying 3 servers based on the Supermicro X9DRW-7/iTPF motherboard - each with a single Intel E5-2609 v2 (4-core 2.5GHz) CPU. The RAM can be upgraded to 1TB if necessary, and I already have a box full of spare RAM if I need to go to 64G.

My advice would be "stuff as much RAM as you can in there" - ZFS loves memory, and the more you have, the better your ARC hit rate will be.

Do you have any experience with running VMFS datastores on recent ZFS installs? Again, my main concern is performance down the road. Removing a server from production is not an easy chore - regardless of the testing we do up front.

Yes, although most of them are still disk-based vdevs. It's not so much "performance degradation over time" as "performance degradation as it fills". One array I overbuilt from day one: it's 12x 1TB SAS drives in mirrors, but of that 6TB pool I've only carved out about 3TB worth of zvols, and LZ4 compression mashed that down to about 1.6TB allocated. It still runs about the same as it did five years ago.
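
(If you're curious what compression is buying you on a pool like that, the per-dataset numbers are easy to check; "tank" here is a placeholder pool name:)

  # compressratio compares logical data to what's actually allocated
  zfs get compression,compressratio,used,logicalused tank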

With all-flash, you can get away with a fuller pool and feel less of an impact, because you don't have the same massive seek penalty for highly fragmented data - but you still have to give your NAND some "breathing space" to handle garbage collection. You could do this by under-provisioning the drives themselves, using tools to adjust the Host Protected Area (e.g., making it so that a 960GB drive is presented to the OS as 800GB); the drive's internal wear-leveling and garbage collection will then always have a bit of "slack space" to handle things quickly if you hit them with a burst of writes.
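
As a rough illustration only: I believe FreeBSD's camcontrol has an hpa subcommand for this (hdparm -N is the approximate Linux equivalent). The device name and sector count below are made up for a "960GB presented as 800GB" example:

  # Show the drive's native vs. currently accessible sector counts
  camcontrol hpa ada0
  # Limit the visible capacity to ~800GB (800e9 / 512-byte sectors);
  # the hidden NAND becomes extra over-provisioning for wear leveling and GC
  camcontrol hpa ada0 -s 1562500000 -y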

Targeting the VMFS use case, what ZFS config would you recommend (mirrored vdevs, RAIDZ2, RAIDZ3, etc.)? I need a good balance of IOPS vs. storage space - leaning more toward IOPS since I have 3 servers. The servers will be using an LSI pass-through controller (no HW RAID) on a PCIe x16 riser card.

Personally, I always use mirrors for live VMs. Losing 50% of your available space hurts, but it's the highest performance solution. Rebuilds are also much faster on mirror vdevs.

RAIDZ2+ is okay for backup targets or a datastore that's nothing but an ISO repository, but I wouldn't dream of running VMs with actual performance aspirations off it. I won't use RAIDZ1 anymore at all.

Finally, should I use ZFS's built-in NFS sharing or the OS NFS server? I am hoping to leverage NFS v4.1 for maximum performance.

Thanks again for the great input. I hope we can leverage ZFS if it performs well.

Unfortunately there's no pNFS support in FreeBSD 11; I believe it's milestoned for FreeBSD 12. That said, I don't think you'll have performance issues with aggregated 10GbE and multiple hosts. There's another user in a similar situation, and he uses src-mac load balancing from the hosts to ensure that all of the links in the LACP bond get used at least somewhat. Tagging him in - @Elliot Dierksen

End rambling.
 
Elliot Dierksen

Joined
Dec 29, 2014
Messages
1,135
There's another user in a similar situation, and he uses src-mac load balancing from the hosts to ensure that all of the links in the LACP bond get used at least somewhat. Tagging him in - @Elliot Dierksen

I have two FreeNAS servers that each have a 2-port LACP LAGG to my storage network, and they share via NFS. Each of my ESXi servers has a 2-port 10G NIC with one port going to the storage network and the other going to the vMotion network. A LAGG doesn't bond, it load balances. That means one outbound conversation goes on one particular member of the LAGG. In most cases, you would load balance based on destination, but that doesn't work in this scenario: everything in the LAGG is going towards the single IP/MAC address of the FreeNAS. To fix this, I configure the load balancing method in the 10G switch to balance based on source MAC going towards FreeNAS. That said, I haven't ever managed to get over 8.5G throughput so I have never seen any actual congestion. I am pretty happy with mid 8G performance.
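
(For reference, the FreeNAS side of that is just an LACP lagg; the GUI builds it for you, but the underlying FreeBSD commands look roughly like this, with ix0/ix1 as placeholder NIC names. The src-mac hashing itself is configured on the switch, and the exact syntax varies by platform.)

  # LACP aggregate over the two 10G ports
  ifconfig lagg0 create
  ifconfig lagg0 up laggproto lacp laggport ix0 laggport ix1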
 

rkelleyrtp

Cadet
Joined
Nov 26, 2012
Messages
8
I have two FreeNAS servers that each have a 2-port LACP LAGG to my storage network, and they share via NFS. Each of my ESXi servers has a 2-port 10G NIC with one port going to the storage network and the other going to the vMotion network. A LAGG doesn't bond, it load balances. That means one outbound conversation goes on one particular member of the LAGG. In most cases, you would load balance based on destination, but that doesn't work in this scenario: everything in the LAGG is going towards the single IP/MAC address of the FreeNAS. To fix this, I configure the load balancing method in the 10G switch to balance based on source MAC going towards FreeNAS. That said, I haven't ever managed to get over 8.5G throughput so I have never seen any actual congestion. I am pretty happy with mid 8G performance.

Thanks for the info. In my case, I have 2x dual-port Intel 520 10G NICs connected to a pair of Cisco Nexus 3Ks running vPC. My goal is to run LACP (Linux bond mode 4) to the switches - much like I set up all our other Linux boxes. I realize a single stream won't get above 10Gb/sec, but I do expect more than 10Gb/sec in aggregate with lots of clients connected.
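
(As a sketch of that bond with iproute2, using eth0/eth1 as placeholder interface names - in practice this would live in the distro's network configuration:)

  # 802.3ad (LACP) bond; layer3+4 hashing spreads different client flows across links
  ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4
  ip link set eth0 down && ip link set eth0 master bond0
  ip link set eth1 down && ip link set eth1 master bond0
  ip link set bond0 up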
 

RickH

Explorer
Joined
Oct 31, 2014
Messages
61
I run a total of 6 FreeNAS servers that serve as VMware datastores for 14 ESXi hosts; however, I use iSCSI instead of NFS for access, so I can't comment on all of your setup...

Based on what I have read, VMFS datastores should not be put on COW systems; instead they should live on XFS or EXT4 filesystems.

This isn't entirely accurate; VMware datastores backed by ZFS are quite capable as long as your system is implemented correctly:
  • As others have stated, 32GB RAM isn't enough - every one of my FreeNAS boxes has at least 128GB. I would consider 64GB an absolute minimum, but keep in mind that RAM will be the single biggest performance boost to your system (assuming your network can keep up) - 256GB or higher isn't unreasonable if you're looking for maximum performance.
  • I'm sure you've read about the performance degradation when you fill your pool past 80% - I've seen this firsthand, and would say that for a pool storing VM data you should treat 80% as an absolute hard limit. In practice, there is a noticeable drop in performance once the pool hits about 65% utilization, and it absolutely falls off a cliff at 80%.
  • The biggest issue with COW is fragmentation over time - having a pool entirely backed by SSDs will greatly limit the effect of fragmentation. My practice for a pool that's severely fragmented is to migrate the VMs to different storage and then recreate the pool. I've only had to do this with 2 of my datastores in over 3 years of using FreeNAS as backing storage (both pools held the datastores for highly active SQL servers).
  • The performance of your pool is going to increase with each vdev you add. Most of my systems are made up of 12 spinning 7200rpm SATA drives with SSD-backed SLOG and L2ARC devices. I've experimented with configs consisting of six 2-drive mirrored vdevs, four 3-drive RAIDZ vdevs, two 6-drive RAIDZ2 vdevs, and one 12-drive RAIDZ2 vdev... The difference in speed across the different configs was quite dramatic. I've settled on the six 2-drive mirror setup for my SQL datastores and the two 6-drive RAIDZ2 config for the less performance-intensive setups (a quick sketch of both layouts follows this list).
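
A quick sketch of those two layouts with made-up pool and device names (they're alternatives, not meant to coexist on the same disks), plus the zpool list columns I watch for the utilization point above:

  # Six 2-drive mirrors: best IOPS and fastest rebuilds, 50% space efficiency
  zpool create fastpool \
      mirror da0 da1 mirror da2 da3 mirror da4 da5 \
      mirror da6 da7 mirror da8 da9 mirror da10 da11

  # Two 6-drive RAIDZ2 vdevs: better space efficiency, fewer IOPS
  zpool create bulkpool \
      raidz2 da0 da1 da2 da3 da4 da5 \
      raidz2 da6 da7 da8 da9 da10 da11

  # Keep an eye on fill level and fragmentation against the 65%/80% guidance above
  zpool list -o name,size,allocated,capacity,fragmentation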

A LAGG doesn't bond, it load balances... ...That said, I haven't ever managed to get over 8.5G throughput so I have never seen any actual congestion. I am pretty happy with mid 8G performance.

This is the reason I chose iSCSI over NFS... The LACP standard for load balancing works best with many different clients accessing many different endpoints and simply isn't effective in a storage network scenario. Utilizing iSCSI multipathing provides a much more controllable and effective method of distributing the load over the available network bandwidth. I've been able to get read/write performance of over 1,600 MiB/sec from a VM utilizing a FreeNAS SSD-backed pool over iSCSI multipathing using a dual-port Intel 520 NIC... Just something to consider.
 
Elliot Dierksen

Joined
Dec 29, 2014
Messages
1,135
I've been able to get read/write performance of over 1,600 MiB/sec from a VM utilizing a FreeNAS SSD-backed pool over iSCSI multipathing using a dual-port Intel 520 NIC... Just something to consider.

Understood. You probably have a much higher IOPS requirement than I do. My stuff is mostly a lab, so the requirements are not too demanding. I also like being able to see all the files in the file system on the FN box when I SSH into it. There is also the laziness factor: I already understand NFS. I would have had to figure a few more things out to do iSCSI, and it just wasn't worth it to me since NFS was meeting my needs. Good to know that the iSCSI stuff is capable of that level of throughput.
 