My FreeNAS Server with lots of Memory, tweaks?


Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
I have now been running FreeNAS at home for a couple of years. Performance is great, but I was wondering if anyone had suggestions for tweaking my setup. I am even willing to pay a mod or admin on the forums to log into my environment and see if anything can be improved. No real reason, but I thought I could learn a thing or two about getting the most out of my setup. I currently use iSCSI for VMware (I really should move to NFS at some point, since I do care about my data) and CIFS for my archive storage. I did enable autotune, but it looks like I should raise vfs.zfs.arc_max to a higher number? For the most part everything runs on a 10 Gigabit network. Thanks guys!
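For reference, the current values can be checked from a shell like this (a sketch; vfs.zfs.arc_max takes a value in bytes, and on FreeNAS it is normally entered as a loader tunable in the GUI rather than set by hand):

# Actual ARC size right now, the configured ceiling, and installed RAM (all in bytes)
sysctl kstat.zfs.misc.arcstats.size
sysctl vfs.zfs.arc_max
sysctl hw.physmem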

Server Hardware
IBM x3650 M3
CPU: 2x Intel Xeon X5690 3.47GHZ
Memory: 288GB (18x 16GB 2Rx4 PC3-10600R-9-10-NO)
RAID Card: IBM M1015 flashed to LSI 9211-8i IT mode
Network: Dell XR997 single-port PCIe 10GbE

Storage Array
2x SUPERMICRO CSE-846E16-R1200B (no motherboard; used only to hold and power the drives)
SAS Disks: 18x Seagate 300GB 15K SAS
SATA Disks: 12x Seagate 3TB Desktop, 6Gb/s, 64MB cache

Switch
Netgear 12-Port ProSafe 10 Gigabit

ZFS Layout
I have two volumes set up; both are RAIDZ2.

The picture below shows my first volume setup for my 28 virtual machines using iSCSI, 18 disks total.
[screenshot: iSCSI/VM volume layout]


The next volume is my archive storage, shared out over CIFS, using 12 disks total.
[screenshot: archive volume layout]


Tunables (Enable autotune: yes)
[screenshot: tunables]


Current Memory info
[screenshot: memory info]


Current ZFS info
[screenshot: ZFS/ARC info]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The biggest thing that's probably hurting you is the VMware storage in RAIDZ2. This limits your number of IOPS to approximately that of a single drive.

Unlisted is the fullness and age of that VMware datastore - and/or the results of "zpool get fragmentation", all of which could give insight especially into write performance (you have lots of ARC to help boost reads).

autotune, especially on older systems, does not do a good job of maximizing ARC on large-memory systems; as you can see, arc_max is set at ~170G and could probably be 250G+.
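For example, from a shell (a sketch; "tank" is a placeholder for the actual iSCSI pool name):

# Fullness and fragmentation of the VM pool
zpool list tank
zpool get capacity,fragmentation tank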
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
Thanks jgreco, I thought the ARC was a little low lol, I'll go ahead and increase it. Do you have any suggestions on a better layout for my VMware storage? So far I've found that as long as I create my vmdks eager zeroed thick, performance has been just fine. But to be fair, I have to assume that's because I have a lot of memory and I am using iSCSI with sync writes disabled (shame on me).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Eager zero writing is going to have exactly zero performance improvement for you.
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
When I first built all my VMs, the vmdks were all thin provisioned, but unfortunately the VMs had terrible performance. So I changed them over to eager zeroed thick and everything was fine. I'll go back and give thin provisioning another chance to see if I run into the same issue. I have vCOps/Operations Manager set up, so it should tell me if I am crazy.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
zeros compress to nothing, so even if you wrote 100TB of zeros as part of thick provisioning, you'd really be writing like 5MB of actual pool data.

ZFS is also copy on write, so even if you ignore the above statement, the first time you had to write actual real-world data to the VM's virtual disk, it would be written to a *new* location, and fragmentation is guaranteed.

There's plenty of other reasons if you want to get really really detailed, but the short answer is that there's plenty of ways to recognize that eager zeros won't buy you anything at all. The only good reason to thick provision is because the vmdk's total disk space is allocated as far as the datastore is concerned. So you can't try to create 10TB of vmdks on a 5TB datastore and never expect to run out of space. ;)
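If you want to verify that on your own pool, the space accounting shows it (a sketch; the zvol name is a placeholder, and this assumes compression is actually enabled on the dataset):

# logicalused counts the zeros as written; used shows what they actually cost after compression
zfs get compression,compressratio,used,logicalused tank/vmware-zvol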
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
Thanks for the info cyberjock. Tomorrow I'll do a Storage vMotion to put the disks back to thin. At the time I was running Splunk, a couple of SQL Servers, and vCOps, which were my heavy I/O hitters. My understanding of thin provisioning was that it has to go out and validate that the block is available, and if not, erase and write. So in my mind that extra step was causing the slowness, but it was just a theory; I should have looked into it more at the time.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
zeros compress to nothing, so even if you wrote 100TB of zeros as part of thick provisioning, you'd really be writing like 5MB of actual pool data.

ZFS is also copy on write, so even if you ignore the above statement, the first time you had to write actual real-world data to the VM's virtual disk, it would be written to a *new* location, and fragmentation is guaranteed.

There's plenty of other reasons if you want to get really really detailed, but the short answer is that there's plenty of ways to recognize that eager zeros won't buy you anything at all. The only good reason to thick provision is because the vmdk's total disk space is allocated as far as the datastore is concerned. So you can't try to create 10TB of vmdks on a 5TB datastore and never expect to run out of space. ;)

Except that compression hoses that ... since pool space is only allocated for the compressed blocks, you can create 10TB of zero-filled vmdks on a 20TB VMFS datastore that is running on a 30TB ZFS pool, and then, because some plonker decided to put 15TB of backup files on the pool, as you write real data to those vmdk's, you fill the pool. Screwed again... just at a different (and more dangerous) level.

This isn't a simple topic.

Thin provisioned vmdks (which effectively let VMware handle sparseness) are similar in principle to UNIX sparse files, but in practice the implementation is a lot different because VMFS is a cluster-aware filesystem. A large part of the goal of VMFS is to avoid lots of metadata updates to the VMFS filesystem itself, effectively just locking all the blocks within a vmdk file for exclusive access by the hypervisor that is hosting the associated VM. Adding blocks to a sparse file is a tricky business, and VMFS deals with this by allocating a new set of blocks and appending them to the vmdk, which means fragmentation, since the zero-filled blocks being written to by the VM are allocated space out at the end of the vmdk. There's also some bookkeeping overhead associated with this, so it may involve extra IOPS. In practice the effect usually isn't too bad unless you come up against a pathological workload, and there are possible minor benefits from not having to read regions that VMFS knows are unallocated.

Thick provisioned disks have the downside that unused blocks are stored on the filer and have to be transmitted over the network even though they're empty. One of the commonly given reasons to do thick is to guarantee the availability of space for the vmdk, but ZFS with compression (or dedup) breaks this. And of course ZFS writes cause fragmentation, so even though blocks might appear to be contiguous on the VMFS datastore, when you look at the ZFS backing store, the physical blocks aren't contiguous once you've written to them a second time.

So this is all a very complex system to fully wrap your head around. My best guess is that thick is still a good bet for peak performance because there's less indirection going on (at the VMFS layer), but for a typical VM there may not be a compelling reason not to use thin.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Do you have any suggestions on a better layout for my VMware storage?

Well, the strange thing in your layout is that you essentially have three disks' worth of IOPS and then a massive ARC. If it is doing what you need, then maybe you've arrived at an optimal solution for your workload and space needs. Don't assume you need to change just because you could do it differently.

I think the interesting thing would be to understand whether you're getting hit by fragmentation. With a bazillion GB of ARC it probably isn't heavily impacting reads but could be hurting writes. ZFS "fixes" fragmentation performance issues by throwing ARC at it, but in reality that only fixes read performance, and only if you throw enough ARC at it ... which in your case I'm guessing is the case. Write performance is a function of pool free space.

Your current setup has the distinct advantage that it is able to withstand the failure of a minimum of two disks; the loss of any single disk does not compromise redundancy, and the loss of two disks within a single vdev is still tolerable.

You can get better IOPS but have less space by reorganizing your pool as six vdevs of three-way mirrors (1.8TB usable). You would retain the level of redundancy you currently have.

You can get even better IOPS and have somewhat more space by reorganizing your pool as nine vdevs of simple mirrors (2.7TB usable). You lose redundancy if any single disk fails.

The tradeoffs are all miserable. To get lots of space you either lose IOPS or redundancy. To get lots of IOPS you lose either space or redundancy. To get lots of redundancy you lose either IOPS or space. Classic "pick any two."

The other comment I guess I'd make is this: your SAS disks are extremely small. They might be fast (probably ARE fast). However, ZFS performance degrades as a pool is filled. For iSCSI, that's probably around 50% capacity. I have no idea how full your pool is, but if it's near (or past) 50%, another optimization would be to replace those disks with much larger disks. Replacing them all with 3TB SATA drives would give massive amounts of free space, and a massive performance boost. Even if your SAS drives are 15K and the replacement drives are all 5400/5900 RPM drives. This is essentially cheating at the "pick any two" game because we've transformed space into a variable ... you can have IOPS *and* redundancy.
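To make the mirror options concrete, the two layouts would look roughly like this from the CLI (a sketch only; device names are placeholders, and on FreeNAS you would normally build this through the GUI volume manager after destroying and restoring the pool):

# Nine two-way mirrors: best IOPS, 2.7TB usable, but losing both disks in any vdev kills the pool
zpool create tank \
  mirror da0 da1   mirror da2 da3   mirror da4 da5 \
  mirror da6 da7   mirror da8 da9   mirror da10 da11 \
  mirror da12 da13 mirror da14 da15 mirror da16 da17

# Six three-way mirrors: 1.8TB usable, every vdev survives two disk failures
# zpool create tank \
#   mirror da0 da1 da2    mirror da3 da4 da5    mirror da6 da7 da8 \
#   mirror da9 da10 da11  mirror da12 da13 da14 mirror da15 da16 da17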

But to be fair, I have to assume that's because I have a lot of memory and I am using iSCSI with disabled sync writes (shame on me.)

As long as you understand the risks associated with disabling sync writes, ...
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Except that compression hoses that ... since pool space is only allocated for the compressed blocks, you can create 10TB of zero-filled vmdks on a 20TB VMFS datastore that is running on a 30TB ZFS pool, and then, because some plonker decided to put 15TB of backup files on the pool, as you write real data to those vmdk's, you fill the pool. Screwed again... just at a different (and more dangerous) level.

Well, that shouldn't be a problem, especially on 9.3.

I learned a painful lesson a few weeks ago when I moved a test system to TrueNAS 9.3.

Here's what I learned on a pool with 4x1TB drives in 2 mirrored vdevs and a 1TB zvol.

1. The zvol will immediately reserve the full 1TB, so you shouldn't have backup files creating problems with running out of space.
2. If you have, say, 600GB of data actually allocated (not just reserved) and you take a snapshot, the refreservation (set to 1TB at zvol creation) will make sure you *still* have 1TB of free pool space that is reserved in case the zvol ends up with all new unique data.
3. If you have about 300GB of other random files on the zpool, plus 600GB allocated to the 1TB zvol, and you do zfs replication to replicate the 1TB zvol to the said pool as part of the data migration...


300GB (files) + 600GB (zvol allocated) + 1TB refreservation for the zvol = oh shit I'm out of f*cking space

At this point the OS throws up all over itself because it doesn't even have free space to do the .system dataset writes when required. Took some single user magic to fix that shiz.

The crappy reality is that the replication to this server went okay, but the second the replication finished and the zvol was "committed" to the pool, I happened to have something like 3MB of free space left. It was pretty hilarious how close it all came to a pool that was fully allocated. So I went from a pool that was something like 15% full to a pool that was 99.7% full in a split second. Wasn't even enough time for the WebGUI to do the red light and warn me that I was taking it up the rear.

Ultimately, I trashed the refreservation for the zvol since I never do snapshots except for replication anyway and I have no intention of actually storing real files on the server.
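For anyone who wants to do the same, it's just a property change (a sketch; the zvol path is a placeholder, and with the reservation gone the pool can be over-committed, so watch free space):

# What the zvol reserves versus what it has actually written
zfs get volsize,refreservation,usedbydataset,usedbyrefreservation tank/vm-zvol
# Drop the reservation so only written blocks count against the pool
zfs set refreservation=none tank/vm-zvol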
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
...So this is all a very complex system to fully wrap your head around. My best guess is that thick is still a good bet for peak performance because there's less indirection going on (at the VMFS layer), but for a typical VM there may not be a compelling reason not to use thin.

So maybe I am not so crazy; I guess I'll find out today when I move them back to thin provisioned disks. I have trending for the last two years, so vCOps should scream at me if it detects any I/O issues. In my experience, depending on the storage technology, anything that required high I/O like Splunk, SQL, etc. had to be eager zeroed thick to meet the I/O requirements. On our big storage arrays like the NetApp and EMC we have always been able to get away with thin provisioned disks without seeing any issues; we only found issues on much smaller arrays for small/medium business, where we were forced to do eager zeroed thick.

...For iSCSI, that's probably around 50% capacity. I have no idea how full your pool is, but if it's near (or past) 50%, another optimization would be to replace those disks with much larger disks.

Currently I'm using 36% of my SAS pool and won't let it go beyond 48%; I'll add another vdev with six disks if that happens.

You can get better IOPS but have less space by reorganizing your pool as six vdevs of three-way mirrors (1.8TB usable). You would retain the level of redundancy you currently have.

I think I'll go down this route by the end of the year once I buy more 300GB 15K SAS drives. At the time I wanted to spend my money on memory rather than disks, and wasn't sure what my growth rate would be.

Thanks for all the great advice.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Consider doing SSDs instead of 15kRPM drives. You'll see much better performance. ;)
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
Consider doing SSDs instead of 15kRPM drives. You'll see much better performance. ;)

I was actually thinking of doing that; right now I can buy my SAS drives at a much lower cost than SSDs. I was going back and forth trying to figure out if the SATA controller would be an issue since it can only handle one task at a time. But the way FreeNAS works, I wouldn't think that would be an issue for performance; it's just kind of an expensive test to find out lol.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
An M1015 won't saturate all eight 6Gb/s links at once, but it should do well enough to saturate the PCI-e 2.0 x8 connection.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I was actually thinking of doing that; right now I can buy my SAS drives at a much lower cost than SSDs. I was going back and forth trying to figure out if the SATA controller would be an issue since it can only handle one task at a time. But the way FreeNAS works, I wouldn't think that would be an issue for performance; it's just kind of an expensive test to find out lol.


This isn't IDE and it isn't 1995 anymore. Modern SATA drives support NCQ and have for years.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
1. The zvol will immediately reserve the full 1TB, so you shouldn't have backup files creating problems with running out of space.

I imagine that depends on whether or not you create it as a sparse zvol.
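From the CLI that's just the -s flag; the FreeNAS GUI exposes the same choice as a "sparse volume" checkbox (a sketch with placeholder names and sizes):

# Regular zvol: refreservation is set to the full volsize up front
zfs create -V 1T tank/vm-thick
# Sparse zvol: no refreservation; pool space is only consumed as blocks are written
zfs create -s -V 1T tank/vm-sparse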
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
True, but I shouldn't have used the word task; I meant full duplex for SAS vs. half duplex for SATA.

That's like worrying about whether your drive uses longitudinal magnetic recording or perpendicular magnetic recording. It's a big fat "who the heck cares." Both SAS and SATA link speeds are vastly greater than the speed of the underlying mechanism to read and write data, and ZFS is accessing the pool in parallel when possible anyways.

SAS will tend to be a little bit faster (actually lower latency) due to the duplex issue, sure, but SAS drives also tend to be a lot faster because they tend to have higher RPM, tend to have faster seek times, and may have deeper command queues. This is a win for traditional servers where you're requesting individual blocks all over the place. However, with ZFS, a lot of that fades, because ZFS is usually asking for large amounts of data (up to 128K blocksize!), and ZFS is fronting an array of drives with a massive ARC and a massive transaction group buffer for writes, so especially for the duplex issue, that benefit to SAS is seriously minimized.

Which gets back to my SATA 3TB suggestion, and this one:

Consider doing SSDs instead of 15kRPM drives. You'll see much better performance. ;)

which is "true" but an expensive way to get there.

The problem is that the SAS speed differential vs SATA is at best only maybe 2X, whereas the ZFS tax for writing random block data to a fullish pool can be kind of oppressive. The Delphix guys have a great blog post on this. Look in particular at the last graph, steady state performance. And let's just pretend your current array is his so we can talk numbers. The throughput at 50% pool capacity is approximately 1000. Now, let's ditch the SAS drives and replace all your drives with SATA 3TB drives. This gives you a 10X larger pool, but, for the point I'm about to make, you need to be using the same amount (not percentage, but rather amount) of space. If you had a 3TB pool with 1.5TB used before, now you have a 30TB pool with 1.5TB used. That's 5% full. The Delphix graph doesn't address that, but it seems clear that it should be in the at-least 8000 throughput range. Now, let's aggressively tax that at 50% for the use of SATA instead of SAS, which I'm not really convinced of... you're still at 4000 throughput, 4 times the SAS array.

Of course, an array built out of 3TB SAS drives would be somewhat faster than an array built out of 3TB SATA drives, but part of the trick is to figure out how to build affordable storage.
 

Pointeo13

Explorer
Joined
Apr 18, 2014
Messages
86
Wanted to give an update: I destroyed my RAIDZ2 for the VMware storage, added six more 300GB SAS drives (for a total of 24 drives), and made 12 vdevs, each a two-way mirror.

[screenshot: new mirrored pool layout]


I also installed a second LSI card, took my two Intel X25-E 64GB SSDs, and added them to the new pool as a mirrored log (SLOG/ZIL) device. Since I'm using iSCSI, I went ahead and ran zfs set sync=always on the pool. The reason I did this was that my storage controller's 10G NIC failed after two years, which caused a couple of domain controllers to corrupt; it wasn't a big deal, I restored from backups and everything was fine again (I do backups twice a day). I have purchased a second 10G switch and now have two 10G NICs installed in the storage controller, so I can go back to sync=disabled if I want.
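The shell equivalent of the log and sync changes is roughly this (a sketch; pool and device names are placeholders):

# Attach the two SSDs as a mirrored log (SLOG) device
zpool add tank log mirror da20 da21
# Honor sync write semantics for everything on the pool, including the iSCSI zvols
zfs set sync=always tank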

I’m also excited to say I have bought another storage array, the Supermicro SuperChassis CSE-847E16-R1K28LPB, which has 36 HDD bays. I did this because last week I officially filled the first 24-bay Supermicro chassis entirely with 300GB SAS drives and the second 24-bay chassis entirely with 3TB drives. Need more room to expand. :D
 

diehard

Contributor
Joined
Mar 21, 2013
Messages
162
The biggest thing that's probably hurting you is the VMware storage in RAIDZ2. This limits your number of IOPS to approximately that of a single drive.

Unlisted is the fullness and age of that VMware datastore - and/or the results of "zpool get fragmentation", all of which could give insight especially into write performance (you have lots of ARC to help boost reads).

autotune, especially on older systems, does not do a good job of maximizing ARC on large-memory systems; as you can see, arc_max is set at ~170G and could probably be 250G+.
This is the first I have heard of the "zpool get fragmentation" command.

What is an acceptable number? How much of a performance loss is 20% fragmentation... or 50%?
 