NFS Performance with VMWare - mega-bad?

Status
Not open for further replies.

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Ok well let's just chalk this up to matter of opinion.

Because it isn't a matter of opinion...?

There are lots of things you can do in life that are Completely Fscking Stupid. Failing to wear a seat belt because you don't believe they save lives isn't a "matter of opinion." It's ignoring the reality that if you hit an object at highway speeds, you may be propelled out over the hood and onto the pavement, impacting several things with lethal force along the way. It may be your opinion that seatbelts suck. It is a matter of statistics that seatbelts save lives. See the difference?

UNIX allows you to "rm -rf /". Why is the option there?

At this point, a decent SSD for ZIL is really not all that expensive, and someone investing in ESXi and a SAN rather than local datastores is selecting an expensive storage option. A hundred bucks more for a 40GB Intel 320 seems like a small price to pay to gain both performance AND data integrity assurance.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There he is. I knew he'd show up as soon as someone recommended sync=disabled. This whole ESXi NFS, iSCSI, sync writes thing is near and dear to his heart and he's tried to stop people from being stupid, but still watches the noobs do dumb things and ask why their zpool is unmountable weeks later.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, the flip side to this whole thing is, the reality of it all is that VMware has been obstinate about the issue and refuses to allow it to be configured/controlled by the virtualization host. This leaves administrators confused and desperate to "fix the slow storage." I am not unsympathetic to the issue.

Further, a whole generation has been taught that the way to "fix" this is to disable NFS sync, without an explanation as to why, or more specifically, why this is bad. And let's face it, the issue is a bit abstract, and the likely failure modes involve an interaction (vmdk disk blocks "written" but not committed) that might be unimportant to a VM or might be a total train wreck. VMware at least understands that they do not know the importance of the data and therefore they treat it all as important.

Which kind of sucks unless you design for it. So let's consider the non-FreeNAS angle.

On a local disk based datastore, M1015-in-IR mode with two Momentus XT's in RAID1, write performance ends up being about 94MBytes/sec sequential, read performance around 100MB/sec.

On a local SSD based datastore, M1015-in-IR mode with some various SATA3 SSD's in RAID1, write performance ends up being about 262MB/sec, read around 232MB/sec.

However, to achieve those speeds, I have to be writing 1MB chunks. Writing smaller 8KB chunks results in a fraction of the speed, because of the latency in sending things from the VM, through the vmfs layer, through the controller, and then actually getting the acknowledgement back from the datastore that it has been written. 17.7MB/sec for the disks.
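Those chunk-size numbers fall out of a simple latency model: when every chunk must be acknowledged before the next is sent, per-operation latency dominates small writes. A rough sketch (the 0.4 ms round trip and 100 MB/s raw bandwidth are assumed figures picked to land near the ballpark numbers above, not measurements):

```python
# Illustrative model: effective write throughput when each chunk must be
# acknowledged before the next one is sent (sync-write behaviour).
# Latency and bandwidth figures here are assumptions, not measurements.

def effective_throughput(chunk_bytes, round_trip_s, raw_bw_bytes_per_s):
    """Bytes/sec achieved when every chunk waits for an acknowledgement."""
    time_per_chunk = round_trip_s + chunk_bytes / raw_bw_bytes_per_s
    return chunk_bytes / time_per_chunk

for chunk in (8 * 1024, 1024 * 1024):
    mbps = effective_throughput(chunk, 0.4e-3, 100e6) / 1e6
    print(f"{chunk:>8}-byte chunks: ~{mbps:.1f} MB/s")
```

With those assumed figures, 8KB chunks land around 17 MB/s while 1MB chunks get most of the raw bandwidth, which is the same shape as the measurements above.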

Now, I can add a BBU write cache based RAID controller to the host, and while I don't have one handy to play with, I can tell you that the numbers for writes in general tend to be higher, and there's a much less noticeable decrease in speed based on chunk size, because the RAID controller is short-circuiting the sync writes. And this is still the heart of the matter. It is a problem even with locally attached storage, it is just not as significant an issue due to the low latency of local storage.

But local host storage is at least somewhat inconvenient. The ability to shuffle VM's around between hosts makes shared network storage attractive. And technologies such as FC can be expensive and challenging for a small IT department to deploy, and often kind of energy-hungry. NFS is common and readily available in a huge selection of footprints.

So here it becomes important to properly resource your NAS environment. Really, if you're going to spend money on network storage for your VM's, anywhere from many hundreds to several thousands of dollars most likely, refusing to install a ZIL is kind of like sticking 4GB of RAM in an i386 desktop box from 2005 with a Realtek 10/100 ethernet and then wondering WHY IS THIS SO SUCKY??!?

The latency inherent in going over the network for NFS is always going to make NFS (or iSCSI) less attractive than a well-designed local datastore. So even if you run your NFS on a md memory disk on 10GbE, you may not see earthshaking performance.

In the end? You either understand that there are overall limits to the technology or you don't. If you choose to use sync=disabled rather than getting a ZIL, you need to realize the risks both to your VM's and to your ZFS pool. If you want to be comfortable that your pool is safe but don't mind some risk to your VM's, use iSCSI with sync=standard. The ultimate solutions, at least under FreeNAS, involve actually implementing the technology to be able to quickly commit data. In the ZFS paradigm, that's a ZIL.
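For reference, the settings being argued about here are per-dataset ZFS properties. A command sketch (dataset, pool, and device names are placeholders; adjust for your layout):

```shell
# The per-dataset 'sync' property; 'tank/vmstore' is a placeholder name.
zfs get sync tank/vmstore          # show the current setting
zfs set sync=standard tank/vmstore # honor client sync requests (the default)
zfs set sync=always tank/vmstore   # treat every write as sync (e.g. for iSCSI)
zfs set sync=disabled tank/vmstore # lie to clients; fast, risky (see above)

# Attaching a dedicated SLOG device to the pool:
zpool add tank log ada3            # or: zpool add tank log mirror ada3 ada4
```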
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Hmm. You really made me start asking myself some questions. I do have a RAID controller with 4GB of on-card cache. I have always disabled the write cache because any time I have done any real-world test (scrubs, multiple access to my data, etc.) performance was often as little as 20% of maximum speeds. I'm wondering if cheating the sync writes out of their delay with a RAID controller with a lot of on-card cache, a BBU, and write caching enabled might help the writes. The big question would then be whether the help in writes will more than offset the loss in total zpool performance. I know many controllers won't use a write cache in JBOD mode, but my card in particular (an Areca 1280ML) will still use a write cache in JBOD mode if it is enabled.

Jgreco - If you can come up with a good test for this that doesn't involve actually creating NFS shares or iSCSI stuff, I'd be more than happy to test it with some of my spare hardware. If you can't think of a way without making an NFS or iSCSI share and want to do some of your own testing, I could probably put a box together you could SSH into remotely. PM me if you are interested.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The reason that BBU write cache works well when connected to an ESXi host is because most VM writes tend to be random and are fairly likely to be small; in such cases, a write cache is a win if a block is being updated repeatedly (updated-in-cache, single disk write) or if there are lots of small writes. NCQ is great and all but with only 31 outstanding commands, and 512 bytes per block, that could be as little as 15KB outstanding. A 1GB controller with 512MB allocated for write cache could potentially maintain a list of thousands of blocks to write, slowly sweeping back and forth updating the disk using locality rules and NCQ to do the dirty work. That last might be a bit wishful for "affordable" RAID controllers... but the point remains.
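The coalescing win described above can be shown with a toy model (an idealized write-back cache with room for the working set, not any particular controller's firmware):

```python
# Toy model: count physical disk writes for a stream of block updates,
# comparing write-through (every update hits disk) against an idealized
# write-back cache big enough to hold the working set.

def writes_write_through(block_addresses):
    """Every update becomes a physical disk write."""
    return len(block_addresses)

def writes_write_back(block_addresses):
    """Repeated updates to a block coalesce in cache; one flush per block."""
    return len(set(block_addresses))

# A small random-write workload that keeps touching the same hot blocks:
workload = [17, 42, 17, 99, 42, 17, 7, 99, 17]
print(writes_write_through(workload))  # 9 physical writes
print(writes_write_back(workload))     # 4 physical writes
```

The gap between the two counts is exactly the benefit the BBU cache buys on a typical VM workload; the next paragraph explains why ZFS rarely produces that pattern.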

The design of ZFS will tend to be at odds with that, though. ZFS implements its own write clustering strategy, and it is based on the CoW filesystem design. As such, it blasts out massive blocks of writes to each disk as part of a transaction group flush. In general, this is going to crush a controller with small amounts of write cache, because the cache is FIFO'ing data to the disk, never ever seeing a block that can be updated in-cache, so handling it through the cache is largely an exercise in futility. The last part of the flush may complete faster (put in cache but not yet written). For a sufficiently large cache, you might actually flush most or all of a txg into it. It still doesn't really take advantage of the write cache because the CoW design of ZFS basically ensures that two writes to the same file end up on different disk blocks. ZFS really does intend to be THE write cache, but it is a volatile write cache, which brings us back to the ZIL.

For what it is worth, on a default FreeNAS install with 32GB of RAM, the size of a transaction group starts out at 4GB - you can kind of think of that as volatile write cache and that's sort-of wrong, sort-of correct. You can get a feel about why ZFS can be so awesome as you throw RAM at it - it actually uses it! But that 4GB transaction group is a RAID card cache killer.
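For the curious, the txg-size tuning mentioned here was done through loader tunables in that era of FreeBSD/FreeNAS; the exact names varied between releases, so treat this as an assumed sketch and verify against your version:

```
# /boot/loader.conf tunables (FreeBSD 9-era names; check your release)
vfs.zfs.txg.timeout="5"                   # seconds between txg flushes
vfs.zfs.write_limit_override="268435456"  # cap txg size at 256MB (0 = auto)
```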

What would almost certainly help, assuming a sufficiently large hardware RAID write cache, is the sort of tuning I did in (oh hell he's mentioning it again) bug 1531. Reducing the maximum size of transaction groups would mean that a transaction group flush could be dumped out to the RAID controller and left in its write cache. You would still not get any win from multiple updates to a single block in the write cache because that behaviour doesn't really happen with ZFS, but the ability to delegate the job to the RAID controller and then immediately move on might be a win under some system loads.

Your thoughts suggest another comment; I know a lot of people around here are obsessed with their "maximum speeds". I'm typically more concerned with overall responsiveness, because a NFS server that can write at 125MB/sec is not useful if it is generating a storm of "nfs server foo: not responding" / "nfs server foo: alive again" on the clients because it is going catatonic. My guess is that all the ESXi users either do or should care about that as well... but optimizing the system for responsiveness involves some tradeoffs in max speed.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
The RAID controller with cache thing is interesting, because I've basically replaced my locally attached ESXi storage, which used to live on a hardware RAID 10 array, with a ZFS virtual SAN, and in a lot of ways the RAM of the virtual SAN VM has replaced the RAM of the RAID controller. That said, I'm not even going to remotely pretend that the FreeNAS or FreeBSD VM is half as stable as the firmware of a tier 1 hardware RAID controller. Even still, I've yet to kernel panic my FreeNAS VM, and it gets some very heavy loads at times.

Anyways, back to the sync=off thing: exactly what is the suggested way to make ZFS workable when you can't throw an SSD at the problem? While I use ZeusRAMs for my ZILs in my data center, I've got a dedicated host at a hosting provider that is running a very similar hardware/software setup to what I run in my own data center, so the setup is well tested. My problem is I can't buy a ZeusRAM (I can get some prosumer-grade SSD thrown in the box for some hefty monthly charge) and have it popped into that box, and I'd need some high-end performance out of the ZFS setup on that box. The good news is I use the ZFS volume to hold replicas of the production VMs on that box, so a complete loss of the volume won't be the end of my life; plus there is no activity on the volume most of the time, so the window of filesystem damage is while the replication is running. The replication process is why I need performance: the nightly replication is very I/O intense and I want to minimize the backup window (it's 5 hours as it is). I then use ZFS replication to send the VM replicas offsite, hence why I'm not just using another hardware RAID controller in the box. I hear you about iSCSI, but I like (and sometimes simply need) having file-level access to the contents of the ZFS volume (the ESXi command line isn't the most fun place to duplicate and move files around, plus there's the overhead of going through NFS or iSCSI). Thoughts, anyone?
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
A few observations:

I set up a new ZFS vol on my existing test/develop SAN and shared it via iSCSI to my ESXi server. At first it appeared to be the magic fix everyone talks about, until I noticed that it wasn't touching the ZIL device at all. So it was basically running as sync=disabled, hence its advantage over NFS. It also means that my ESXi VMs are exposed to the same danger of corrupted data if the SAN went down hard. I then set sync=always on the vol and found its performance to be very similar to my NFS share, and the ZIL device once again got plenty of traffic.
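Checking whether the log device is actually being touched is simple to do while a workload runs; a sketch ('tank' is a placeholder pool name):

```
# Watch per-vdev I/O, including any dedicated log device, every 10 seconds.
# With sync=standard iSCSI traffic, the log vdev may sit idle; with
# sync=always (or NFS sync writes) it should show steady write activity.
zpool iostat -v tank 10
```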

So ESXi & iSCSI is not some sort of magic fix. My next question, then: exactly where in the Oracle ZFS documentation does it say never to use sync=disabled in production? I've read the current ZFS documentation on the sync setting, and I interpret it as "don't turn this on if you don't know what the risks are and how to manage them" (in other words, if you don't know why a SAN must have a UPS attached and configured for automated shutdown in case of a power outage, stay away from this setting, and if your hardware/OS combo isn't rock solid, the same applies).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I set up a new ZFS vol on my existing test/develop SAN and shared it via iSCSI to my ESXi server. At first it appeared to be the magic fix everyone talks about, until I noticed that it wasn't touching the ZIL device at all. So it was basically running as sync=disabled, hence its advantage over NFS. It also means that my ESXi VMs are exposed to the same danger of corrupted data if the SAN went down hard. I then set sync=always on the vol and found its performance to be very similar to my NFS share, and the ZIL device once again got plenty of traffic.

The difference is that sync=disabled is potentially placing the integrity of the filesystem itself at risk, whereas sync=standard means that ZFS metadata updates are protected but updates to the iSCSI file extent contents are not protected. It is a matter of the scope of the risk.

I have to say, I'm sorry you didn't search the forum and find my post from last week about this, but in your defense I had trouble finding my post too, so don't feel too bad.

I would hope that "everyone" doesn't think of it as a "magic fix," because it clearly isn't. If you are doing something that requires sync writes, you have to eat the latency and find a way to commit to stable storage. Preferably quickly to get acceptable performance. That's still SLOG for ZFS.

So ESXi & iSCSI is not some sort of magic fix. My next question, then: exactly where in the Oracle ZFS documentation does it say never to use sync=disabled in production? I've read the current ZFS documentation on the sync setting, and I interpret it as "don't turn this on if you don't know what the risks are and how to manage them" (in other words, if you don't know why a SAN must have a UPS attached and configured for automated shutdown in case of a power outage, stay away from this setting, and if your hardware/OS combo isn't rock solid, the same applies).

Oracle's purchase of Sun has done some damage to legacy Sun documentation and other information. I don't know offhand. The guy who implemented this on Solaris talks about it here, which suggests that on Solaris, at least, ZFS on-disk consistency is not (supposed to be) affected. However, this may not be the case with FreeBSD's version, and either way, it is worth pondering the fact that this mode of operation is violating some of the design principles of ZFS, which means that there could be filesystem-eating bugs that result, even on Solaris. I do recall seeing more specific debate of the issue at some point.

Hey, I took my evil pills this morning, and I just had a bright idea.

I'm wondering if maybe SSD is not the ideal SLOG device. What if... and yes this is a bit crazy... you were to put SLOG on a BBU JBOD hard disk? You get the extreme low latency of commits being stuffed into the RAID controller BBU write cache. You get the nearly unlimited write endurance of hard disk. And, since the cost of hard disks is much lower than SSD, you can do mirrored hard disks at a much more attractive price point... damn, now I've gotta go try that.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I'm wondering if maybe SSD is not the ideal SLOG device. What if... and yes this is a bit crazy... you were to put SLOG on a BBU JBOD hard disk? You get the extreme low latency of commits being stuffed into the RAID controller BBU write cache. You get the nearly unlimited write endurance of hard disk. And, since the cost of hard disks is much lower than SSD, you can do mirrored hard disks at a much more attractive price point... damn, now I've gotta go try that.
Should work well enough if you're not blowing through the cache. As long as you have a sufficient amount I don't see why NVRAM wouldn't be a win. Consider using LUNs or low latency disk managed by a controller with persistent memory for the ZFS intent log, if available.

How much cheaper would it be? Controller with NVRAM, hard disks, and power for them vs. SSDs.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
The difference is that sync=disabled is potentially placing the integrity of the filesystem itself at risk, whereas sync=standard means that ZFS metadata updates are protected but updates to the iSCSI file extent contents are not protected. It is a matter of the scope of the risk.

I have to say, I'm sorry you didn't search the forum and find my post from last week about this, but in your defense I had trouble finding my post too, so don't feel too bad.

Great post on NFS vs iSCSI. You really hit the nail on the head with it all. For my usage, since the zvol is dedicated to NFS & ESXi, doing anything async could lead to data loss. I hear you on the ZFS metadata; it's just that if I'm going to suffer corruption of any kind, having the ZFS filesystem & metadata intact doesn't buy me much: I've still got hosed data, and that requires a restore of some sort. I'm toying with the idea of doing snapshots every 15 minutes, because even with sync=disabled I should be able to just roll back to a prior snapshot and start everything back up. Am I missing something with that idea?
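The rolling-snapshot idea sketches out as a simple cron entry (dataset name and naming scheme are placeholders; FreeNAS can also do this via its periodic snapshot tasks):

```
# Snapshot the VM dataset every 15 minutes (crontab entry);
# 'tank/vmstore' is a placeholder dataset name.
*/15 * * * * /sbin/zfs snapshot tank/vmstore@auto-$(date +\%Y\%m\%d-\%H\%M)

# Rolling back to a snapshot after a crash discards everything written
# after it, e.g.:
# zfs rollback -r tank/vmstore@auto-20121115-0330
```

One caveat worth noting: such snapshots capture the vmdk contents mid-flight, so a restored VM is at best crash-consistent, as if its power had been pulled at snapshot time.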

I'm wondering if maybe SSD is not the ideal SLOG device. What if... and yes this is a bit crazy... you were to put SLOG on a BBU JBOD hard disk? You get the extreme low latency of commits being stuffed into the RAID controller BBU write cache. You get the nearly unlimited write endurance of hard disk. And, since the cost of hard disks is much lower than SSD, you can do mirrored hard disks at a much more attractive price point... damn, now I've gotta go try that.

That's a very interesting idea; I'm going to slip a 15k SAS drive into the drive array and see how it fares.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You realize you could do the same thing with an Acard ANS-9010. The 9010 lets you split the device into 2 SATA devices if you want to set up a RAID0 for increased throughput. It has its own battery and backs up to a CF card for extended power outages. I have 4 of them and I love them. The cost per GB is very high compared to SSD, but they do have the advantage of unlimited write cycles and essentially no latency.

I bought mine before SSDs became popular. It did cost me almost $1000 to put 32GB in 1 box, but after I booted Windows from it once, I knew that as soon as solid-state media became usable and low latency, rotating media was going to have a serious problem long term as boot devices. If it weren't for the fact that my Norco 4224 case doesn't have a 5.25" bay to install one of these, I would probably have one installed right now for experimenting.

Edit: I also have a Gigabyte GC-RAMDISK. It does the same thing, but uses DDR1 RAM, is limited to only 4GB, and is SATA 1. Of course, for an SLOG, 4GB would be plenty.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

Ok, no points for originality then.

How much cheaper would it be? Controller with NVRAM, hard disks, and power for them vs. SSDs.

Stuffed a PO for a BBU for this X9DR7 board this morning.... so just having thought this all through, my answer:

Complex. So an LSI MegaRAID SAS 9265-8i is ~$700. The BBU is another ~$200, so figure $900.

But pretty much only an idiot would buy that. Why not just buy an entire SERVER BOARD for $715 and get a totally awesome Xeon E5 board that can take hundreds of gigs of RAM and sports two 10GbE ports ... while also having the LSI SAS2208 built in? Then you add the $200 battery onto that. And really, if you're obsessing about the need for low latency ZIL, then you are probably also in dire need of ARC and L2ARC for reads, so you really needed an E5 Xeon board anyways, and there's your totally perfect server board (I think/hope).

The thing is, all of the SLOG devices I've seen that are "proper" (i.e. supercap, etc) for ZIL are very pricey, and still tend to have the write endurance issues. I would expect that ZIL writes are generally sequential, so all you should need would be a vaguely modern drive. Your write cache can handle bursts. Even yesterday's drives could write at 100MB/sec. That's probably quite optimistic on my part though ;-) But a hard drive doesn't suffer write speed degradation the way an SSD does, and the endurance of a HDD is very high.

Anyways, I've got a system in the shop that I can play with for awhile, and a BBU is on its way for it. I am very curious to see which way performs better. Think we might even have some enterprise SAS 15K HDD's hanging around.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Read my post above yours.. mine was posted at the same minute as yours, so you likely didn't read it.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I've not had much luck with traditional SSDs (not that I've tested many, but the rather expensive model I bought sucks for throughput on small block sizes): sure, it can do almost 500MB/s on 128k blocks, but try 4k blocks and it has trouble breaking 50MB/s, which is less than what the zpool can do on a sync write. I suspect I might have some underlying issue, because I'm getting rather bad performance back on some of my 15k SAS drives in the same box. If you have the money, a ZeusRAM is the ultimate ZIL device: it is a SAS drive that uses RAM and has a supercapacitor that writes the RAM out to flash whenever power is lost, so you have no latency and endless writes, but the safety of flash, and no worrying about whether the battery pack has gone bad or will hold until power is restored.

Anyways, the interest to me in the RAID card thing is that I've got a box I can't put a decent SSD into (it's a rented server half way across the country), but it has a pair of RAID cards (one being passed through to the VM running FreeNAS, with the disks all as JBODs for the zpool). So if I can get a decent ZIL using a RAID stripe of some 15k SAS drives, that will solve the issue of having sync=disabled on that box.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, if you know of any benchmarks I can perform to test the 4k write size let me know. I'd be happy to run them for you on my devices. :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
iozone is probably the most easily understood test... can do lots of interesting things with it.
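A sketch of the kind of invocation that exercises small sync writes (the file path and sizes are placeholders; check `iozone -h` for the flags your build supports):

```
# Sequential write/rewrite (-i 0) plus random read/write (-i 2) at a 4k
# record size (-r), 256MB file (-s), with O_SYNC on writes (-o).
# /mnt/tank/iozone.tmp is a placeholder path on the pool under test.
iozone -i 0 -i 2 -r 4k -s 256m -o -f /mnt/tank/iozone.tmp
```

The -o flag is the interesting one here, since it forces the sync-write path this whole thread is about; drop it to compare against async behavior.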
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
And I just edited your post with my comments.. damn I must be sick! This cold or whatever is kicking my butt!

Well, give me the commands you want run. I have one installed in Windows 7 right now, but if you know the commands for FreeNAS I suppose I can accommodate you there too.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Don't worry, it's been a bad day all around. I've yet to figure out whether I love or hate FreeBSD 9. They did away with sysinstall, replacing it with bsdinstall, which is ... well, at least for our purposes here, a totally pointless PoS. On the other hand, the other parts of our system building tools are all anywhere from 10 to 20 years old, with legacy cruft dating back to (at least?) FreeBSD 2, meaning they're convoluted and twisty. But looking at how deployment has changed, really... we no longer need to support half a dozen installation methods, such as boot-floppy-then-download-via-internet. And we don't really need to worry about installing on systems with limited disk, because even the smallest physical disk is sufficient, and for VM images, we can probably afford a larger base image without too much trouble. So it's been a great time to actually get rid of 20 years of MBR and traditional BSD disklabel cruft and go all GPT/geom, and all sorts of other fun design issues. A single script that takes a blank disk and puts a usable, localized version of FreeBSD on it...
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
Well, if you know of any benchmarks I can perform to test the 4k write size let me know. I'd be happy to run them for you on my devices. :)

What I'm doing for extremely basic testing is just a simple brute-force test using "dd if=/dev/zero of=/dev/daXX bs=4k count=20000". If it goes well I'll bump count up to 2000000. I usually first run it against the raw drive I'm looking to use as the ZIL device. Then I add the ZIL and try it as a file on a zpool (while watching zpool iostat -v 10 to see the raw performance, since at that point you've potentially got ZFS caching involved). Finally, I use a *nix VM on my ESXi box, add a 2nd HD to the VM, and try dd again against the raw device; run it twice if using a thin-provisioned drive, to eliminate the overhead of the initial allocation of disk space.

I'm currently scratching my head trying to use the 15k SAS drives to prototype a RAID card/ZIL setup. Doing diskinfo -t I get an inside (the worst case) transfer rate of 75MB/s, but doing my usual dd test I'm barely able to break 1MB/s, and it's a complete no-go as a ZIL device (I thought since diskinfo showed good performance, maybe the ZIL would work as well). On my SSD drives I get like 40 to 50MB/s with the above test (on the ZeusRAM I get 313MB/s from diskinfo and above 100MB/s from my VMs with sync=standard). I think it's a drive firmware vs. controller vs. FreeBSD driver issue; other folks seem to have hit the problem with no resolution. I've tried different drives (including a different manufacturer) and different bays, and I can't break 1MB/s. At the same time, this is all inside an enclosure that has 2 working zpools with great performance, so I know I've got a working config otherwise.
 