NFS Performance with VMware - mega-bad?


MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
Hi all

First of all, yes, I have read the various threads about NFS performance with VMware, and I understand that it is all related to the sync writes requested by VMware's NFS client.
What I don't get, however, is why the performance is THAT bad...

I have the following equipment:

- FreeNAS-8.3.1-RELEASE-x64 (r13452)
- Supermicro Server with Intel Xeon E5620 @ 2.4 GHz
- 32 GB Memory
- 24x INTEL SSD 256 GB in 3x raidz1 @ 8 disks
- 1x Gigabit connection
- No ZIL
- No L2ARC

VMware's I/O Analyzer, running the max-IOPS configuration, now tells me that my NFS datastore has a write speed of 86 IOPS (!)

What could be the cause of that?

I have created an iSCSI datastore, and I was able to squeeze 28600 write IOPS out of the system...
Any ideas?

I would expect at the very least 2,000 IOPS out of an all-SSD system, even with an all-sync NFS-on-ZFS configuration?

Thanks
Michel
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
When a sync write is issued, it is normally flushed from the OS straight to the hard drive. In the case of a VM, you are flushing to a virtual disk; that is picked up by ESXi, translated to the applicable physical disk(s), and then flushed to disk. After the data is flushed, the disk reports that the write is complete, ESXi picks that up and maps it back to the virtual-disk writes that were flushed, the flush then appears complete to the VM, and that completion is finally passed on to FreeNAS.

You've added a boatload of additional hoops to jump through, and some of those hoops are very latency-intensive (especially if your CPU is heavily taxed at that moment).

So yeah, performance tanks something horrible, and the only fix is to throw even faster (and naturally more expensive) hardware at it. You didn't say how your whole configuration is set up, but performance doesn't tank as badly if you are using VT-d technology and passing a PCIe SATA controller straight through. In fact, using VT-d is the only method the forum VMware ninjas recommend, and for multiple reasons: mostly for the reliability of your data, but also for performance. In some cases, if you mention that you aren't using VT-d while asking for help recovering data, they won't even give you the time of day, because they're through trying to educate people not to do dumb things.

You can disable sync writes in FreeNAS so that all writes are "non-sync", but you'd better be doing religious nightly backups of your data, because the manuals warn that doing it for any reason except testing is beyond stupid.
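For reference, that toggle is just a ZFS dataset property; a minimal sketch from the shell, with a placeholder dataset name:

Code:
# disable sync writes on the dataset shared to ESXi (testing only!)
zfs set sync=disabled tank/vmware-nfs
# put it back to the default behaviour afterwards
zfs set sync=standard tank/vmware-nfs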
 

MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
Thanks for your answer.
I have no intention of disabling sync; I just want to know why it is so damn slow. I mean, 86 IOPS for 24 SSDs? Come on! My iPhone gets more IOPS.

Maybe I was not clear enough about my setup. This server is storage only and is installed on BARE METAL, so no VMware is involved there. I have another two servers running VMware which connect to the FreeNAS server over gigabit.

So

Code:
ESX1 ----+
         +---- SWITCH ---- FREENAS
ESX2 ----+



And of course we are doing nightly backups! :)

Thanks
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ah. Then the hoops are slightly different than I described above, but you do have more hoops to jump through regardless. I'm not sure what I'd recommend for you. You're trying to do things right, but you're also thinking big (which is rare). You'd probably have to start looking for where the biggest limitation is and how to fix it (and I have no idea how you'd start trying to find that limitation).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Interesting but touchy tool.

I tested it on a locally attached RAID1 SSD datastore and got

Max_IOPS: 100MB/sec, 63500 IOPS.
Max_Write_IOPS: 18MB/sec, 37998 IOPS.

To a FreeNAS HP N36L, 16GB, sync=disabled, RAIDZ2 on 4 slowish disks,

Max_IOPS: 14MB/sec, 28836 IOPS.
Max_Write_IOPS: 10MB/sec, 20585 IOPS.

To a FreeNAS Supermicro E3-1230, 8GB, sync=standard, RAIDZ2 on 4 fastish disks,

Max_IOPS: 18MB/sec, 37837 IOPS.
Max_Write_IOPS: 0.03MB/sec, 60 IOPS.

Perhaps you could explain the formula you used to derive your expectation of 2000 IOPS. From a strictly compsci point of view, that would seem to be a number randomly plucked out of thin air; an inspection of the design of NFS on top of ZFS suggests that the multiple layers of latency, caching, aggregation, RAIDZ striping, and other things involved in a sync write are all IOPS-decimating events. I know it seems like SSD should be the cure-all, but there are significant differences between how a ZFS pool of RAIDZ'd disks works and how a single SLOG device works. A RAIDZ pool write is a very expensive/intensive thing, and basically that entire operation has to be committed to disk for each sync write that is confirmed to ESXi. Ironically, the large number of SSD devices may actually be pulling your IOPS number down. But if that answer feels rotten to you, I can at least sympathize with you. :-/

ESXi has been known for years to have sync write issues with NFS; the solution from many vendors is to force async. A relevant FreeBSD discussion is available.

From my perspective, a SLOG device is advisable for NFS and ESXi use; it is the way the system is intended to work. Barring that, iSCSI with some tuning. We go downhill from there into various async hacks, which are known to be various shades of evil.
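For what it's worth, adding a dedicated log device to an existing pool is a one-liner; the pool and partition names below are just placeholders:

Code:
# attach a fast, power-loss-protected device as a dedicated ZIL (SLOG)
zpool add tank log gpt/slog0
# it then shows up under a separate "logs" section here
zpool status tank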
 

MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
The formula is 10% of one SSD's IOPS capability (20k+). This is what I would expect to see as an absolute minimum, even with all the overhead of NFS, ZFS, sync writes and whatnot.
But I see 83 IOPS, which is under 0.415% of a single SSD's capability; this is just ridiculous.

I will try iSCSI with sync=always next, and NFS with sync=disabled.
I can also try adding a SLOG device to see if it makes a difference.
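(For reference, the iSCSI side of that test is just another dataset property flip; the zvol name below is only an example:)

Code:
# force every write to the zvol backing the iSCSI extent to be synchronous
zfs set sync=always tank/iscsi-extent
# confirm the current setting
zfs get sync tank/iscsi-extent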
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Think about it like this: if each op is a sync write and only one can be in flight at a time, every millisecond of round-trip latency caps you at roughly 1000 IOPS before you count anything else, and a few milliseconds of cumulative latency across the whole path puts you down in the low hundreds. ZFS adds its own latency on top of that because of the need to calculate checksums. If you switched to a mirrored zpool you'd probably see a pretty big increase in performance.
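To put rough numbers on that ceiling (back-of-the-envelope arithmetic only, nothing FreeNAS-specific):

Code:
# max IOPS with one outstanding sync write = 1,000,000 / (round-trip latency in microseconds)
echo "1000000 / 1000"  | bc    #  1 ms per write -> ~1000 IOPS ceiling
echo "1000000 / 10000" | bc    # 10 ms per write ->  ~100 IOPS ceiling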
 

MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
The network does not have latency (iscsi has 40k IOPS read and 28k IOPS write, same network)
 

pekopeter

Cadet
Joined
Apr 7, 2013
Messages
1
Think about it like this: if each op is a sync write and only one can be in flight at a time, every millisecond of round-trip latency caps you at roughly 1000 IOPS before you count anything else, and a few milliseconds of cumulative latency across the whole path puts you down in the low hundreds. ZFS adds its own latency on top of that because of the need to calculate checksums. If you switched to a mirrored zpool you'd probably see a pretty big increase in performance.

cyberjock, I think MichelZ told us that he was also trying iSCSI on the same ZFS pool, with 20k write IOPS.

So neither the disks nor the network can be the issue here... The question comes back to: what does iSCSI do differently from NFS to gain way more write IOPS on the same zpool?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The network does not have latency (iscsi has 40k IOPS read and 28k IOPS write, same network)

All networks have latency. Just the fact that data has to be processed by the NIC, sent down network cables through any intervening switches until it gets to the destination, and then processed by the receiving NIC adds some latency. It might be a fraction of a millisecond, but that adds up when speed matters. It's not instantaneous, or the first CPU ever made would have been infinitely fast. :P

cyberjock, I think MichelZ told us that he was also trying iSCSI on the same ZFS pool, with 20k write IOPS.

So neither the disks nor the network can be the issue here... The question comes back to: what does iSCSI do differently from NFS to gain way more write IOPS on the same zpool?

I think what he said was that 1 SSD was capable of doing 20k IOPS.

If you want to know what iSCSI does differently from NFS you'd need to read up on the protocols. The issue is partly the protocol itself as well as the cumulative impact of many smaller effects that most of us in IT ignore in normal day-to-day operations. This has been discussed over and over and over on the forums, and is why the general rule for iSCSI is not to put it on ZFS. It has also been discussed over and over that VMware's NFS writes are pure suckage for performance. This is solely because VMware doesn't want to be blamed for any lost data, so their solution is to make every single write a sync write. It's cheap, it's dirty, but it's a simple way to ensure maximum data reliability.

As I tried to explain above, if a single NFS write of any size (even 1 byte) is made, VMware will initiate a sync write (and no other writes will take place until the VMware host gets a response back that the write is complete). So you have to include the latency of all the hardware and software hoops that are jumped through to make that write happen, plus reporting back to the VMware host that the write is complete. Only then will another write even begin from that machine. And don't even ask what happens if 2 machines make write requests at the same time. So yes, every single ms/µs matters. Add it all up, and a few milliseconds of latency across the entire path leaves you with an amazingly high result of... a hundred-odd IOPS. Poof, instant and guaranteed poor performance from even the fastest hardware and the lowest of network latencies. How do you fix it? Go and start removing every µs of latency you can from everywhere you can. This is where system admins get paid BIG dollars to find and remove these bottlenecks.

The real catch to this situation is that I don't know of anyone who peruses these forums with the experience and know-how to even find these bottlenecks, let alone fix them. And even if they can be "fixed", it might not increase performance enough to make the OP happy. There is inherent latency in having to move data around. This is why Intel makes big dollars selling faster CPUs every year. They seek out and remove nanosecond/picosecond bottlenecks everywhere they can and hope that they all add up. And guess what, they do!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The network does not have latency

That's just totally delusional. Every network has latency. Every network stack has latency. Every bus has latency. There is latency at almost every step of everything we do with computers. The trick is to understand it, not to claim it doesn't exist.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That's just totally delusional. Every network has latency. Every network stack has latency. Every bus has latency. There is latency at almost every step of everything we do with computers. The trick is to understand it, not to claim it doesn't exist.

And with that.. I will bow out of this conversation.

Good luck with your endeavors OP. If you do fix your system you should post back your solution for others that may run into your issue someday.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I really gotta stop composing messages while doing other things; looks like you posted a similar-but-more-comprehensive answer in the meantime. Bad habits to swear off by 2020: opening half a dozen threads in half a dozen tabs.
 

MichelZ

Dabbler
Joined
Apr 6, 2013
Messages
19
Latency is certainly a factor, but SSDs have latency of around 100 µs or less (HDDs: 6-12 ms), the network has <1 ms of latency, and all other components should be even lower than that, AFAIK.

I will try out some stuff...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You haven't considered the latency and various other limits in the operating system, especially in the ZFS write path. You don't just get a packet in from the ethernet and have it magically pop out a SATA port. This is a hard lesson for people to wrap their heads around, especially in this age of almost infinitely fast machines... In any case, ZFS is built to work a certain way and to offer certain features. Part of that involves queuing up writes into transaction groups in order to write contiguous blocks out to disk more efficiently, etc. That process is defeated by NFS sync, because for each NFS transaction ZFS has to confirm the write to the pool, which means ZFS has to write it to the pool, which means ZFS has to write the stripe - and this goes on for EACH NFS transaction. Your SSD pool does not deliver "under or around 100 µs latency" when it has to write a stripe over and over and over as each NFS transaction comes in. You're simply wishing the latency didn't exist, when in fact it does. Asking for sync writes in ZFS is *very* *expensive*, as it is basically fighting the design of the pool and transaction group system. ZFS addresses this through the ZIL mechanism implemented on a separate device.

Really, I have no desire to keep going on this topic, so I'm going to leave you with a homework assignment that will either straighten you out or it won't. There are loads of resources out on the 'net that discuss ZFS. None of this is new or surprising.
 

pdanders

Dabbler
Joined
Apr 9, 2013
Messages
17
Is there a reason you are trying to use NFS instead of iSCSI? iSCSI is significantly better for VMware datastores, for a variety of reasons.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
VMware & NFS on ZFS:

There are 2 ways to get decent performance with this setup.

1. Set sync=disabled on the ZFS volume you are sharing to the ESXi servers. This has some risk to it and should only be considered if you have UPS backup and automated shutdown for a power failure. The risk comes in if FreeNAS crashes on you (be it bad hardware or a software bug): you can then have really bad problems with the VMs that were being stored on the volume, though in reality it's probably only one degree worse than a physical server running that OS having the power plug pulled on it (which can be really bad if it happens at the wrong time).

2. Add something like a STEC ZeusRAM as a dedicated ZIL (SLOG) device for the volume. This will give you almost as much performance as the first option. They are expensive, but they are what is needed to make a ZFS setup ready for heavy-duty enterprise usage; don't be fooled into using a standard SSD drive, they just don't compare to one of these.

Also, some advice on how to benchmark ESXi against its storage connection: set up a testing VM of your favorite *nix OS and then add a 2nd hard drive that you place on the shared volume (it doesn't matter where the rest of the VM lives). Don't even bother to format or partition this 2nd drive. Then run some tests using dd, with something like "dd if=/dev/zero of=/dev/sdb bs=4k count=40000", substituting whatever your 2nd HD showed up as for /dev/sdb. This takes any kind of filesystem/caching out of the picture, since you're hitting the raw 2nd drive, and you eliminate a lot of variables. I find it best to simply watch the output from "zpool iostat -v 10" on the FreeNAS box to see the actual performance and also how much the ZIL/cache device is getting used.
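Spelled out, assuming a Linux guest where the extra disk shows up as /dev/sdb (adjust the device name to whatever you actually see):

Code:
# inside the test VM: raw 4 KB writes straight to the unformatted 2nd disk
dd if=/dev/zero of=/dev/sdb bs=4k count=40000
# (with GNU dd you can add oflag=dsync to force a flush per block from the guest side too)

# meanwhile, on the FreeNAS box: watch per-vdev activity, including the log device
zpool iostat -v 10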
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
1. Set sync=disabled on the ZFS volume you are sharing to the ESXi servers. This has some risk to it and should only be considered if you have UPS backup and automated shutdown for a power failure. The risk comes in if FreeNAS crashes on you (be it bad hardware or a software bug): you can then have really bad problems with the VMs that were being stored on the volume, though in reality it's probably only one degree worse than a physical server running that OS having the power plug pulled on it (which can be really bad if it happens at the wrong time).

Wrong. "Really bad problems" can range from nothing wrong at all to the zpool being unmountable and you losing all data in the zpool forever (hope you have backups).

There's a reason why all the documentation says sync=disabled should be used for testing only and never on a production system. Disabling sync removes several of the ZFS mechanisms that ensure your data is safe. Honestly, if someone posted a thread saying they had sync=disabled and their zpool was now unmountable, I wouldn't even waste my time responding. It's so far beyond irresponsible to set sync=disabled that I wouldn't even waste my time telling them I was sorry. THAT'S how stupid it is.

There is no doubt people will eventually try disabling sync because it's been mentioned (and is even being recommended as a solution now!), find that the performance gains are amazing, and keep it. I'm sure a few people will even post back that "I've been using it for weeks/months and had no problems". But then, when they lose their data, they will be sorely disappointed at how nearly impossible it is to get it back, and since so few people seem to keep religious working backups, they really will have lost everything forever.

I can't help but /facepalm at the post.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
OK, well, let's just chalk this up to a matter of opinion.

But in real life, if you've got to run ESXi against a ZFS-based SAN using NFS and you can't afford a decent ZIL device, you don't have many options if you need decent performance. sync=disabled has the same risk as forcing NFS to do async operations for ESXi to get decent performance.

Also, if it's so stupid, then why is the option there? I'm not disagreeing that it's a bad idea; it's just that when you don't have the $s, it's about the best option to make ESXi run half decently, which I thought was the point of this thread.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
But in real life, if you've got to run ESXi against a ZFS-based SAN using NFS and you can't afford a decent ZIL device, you don't have many options if you need decent performance. sync=disabled has the same risk as forcing NFS to do async operations for ESXi to get decent performance.

Do you know what Sun (and now Oracle) recommends if NFS can't perform well on ZFS? Using UFS. :O They don't even discuss VMware ESXi specifically, but they do discuss issues with sync writes and some ways you can tweak ZFS to serve your exact needs.

I'd make the exact same argument any day of the week. It's ludicrous for some lowly user like you or me to ignore the official Sun (and Oracle) documentation that says "FOR TESTING ONLY" and puts it in all caps. Some of their documentation that was in color even put those words in red. Do you really think you know more than the guys who invented ZFS? I know I don't, nor would I even try to argue why they are wrong. I'd trust anything they say over anything anyone else would say, any day. I'd also consider anyone who chooses to deliberately ignore warnings that are so prominently discussed all over the internet, as well as in the technical manuals, to be somewhat crazy and to have very poor decision-making skills.

Oracle/Sun also recommends configurations where you save to UFS but mirror it to a zpool elsewhere at regular intervals, and other extravagant designs to achieve a high level of data reliability. There's so much to ZFS that you could probably get a bachelor's degree in ZFS and still not know everything. Why do you think Oracle (and Sun) were able to charge such outrageous prices for their hardware and software support? Because if you wanted complete faith in your data being safe, you paid it. And many companies would happily pay it, because it was far cheaper than trying to reproduce the data later. There is also a reason why Oracle has closed the source for all ZFS pool versions after 28. They want more money, and making ZFS available to the masses isn't good for their business model. I'm sure they aren't thrilled that ZFS is becoming far more "user friendly" with FreeNAS. Before things like FreeNAS, the barrier to entry was quite high: you needed a deep knowledge of ZFS to use it properly, or had to be willing to fork over big bucks for support from Oracle/Sun. Naturally, they were hoping for the latter.

Also, if it's so stupid, then why is the option there? I'm not disagreeing that it's a bad idea; it's just that when you don't have the $s, it's about the best option to make ESXi run half decently, which I thought was the point of this thread.

Sun documentation says that it was there for troubleshooting. Some issues require you to chase down latency problems that you may not have much control over, so you need the option of taking drastic (and often dangerous) steps to identify the issue so you can correct it. Someone posted somewhere that back when the option was created there was a theory that RAM might eventually be non-volatile, and that this function might be useful in that future. Remember, ZFS was designed to become more beneficial as drives get bigger and bigger (because size is increasing faster than reliability). Since I don't see drives getting smaller anytime soon, it's pretty safe to say that ZFS is becoming more and more useful each day and probably will until something better comes around (BTRFS?).

ZFS was built for business use. Yes, it's trickling down to home users. And even for home use, there are some lines that should still never be crossed. People go short on RAM despite how many people have lost data because of kernel panics related to insufficient RAM, don't use ECC (which Sun and Oracle both list as a system requirement for ZFS), and generally do things that don't make sense. But one line that I'd never, ever cross is disabling sync.
 