Low NFS write throughput, desperate for help

Status
Not open for further replies.

mosquitou

Dabbler
Joined
Mar 4, 2014
Messages
11
Hi all, I'm stuck with my FreeNAS performance tuning and I'd like to consult your expertise. I'm focusing on write throughput (although read throughput is not great either), which currently sits at 20MB/s when tested from ESXi. From reading posts in this forum, I have a rough idea of what to try:
(1) force NFS async mode (to deal with the ESXi NFS implementation)
- zfs set sync=disabled
- vfs.nfsd.async=1
(2) enable autotune
(3) have enough RAM and proper NIC
(4) stay away from RAID5/RAIDZ
(5) setup ZIL with SSD (I don't have this option)
(6) setup L2ARC with SSD (I don't have this option)
(7) disable atime, dedup, compression (the exact commands for (1) and (7) are shown below)
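Concretely, (1) and (7) map onto commands roughly like these on my pool (RAID6ZFS1; the sysctl is added via the FreeNAS Sysctls tab rather than run by hand):

zfs set sync=disabled RAID6ZFS1
sysctl vfs.nfsd.async=1
zfs set atime=off RAID6ZFS1
zfs set dedup=off RAID6ZFS1
zfs set compression=off RAID6ZFS1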

Yet my write throughput is still far from what I expect. Could anyone point me in a direction? If more info is needed, please don't hesitate to ask.

My hardware spec:
FreeNAS server:
- HP DL380 G8
- Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
- BCM5719 4-port Gigabit Ethernet
- 64GB DDR3 ECC RAM
- FreeNAS 9.2.0-RELEASE booted from a flash drive (diskless)
DAS:
- EonStor R2240 + EonStor J2000R extension (JBOD)
- RAID controller: EonStor R2240 with 1GB cache, equipped with 3x 300GB 15k rpm SAS disks (I cannot disable the RAID controller function, so I let it manage the disks and expose LUNs to FreeNAS; in any case, disk IO is not the issue)
- JBOD: EonStor J2000R equipped with 10x 3TB 7200rpm SATA disks

My configuration is shown in the following diagram.
[arch.png — architecture diagram]

My FreeNAS is release 9.2.0, using default settings with autotune enabled.
Sysctl:
kern.ipc.maxsockbuf = 2097152 (Generated by autotune)
net.inet.tcp.recvbuf_max = 2097152 (Generated by autotune)
net.inet.tcp.sendbuf_max = 2097152 (Generated by autotune)
vfs.nfsd.async=1 (added by me)
Tunables:
vfs.zfs.arc_max = 47696026428 (Generated by autotune)
vm.kmem_size = 52995584921 (Generated by autotune)
vm.kmem_size_max = 66244481152 (Generated by autotune)

No L2ARC, no ZIL.

I have two zpools:
  • RAID1ZFS1: not used in this test; built from 4x 3TB 7.2k SATA disks
  • RAID6ZFS1: the RAID controller is configured as RAID6 across 6x 3TB 7.2k SATA disks and exposes a single LUN to FreeNAS, on which I created a ZFS volume (stripe)
In the ZFS Volumes view it looks like this:

[volumns.jpeg — ZFS Volumes screenshot]


I set sync=disabled for RAID6ZFS1:

[root@freenas ~]# zfs get sync
NAME              PROPERTY  VALUE     SOURCE
RAID1ZFS1         sync      standard  local
RAID1ZFS1/test    sync      standard  inherited from RAID1ZFS1
RAID6ZFS1         sync      disabled  local
RAID6ZFS1/ORCHID  sync      disabled  local

LACP Link Aggregation works fine.

Network Test
An iperf -d test with default settings gives the following result:
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 7] local 10.77.24.16 port 5001 connected with 10.77.24.17 port 54549
------------------------------------------------------------
Client connecting to 10.77.24.17, TCP port 5001
TCP window size: 96.5 KByte (default)
------------------------------------------------------------
[ 9] local 10.77.24.16 port 23951 connected with 10.77.24.17 port 5001
Waiting for server threads to complete. Interrupt again to force quit.
[ ID] Interval Transfer Bandwidth
[ 7] 0.0-10.0 sec 792 MBytes 663 Mbits/sec
[ 9] 0.0-10.1 sec 679 MBytes 566 Mbits/sec

Local Disk IO Test
A dd test performed locally on FreeNAS yields the following result:

[root@freenas /mnt/RAID6ZFS1/ORCHID]# dd if=/dev/zero of=testfile bs=128k count=50k
51200+0 records in
51200+0 records out

6710886400 bytes transferred in 13.790086 secs (486645724 bytes/sec)

An iozone test yields the following result:

[root@freenas ~]# iozone -R -l 2 -u 2 -r 128k -s 1000m -F /mnt/RAID6ZFS1/ORCHID/f1 /mnt/RAID6ZFS1/ORCHID/f2

"Throughput report Y-axis is type of test X-axis is number of processes"
"Record size = 128 Kbytes "
"Output is in Kbytes/sec"

" Initial write " 6145867.75
" Rewrite " 2953791.75
" Read " 6836807.50
" Re-read " 8055269.75
" Reverse Read " 5743218.00
" Stride read " 6824206.75
" Random read " 8265411.25
" Mixed workload " 5935495.25
" Random write " 4638376.50
" Pwrite " 4055769.38
" Pread " 7961281.00
" Fwrite " 2854679.12
" Fread " 6021551.75

From the above, I assume the network and the disk IO each work fine separately. Next I performed the same dd test over an NFS mount from CentOS-VM and from CentOS-BareMetal, and here comes my problem:

NFS test from CentOS-VM:
[root@localhost etc]# mount -t nfs -o tcp,async 10.77.24.16:/mnt/RAID6ZFS1/ORCHID /mnt/ORCHID/

  • dd test
[root@localhost ORCHID]# dd if=/dev/zero of=file bs=128k count=50k
^C18625+0 records in
18625+0 records out
2441216000 bytes (2.4 GB) copied, 90.2074 s, 27.1 MB/s
  • iozone test:
Commandline used: iozone -R -l 2 -u 2 -r 128k -s 1000m -F /mnt/ORCHID/f1 /mnt/ORCHID/f2

"Throughput report Y-axis is type of test X-axis is number of processes"
"Record size = 128 Kbytes "
"Output is in Kbytes/sec"

" Initial write " 31658.41
" Rewrite " 28010.82
" Read " 95170.81
" Re-read " 114052.75
" Reverse Read " 216960.58
" Stride read " 414234.79
" Random read " 3097005.52
" Mixed workload " 37324.00
" Random write " 28707.09
" Pwrite " 29535.80
" Pread " 108033.00
" Fwrite " 29034.48
" Fread " 109954.27

iozone test complete.

While performing the dd test:
  • top
[proc.jpeg — top output screenshot]
  • zpool iostat

[root@freenas ~]# zpool iostat 1

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.74G 10.9T 0 523 0 60.5M
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.74G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.74G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.80G 10.9T 0 542 0 62.8M
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.80G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.80G 10.9T 0 265 0 33.2M
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.87G 10.9T 0 267 0 28.4M
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.87G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.87G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.93G 10.9T 0 523 0 60.6M
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.93G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.93G 10.9T 0 390 0 48.8M
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.99G 10.9T 0 134 0 11.8M
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
RAID1ZFS1 238G 5.20T 0 0 0 0
RAID6ZFS1 7.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----

  • ifstat:

[ifstat.jpeg — ifstat output screenshot]

From CentOS bare metal I get slightly better results, but still far from expected.

It seems to me the network is not saturated (far below the iperf result) and disk IO is not saturated (lots of idle in iostat). So I wonder where FreeNAS is choking to end up with, e.g., 20MB/s write throughput. I understand I don't have the best rationale to justify my current settings, and the parameters may not be optimized. But still, it shouldn't give this kind of result, should it? Thanks greatly for your help.
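To put rough numbers on it: iperf reports about 663 Mbit/s, i.e. roughly 663 / 8 ≈ 83 MB/s of usable bandwidth on a single GbE link; the local dd write runs at about 486 MB/s; yet over NFS I only see around 20-27 MB/s, well below either limit.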
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
If you're not using the RAID controller for a pool, you could use it as a ZIL if it has a BBU.
I got ~50MB/s NFS write with an old Perc5 with 256MB cache and some old 250GB SATA1 disks.
 

eraser

Contributor
Joined
Jan 4, 2013
Messages
147
In my testing, setting "vfs.nfsd.async=1" does not seem to have any effect at all. I would suggest removing it.

"zfs set sync=disabled" only makes a difference if your NFS client forces all writes to be sync (such as when using NFS for a VMware ESXi datastore). Since you are mounting NFS directly from CentOS, there should be no need to set sync=disabled. I would suggest setting it back to sync=standard.

Hmm, I don't know if you should use autotune. See: http://forums.freenas.org/index.php...tings-negative-read-impact.19153/#post-106613
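To revert the sync setting, using the pool name from your earlier post, something like this should do it:

zfs set sync=standard RAID6ZFS1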
 

eraser

Contributor
Joined
Jan 4, 2013
Messages
147
Am I correct in assuming that you do NOT have dedupe enabled? Good.

Do you have compression enabled? If so, you can't rely on 'dd if=/dev/zero ...' for testing since those zeros compress very well. Same with iozone tests - you'll need to add the following iozone parameters to have it use uncompressible data: "-+w 1 -+y 1 -+C 1".

Can you confirm that your CentOS NFS Clients are using at least a 64KB rsize/wsize? The output of "nfsstat -m" on the client should tell you what it is using.
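Something along these lines should cover both checks, reusing the dataset and paths from your earlier posts (adjust to match your setup):

On the FreeNAS box:
zfs get compression,dedup RAID6ZFS1/ORCHID

On the CentOS client:
nfsstat -m
iozone -R -l 2 -u 2 -r 128k -s 1000m -+w 1 -+y 1 -+C 1 -F /mnt/ORCHID/f1 /mnt/ORCHID/f2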

Can you test with a different version of linux to rule out CentOS as the cause of the problem? I have had good luck testing NFS performance with Ubuntu.
 

mosquitou

Dabbler
Joined
Mar 4, 2014
Messages
11
If you're not using the RAID controller for a pool, you could use it as a ZIL if it has a BBU.
I got ~50MB/s NFS write with an old Perc5 with 256MB cache and some old 250GB SATA1 disks.
Thanks for your reply, Rand. If I understood correctly, unfortunately I am using the RAID controller for the pools, because I cannot disable the RAID controller function in the DAS and "downgrade" it to JBOD. I could, however, use one 300GB 15k rpm SAS drive from the EonStor R2240 (the RAID controller) as a ZIL; I haven't done so yet, but maybe it is worth testing.

To understand this better (please correct me if I'm wrong): the ZIL belongs to the ZFS filesystem itself, which means it may help boost the local disk IO write test too. But in my case, local disk IO is already good enough without a dedicated ZIL, so the ZIL may not be the main reason it's choking...
 

mosquitou

Dabbler
Joined
Mar 4, 2014
Messages
11
Am I correct in assuming that you do NOT have dedupe enabled? Good.

Do you have compression enabled? If so, you can't rely on 'dd if=/dev/zero ...' for testing since those zeros compress very well. Same with iozone tests - you'll need to add the following iozone parameters to have it use uncompressible data: "-+w 1 -+y 1 -+C 1".

Can you confirm that your CentOS NFS Clients are using at least a 64KB rsize/wsize? The output of "nfsstat -m" on the client should tell you what it is using.

Can you test with a different version of linux to rule out CentOS as the cause of the problem? I have had good luck testing NFS performance with Ubuntu.

Thanks a lot eraser, your recommendations sound interesting. Now I understand the zfs sync parameter better. Just to double-check: on my FreeNAS server, do I need to do anything in /etc/exports to make it "async"? Yes, I'll turn off autotune and try again.

Compression is disabled on both the /mnt/RAID6ZFS1 zpool and the /mnt/RAID6ZFS1/ORCHID dataset. If I remember correctly, the CentOS NFS client default is rsize=wsize=65536; I'll confirm this on Monday and also set up an Ubuntu LTS client in ESXi to test again.
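For reference, I can also pin the sizes explicitly on the client mount (same export as before) and then verify them with nfsstat -m:

mount -t nfs -o tcp,rsize=65536,wsize=65536 10.77.24.16:/mnt/RAID6ZFS1/ORCHID /mnt/ORCHID/
nfsstat -m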

Btw, I'd like to get a proper understanding of how NFS works. Assume it's in NFS async mode: the NFS client specifies rsize and wsize as its local read/write buffer sizes, and on the NFS server side I assume there is also some buffer that receives these IO requests and pushes them to the disk controller as fast as it can handle them. How do I inspect this server-side buffer (if it exists at all)?

Have a nice weekend! I spent mine reading all your replies :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
mosquitou,

Do you have any idea how completely bad it is to even consider ZFS on RAID? There's no self-healing, etc. You literally just gutted ZFS with your poor choices. I just got done laughing at someone else (http://forums.freenas.org/index.php?threads/cannot-expand-my-zpool-need-help.19337/) with jgreco. I'm guessing you want a turn at being laughed at?

And we laugh at people that do sync=disabled because of the dangers. Even the ZFS inventors say it's for troubleshooting ONLY and should never, ever be used on a system with actual data.

Well, good luck. I'm sure you'll be a statistic soon enough around here. :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I need a new pick algorithm for determining which threads to respond to.

My old one consisted of

if (any_post(makes_head_go("....ow"))) then exit(1);

But the performance issue is kind of interesting here. I wonder how thoroughly the network was tested.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, I know for a fact that doing ZFS on RAID causes all sorts of performance anomalies, so I wouldn't be even slightly surprised if he ditches the RAID card like he should have on day 0 and his problems magically disappear.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
From my point of view, there's simply far too much "load it on" / "run it" / "wonder why it performs like crap" that goes on...

Admittedly I may err in the opposite direction as it can be 3-6 months for the testing, burn-in, debugging, beta testing, then deployment phase.

But usually along the way most problems are isolated and remediated.

The folly of trying to identify performance problems in a nonstandard, ill-advised setup by asking the forum for expertise ... well, some of us took the time to write docs to help users test their own systems.
 

mosquitou

Dabbler
Joined
Mar 4, 2014
Messages
11
Thank you cyberjock and jgreco for your honest words, really. I have been aware of the ZFS-over-RAID warning from day 0 (http://doc.freenas.org/index.php/Hardware_Recommendations), but I naturally assumed it was purely a disk-controller concern and only remotely relevant to my NFS performance (since the local disk test yields good results). I agree it's a bad idea in principle, and I don't want to bother you with the history behind it, which is not the purpose here. That being said, I'll definitely run another test without the RAID controller and report back.

Before that, even assuming ZFS over RAID is the problem, I still don't understand how it contributes to the NFS performance issue. Tell me if I'm fussing over it, but I think this is something worth knowing, or learning.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The problem here is we can't do more than stab in the dark.

Imagine this is a performance car forum. For the most part the formula for a fast racecar is known, or a pleasant touring car, and we can even stretch those things to do station wagons.

But here you come with a truck tractor. It's diesel, has three axles, and ten wheels. We know the basics, like it has a steering wheel, gearshift, and brakes. But we don't really have much advice as to why your truck can't go past 180MPH, and can't even reach 70MPH.

We have some general ideas but quite frankly it is in your court to determine if you can lower the weight, optimize the engine, or use different tires. You probably need to do a bunch of things to get above 70MPH.

I already suggested checking the network, but really you should stop looking at the thing as a whole and start testing subsystems to see what improvements can be made.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
NFS is typically a large number of small writes. Generally, that creates certain problems with IOPS, which are the absolute worst thing you want when you do ZFS over RAID. You could potentially improve performance somewhat by enabling the write cache on your controller. But you'd be a complete idiot (and I'd call you out on it if you did that on a system with real data) because of the potential for data loss.

And do yourself a favor: stop reading what we say, trying to interpret why we said it, and then doing exactly what we told you not to do.

If we tell you not to use a RAID controller, don't try to explain how it's okay for your situation. Just accept that a RAID controller is stupid. If we say 8GB of RAM minimum, don't tell yourself 4GB is okay because you think 8GB is overboard. That thought process is what screws over more newbies than anything else. This isn't like other OSes. You have to know the reasons why, and 99% of the time you are going to get it wrong (just like you did with the RAID controller). If we tell you "don't do this", that doesn't mean "unless you know better". Especially if you don't have plenty of ZFS and FreeBSD experience, which you clearly don't have. :)
 

eraser

Contributor
Joined
Jan 4, 2013
Messages
147
To understand this better (please correct me if I'm wrong): the ZIL belongs to the ZFS filesystem itself, which means it may help boost the local disk IO write test too. But in my case, local disk IO is already good enough without a dedicated ZIL, so the ZIL may not be the main reason it's choking...

Dedicating a separate drive for your ZIL (also known as a SLOG) will only help if you have a lot of sync writes. Since your earlier test of setting "sync=disabled" on your dataset did not make a difference, I don't think adding a SLOG will help you.
 

eraser

Contributor
Joined
Jan 4, 2013
Messages
147
In my opinion mosquitou isn't really running FreeNAS on a RAID controller. mosquitou is running FreeNAS against a LUN presented by a SAN appliance. It just so happens that the SAN presents that LUN from a pool of disks that happen to be part of a RAID set, but to the FreeNAS box it should look like a directly attached disk.

Local disk testing shows that backend storage is plenty fast. The anomaly mosquitou is trying to figure out is why performance degrades when writes are done over NFS vs being local.

Yes it may well just happen that the backend SAN is the cause of the problem, but since local disk io performance is good I don't think we can just write it off.

=-=-=

Interestingly, I see from the diagram that an Infortrend "EonStor" disk appliance is in use. I see that Infortrend also makes "EonNAS" appliances which appear to use ZFS as their underlying filesystem. ( see: http://www.infortrend.com/global/products/families/EonNAS , and their support forum at http://www.eonnas.com/forum )
 

mosquitou

Dabbler
Joined
Mar 4, 2014
Messages
11
Interestingly, I see from the diagram that an Infortrend "EonStor" disk appliance is in use. I see that Infortrend also makes "EonNAS" appliances which appear to use ZFS as their underlying filesystem. ( see: http://www.infortrend.com/global/products/families/EonNAS , and their support forum at http://www.eonnas.com/forum )


Yes eraser, you're absolutely right. As far as the history is concerned, we did not buy EonNAS because the EonStor DS is cheaper and we had a spare server to run FreeNAS. I can't tell you much more, since this is how things were when I arrived, and I am more a user than the owner of this FreeNAS setup. To set the right spirit: I'm here to solve, not to judge. Besides, what I have is not that bad; otherwise I wouldn't even have the chance to work on FreeNAS and learn from you guys.

That's enough of an anecdote; I don't intend to grow this thread further until I have concrete test results next week.

mosquitou, can you show the output of running 'dmesg'?


Sure thing, it's coming...
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
In my opinion mosquitou isn't really running FreeNAS on a RAID controller. mosquitou is running FreeNAS against a LUN presented by a SAN appliance. It just so happens that the SAN presents that LUN from a pool of disks that happen to be part of a RAID set, but to the FreeNAS box it should look like a directly attached disk.

Local disk testing shows that backend storage is plenty fast. The anomaly mosquitou is trying to figure out is why performance degrades when writes are done over NFS vs being local.

Yes it may well just happen that the backend SAN is the cause of the problem, but since local disk io performance is good I don't think we can just write it off.

And I don't care about any of that. I figured either he's using a real RAID controller, he's doing a SAN, or he's doing a SAN that uses RAID. But it doesn't change the outcome... RAID + ZFS = FAIL for every possible iteration of RAID and every possible kind of FAIL.

You know what the problem is? At the deepest levels, ZFS expects to have direct, individual access to the disks. It uses the multiple disks in the vdev, knows that they are multiple disks, and uses that knowledge to do more work in less time, to provide self-healing, and all the other features that no doubt made you say "ZOMFG ZFS is so f*cking awesome I gotta have this right now". Well guess what? All those super cool features you wanted? Almost none of them exist now. You damn well neutered ZFS when you chose to ignore the dozens of warnings around here.

So nothing personal eraser (and mosquitou), but I was dead f'in serious when I said RAID + ZFS = fail, in far more ways than either of you probably understand and more ways than I'm about to explain. There's a list of reasons why RAID + ZFS = fail, and if you aren't willing to take the manual at face value, then feel free to continue doing what the manual says not to do. You'll be a statistic, and I'll be telling you that you should have listened to me.

But frankly, this thread has tired me out, and I have zero incentive or motivation to continue to hash out why what you are doing is wrong. It's wrong, and that's all I'm about to explain. If that answer isn't good enough, then either spend a few years learning this junk or just take my word for it.

Good luck to both of you.
 

eraser

Contributor
Joined
Jan 4, 2013
Messages
147
Yes eraser, you're absolutely right. As far as the history is concerned, we did not buy EonNAS because the EonStor DS is cheaper and we had a spare server to run FreeNAS.


Is your eventual plan to dedicate all disks in your EonStor to FreeNAS? If so, it might be better to present each physical disk as a separate LUN to FreeNAS and do any RAID configurations on the FreeNAS server itself. But that's something that you can test later on.
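For illustration only (the device names below are placeholders, and on FreeNAS you would normally build the pool through the Volume Manager rather than the command line), a RAIDZ2 pool over six individually presented disks would look something like:

zpool create tank raidz2 da2 da3 da4 da5 da6 da7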
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Is your eventual plan to dedicate all disks in your EonStor to FreeNAS? If so, it might be better to present each physical disk as a separate LUN to FreeNAS and do any RAID configurations on the FreeNAS server itself. But that's something that you can test later on.

Yeah, except ZFS expects direct disk access... which a SAN is NOT. So unless you have actual physical disk access, meaning the disks attached directly via SATA/SAS, you are still doing it wrong.

I mean... how hard is it to read this message from the manual and just agree with it...

NOTE: instead of mixing ZFS RAID with hardware RAID, it is recommended that you place your hardware RAID controller in JBOD mode and let ZFS handle the RAID. According to Wikipedia: "ZFS can not fully protect the user's data when using a hardware RAID controller, as it is not able to perform the automatic self-healing unless it controls the redundancy of the disks and data. ZFS prefers direct, exclusive access to the disks, with nothing in between that interferes. If the user insists on using hardware-level RAID, the controller should be configured as JBOD mode (i.e. turn off RAID-functionality) for ZFS to be able to guarantee data integrity. Note that hardware RAID configured as JBOD may still detach disks that do not respond in time; and as such may require TLER/CCTL/ERC-enabled disks to prevent drive dropouts. These limitations do not apply when using a non-RAID controller, which is the preferred method of supplying disks to ZFS."

That's copy/paste straight from the manual. How hard is this? Be honest. Is what is written too complex to understand? Because frankly, I'm convinced that either I'm being trolled, people are completely and utterly incompetent, or somehow that statement isn't as clear cut as it sounds like it is to me.
 