SOLVED Storage Bottleneck

Mouftik · Jul 5, 2018

Hello everybody,

I am currently having some issues determining the issue of my FreeNAS server. I have read all threads on hardware recommendation and performance benchmarking but I am not able today to figure out what I am doing wrong.

My FreeNAS system have been running for almost 1 year now, but I have not used it so much, because I didn't fully trusted it and didn't find the time to investigate much. Also I wanted to be sure that my backup solution is safe. So to begin, my system is designed like that :
OS : FreeNAS 11.1-U5
CPU : Pentium G3220 @ 3Ghz
RAM : 32GB
Disks : 6* Seagate 4To drives in RAIDZ2

What I want to do with it :
- File sharing through NFS / SMB / AFP (Connected to LDAP)
- RSync module for my other off-site backup
- (Plex Server but it's only a plus service offered as I can remotely handle that via NFS and a more powerfull VM)
So I would want to serve AFP/SMB via one interface attached to my system (on one VLAN) and the NFS on a specific Storage VLAN.

My problem is the bandwidth I get through my system. I have tested my network through iperf and I get 950-980MBits per second which is not perfect but it's ok.
So I decided to test directly the performance of disks via the dd command like in this thread. As a result I get roughly 300MB/s in write speeds. I am a little bit surprised to not get better results but in RAIDZ2, data are written in N-2 disks so with 6 disks data are written to 4 which gives me 75MB/s each, fair enough.

But when I try to mount my dataset via NFS on a VM wired to the Storage VLAN network, and try to dd in it with the same command ... I get a max of 60MB/s. Not really stunning.
What I found weird is the RDD graph of my interface, even if I get a full 950MBits/s of throughput in the console (I tested FreeNAS as Server of Client), it shows 450MBits/second on my interface

I also tried to do a real world test, with large files, and also get this 60MB/s. And during this, my CPU didn't even get upper than 20% usage.

At this point I am a little bit disappointed, and don't know what could be wrong with my configuration. I mean, i hopped to saturate a gigabit link through this machine.
Did I need to tune some configuration as a basic (like using auto-tunables) ? Or maybee some packet fragmentation which slow NFS performances ?

Dice · Jul 5, 2018

What motherboard is used?
What NIC is used?

Mouftik · Jul 5, 2018

It's an Asus Z97-K motherboard with some Realtek NIC I think due to it's re0 name.
But even if I know the quality and love of Intel NIC in FreeNAS. But if I transfer a very large file through SMB, I have a consistent 112MB/s so I fully saturate the Gigabit connexion. The problem may appear only in NFS.

Mouftik · Jul 6, 2018

After further investigation, I am seeing something weird ... I've tested my disks and see nothing bad. Here is the results I have for dd commands (Dataset without compression and Sync disabled):

 $ dd if=/dev/zero of=tmp.dat bs=2048k count=10k

Results = 395 MBytes/s

 $ dd if=tmp.dat of=/dev/zero bs=2048k count=10k

Results = 456 MBytes/s

I thinks those results are not too bad ?
I also tested with sync enabled but ... only having 5MBytes/s difference.

So now I test the same thing but via NFS I am stuck to 60-65MBytes/s ...
Like I said before, my FreeNAS is connected to multiple VLAN so in the RDD I have the interface re0 graph, and each VLAN graph (I have 3 VLANs). And during the NFS, I see that I have 500MBits/s of traffic on the re0 interface and 500MBits/s of traffic on my VLAN interface.
Does the re0 graph reflects the total amount of data passing through this interface (Sum of all my VLANs) ?
If it's not, I don't understand why I would have any traffic coming directly to the interface ...

I also tested the same through SMB and AFP on the same file. Result is always a full 120MBytes/s on both directions.

garm · Jul 6, 2018

Have a look at this https://blogs.oracle.com/roch/nfs-and-zfs,-a-fine-combination

Mouftik · Jul 6, 2018

garm said:
Have a look at this https://blogs.oracle.com/roch/nfs-and-zfs,-a-fine-combination

Ok so for the theorical part, I may understand why in my case NFS may be slower than SMB of AFP. Even if I am asking myslef about data corruption in SMB/AFP ... Did they handle that in another way, or simply not ...

But what could be a valid solution ? I doubt that in production environments they sacrifice performance on single threaded application. If you want to move a large file over a network in a data center I don't think that 60MB/s is enough. And testing exporting in NFS a platter disk on an old computer with just a Debian gives me 75MB/s which is the single disk speed (Didn't change anything about NFSd configuration). I am a little bit confused at this point, what could I have done with the configuration which throttle the NFS so much on my FreeNAS ?

HoneyBadger · Jul 6, 2018

What is the O/S of your Plex VM, and is its NFS client requesting synchronous writes?

Mouftik · Jul 6, 2018

For now Plex is used as a Jail in my FreeNAS, but I don't want to waste Memory or CPU with this service. So I can move it to a Debian VM on my ESXi rig. I want my NAS to serve as a fine-tuned storage media which I can rely on and serve files as fast as possible to my clients.

For Synchronous writes ... I don't fully understand your question. I think it's default mount options in Debian and OS X (using both for testing). And default options has been used in both tests I have made.
Debian's default mount options in Jessie's Debian documentation are :

Code:

rw,suid,dev,exec,auto,nouser,async

so should be async writes in my case.

Also I don't know if it's important or not, but the FreeNAS server is attached to it's own UPS, and extinction order leads FreeNAS to be the last system to shutdown (Because he wouldn't rely on any other machine of the network).

HoneyBadger · Jul 6, 2018

Could be the default rsize/wsize values for NFS. I believe Debian defaults to a wsize of 8K on NFSv3, but it can be overridden in the mounts by specifying the byte limit with the "wsize=XXXXX" parameter (in multiples of 1024 - try increasing to the NFSv4 default of 32768)

Have you enabled NFSv4 on FreeNAS?

Mouftik · Jul 9, 2018

I didn't used NFSv4, what a difference can that make ? Like SMB and multi-stream data ?
I will try to increase the wsize but I was thinking of a Server-side stuff. Can a receive / send buffer make a difference ?

I will need to test it in some days, 2 drives of my RAIDZ2 6 drives have failed, so I will not stress them now :D Waiting for new disks to arrive and will test this NFSv4 and size change.

HoneyBadger · Jul 9, 2018

The default rsize/wsize under NFSv4 on Debian default to 32K. They can be tuned up larger on either v3 or v4 if you specify it, which I'd suggest you do.

rsize/wsize specifies the maximum size of packet that NFS will send. Bigger number means more throughput but possibly higher latency - in this case where you're just shuttling large files, use the larger block size to increase the throughput.

And yes, if you have 2 drives failed in a Z2 vdev, definitely a good idea to minimize activity there.

Mouftik · Jul 10, 2018

So my 2 spare drives arrived and successfully resilver :D ! Thanks ZFS for only rebuilding the needed space instead of bits by bits RAID conventionnal resilver !

The default rsize/wsize under NFSv4 on Debian default to 32K. They can be tuned up larger on either v3 or v4 if you specify it, which I'd suggest you do.

This is really interesting because I have some datasets with huge files (the ones with encrypted data) and ones with very small one (conf files from docker) so I can set sizes depending on the latency/throughput I want. I will try to do some tests on the large files with jumbo frames + high rsize/wsize.

Next step would be to add more than 1 ethernet port (This time base on Intel ones) and Aggregate those links to get more thant 1GBits/s throughput

HoneyBadger · Jul 10, 2018

Mouftik said:
Next step would be to add more than 1 ethernet port (This time base on Intel ones) and Aggregate those links to get more thant 1GBits/s throughput

Unfortunately that won't work quite as well as you intend it. Even if you configure your NFS client to spawn multiple connections, the load-balancing algorithms are based on source/destination MAC/IP addresses, which won't change, so all connections will still route down one path in order to preserve packet order.

And pNFS isn't in FreeNAS yet (I believe it's roadmapped for version 12) so you can't use that either.

You'll have to go to 10GbE or iSCSI to increase your throughput.

Mouftik · Jul 11, 2018

Yeah that's what I am seeing with my ESXi Server ... Too bad !

But Link Aggregation didn't work well with FreeNAS ? Because I have a D-Link 1210-24 which handles Link Aggragation directly. I am not allowed to use it with ESXi because they only handle Cicso stuff ... anoying !

Elliot Dierksen · Jul 11, 2018

Mouftik said:
But Link Aggregation didn't work well with FreeNAS ? Because I have a D-Link 1210-24 which handles Link Aggragation directly. I am not allowed to use it with ESXi because they only handle Cicso stuff ... anoying !

Asking as a networking person, what do you mean by the comment above? It doesn't make sense to me. Any form of link aggregation in the ethernet world is really load balancing, not actually binding links together. If you have 4 links, each side of the link independently applies a hashing algorithm to determine which link this conversation will use. In perfect world (nowhere near the one we inhabit), 4 clients talking to a server with 4 links, each conversation would traverse a different link giving the server an effective aggregate bandwidth of 4G. Each individual conversation can only get the speed of a single link. The problem is that it you have to choose the right algorithm. Take an example of a FreeNAS connected to a switch like yours and 4 clients on the same IP network. Most devices default to using the destination as the load balancing method. Since all traffic from the switch bound for FreeNAS will have the same destination, then all traffic picks the same link. In this scenario, the ideal way would be to have FreeNAS send to the switch load balancing based on destination. The switch should load balance to FreeNAS based on source. It all sounds easy, but it is frequently more complicated/tedious than is seems. End of rant....

Mouftik · Jul 11, 2018

Elliot Dierksen said:
Asking as a networking person, what do you mean by the comment above? It doesn't make sense to me. Any form of link aggregation in the ethernet world is really load balancing, not actually binding links together. If you have 4 links, each side of the link independently applies a hashing algorithm to determine which link this conversation will use. In perfect world (nowhere near the one we inhabit), 4 clients talking to a server with 4 links, each conversation would traverse a different link giving the server an effective aggregate bandwidth of 4G. Each individual conversation can only get the speed of a single link. The problem is that it you have to choose the right algorithm. Take an example of a FreeNAS connected to a switch like yours and 4 clients on the same IP network. Most devices default to using the destination as the load balancing method. Since all traffic from the switch bound for FreeNAS will have the same destination, then all traffic picks the same link. In this scenario, the ideal way would be to have FreeNAS send to the switch load balancing based on destination. The switch should load balance to FreeNAS based on source. It all sounds easy, but it is frequently more complicated/tedious than is seems. End of rant.... :)

I'm also more a networking person, and for what I remember, what you describe is NIC Teaming. You have multiple links which balance the traffic by hashing it and disperse packet through multiple links.
Good thing is that you don't need any switch side configuration, because the switch will see the machine with multiple MAC address, each coming from the Network card itself. Drawback is like you said, you can not excess a 1G speed for 1 client. No Negotiation, No configuration needed, you only need to bound links together in Linux so you can handle multiple 1G links, and if a Network card goes wrong, you discard it from the bound.

LAG is Link Aggregation Protocol which allows to have more than 1G/client (which 802.11ac is ... you aggregate multiple channels to achieve 1.3GBits/s at max). In this case, the switch sees you as one MAC Address so only one client but on multiple port. But to achieve that you need to negotiate with the switch and negociate via LACP for the two side to handle traffic correctly. But in this case, packets go through a unique "virtual interface" which distribute the packets over the physical interfaces. So you are able to handle a 4G traffic in a 1 to 1 link.

I didn't take my network courses in English, so that's why I may not have the right terminology for the technology. As Far as I see, NIC Teaming and LAG are named "Link Aggregation" but serves different solutions.

PS : For the ESXi part, they only handle CISCO proprietary solution (EtherChannel & PAgP) and HP solution. But they didn't handle the IEEE LAG specification AFAIK that's why I said it was annoying :)

HoneyBadger · Jul 11, 2018

ESXi does do LACP, but not on a standard vSwitch.

Sending packets "round-robin" across all members of a LAG violates Ethernet frame ordering rules, and if there's one type of traffic that responds extremely poorly to out-of-order frames, jitter, and other variances, it's storage. Hence you're limited to using the "hashing" methods that will decide which single path to send the traffic down.

As @Elliot Dierksen mentions, in a perfect world, with four clients having 1Gbps links on the other end of that 4Gbps LAG, each one is hashed and assigned to an unused link, and each gets a full 1Gbps of speed. But we don't live in that world, so two of those clients will end up hashed to the same interface and have to share while one link sits idle.

Elliot Dierksen said:
In this scenario, the ideal way would be to have FreeNAS send to the switch load balancing based on destination. The switch should load balance to FreeNAS based on source.

Or just set the entire link up to negotiate with src-dst-ip and then they'll both mix it up as best as possible.

Elliot Dierksen · Jul 11, 2018

HoneyBadger said:
ESXi does do LACP, but not on a standard vSwitch.

Sorry but I am picking nits today. :) LACP versus static is how a device decides if it going to do some form of aggregation. How it decides to use that link aggregation once it decides that it IS going to do aggregation is a different issue. You are correct about ESXi. You can do IP hash load balancing in a standard vSwitch, but that is static. I always prefer LACP where possible.The reason for that is when the aggregation is statically configured, it will use that link if it is physically up. It might be physically up, but that doesn't mean the other side is actually ready to handle traffic on that link. This is a definite issue when you have stacked switches. The port is up, but everything hasn't fully booted and synchronized. LACP means it won't start putting traffic on that link until the LACP neighbor responds to the negotiation. A best case, but you have to have an much higher (and more expensive) ESXi license to use that.

HoneyBadger said:
Sending packets "round-robin" across all members of a LAG violates Ethernet frame ordering rules, and if there's one type of traffic that responds extremely poorly to out-of-order frames, jitter, and other variances, it's storage. Hence you're limited to using the "hashing" methods that will decide which single path to send the traffic down.

Not only that. If you configure your switch to hash load balance but ESXi is not using IP hash, ESXi will discard frames that it thinks came in on the wrong port.

HoneyBadger said:
Or just set the entire link up to negotiate with src-dst-IP and then they'll both mix it up as best as possible.

It depends on the device as to what method of hashing it uses to pick the link. Some do it more intelligently than others. You have to test things out and see if it is operating as expected. If not, you might have to try different hash methods.

Mouftik · Jul 13, 2018

HoneyBadger said:
ESXi does do LACP, but not on a standard vSwitch.

Yes ! LACP is only possible with a more advances options coming with the distributed vSwitch, which also needs the vCenter stuff ... expensive for only having LACP activated :)

Elliot Dierksen said:
It depends on the device as to what method of hashing it uses to pick the link. Some do it more intelligently than others. You have to test things out and see if it is operating as expected. If not, you might have to try different hash methods.

I've also seen some intelligent ones which monitor the activity of an interface and pick the less crowded one, so network activity should escalate on both links in parallel ...

To come back to some NFS related stuff, it may appears that my testing machine was the cause of all my problem ! I decided to reinstall from scratch a Debian based 8.10 Jessie and only use my storage network in the network config and that's the results :

Code:

$ sudo mount -t nfs -o defaults 171.16.2.9:/mnt/TestDataset/Benchmark /mnt/test/
$ dd if=/dev/zero of=/mnt/test/tmp.dat bs=2048k count=5k
5120+0 records in
5120+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 98.8933 s, 109 MB/s

Another test with some specific configuration and dd Size > RAM

Code:

$ sudo mount -t nfs -o nolock,noatime,nodiratime,rw,hard,intr 171.16.2.9:/mnt/TestDataset/Benchmark /mnt/test/
dd if=/dev/zero of=/mnt/test/tmp.dat bs=2048k count=20k
20480+0 records in
20480+0 records out
42949672960 bytes (43 GB, 40 GiB) copied, 388.87 s, 110 MB/s

It appears I don't even need to modify the rsize/wsize on a 1G link ... even if I may do it on datasets I know I will only move >100MB files.

So I'm fully happy with the configuration and Testing. I can now have 100MB/s over NFS (and SMB/AFP) without having to disable sync or whatsoever which can cause data corruption ! I may have to tune some properties, because in some datasets I move a lot of small (1MB-10MB) files but performance is here so I'm happy for now :D

Chris Moore · Jul 13, 2018

I didn't go through the whole thread, so I apologize if someone already hit this. NFS is always synchronous and that means that you need a SLOG device and that's a feature of the filesystem

Sent from my SAMSUNG-SGH-I537 using Tapatalk

Important Announcement for the TrueNAS Community.

SOLVED Storage Bottleneck

Dabbler

Wizard

Dabbler

Dabbler

Wizard

Dabbler

actually does care

Dabbler

actually does care

Dabbler

actually does care

Dabbler

actually does care

Dabbler

Guru

Dabbler

actually does care

Guru

Dabbler

Hall of Famer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Storage Bottleneck"

Similar threads