Mediocre NFS read performance

draggy88

Dabbler
Joined
Jun 19, 2015
Messages
19
Hi, I have set up a new lab and the NFS performance isn't what I expected.

Both servers:
2x 10GbE LAGG interface (LACP)
128GB ECC RAM
2x Intel E5-2697 v3 CPUs
NFSv4 enabled

Truenas server:
OS: TrueNAS 12.0
47TB pool with LZ4 compression
(18x 4TB SATA Enterprise Seagate disks in RAIDZ3)
~110GB free memory for cache

Client server
OS: Red Hat Enterprise Linux 8.3
30GB ramdisk (tmpfs)

mount string : IP:/mnt/STORAGE2/Share2 /opt/NAS-SHARE nfs defaults 0 0

Rsyncing large ~10GB media files from the NFS share to the local ramdisk gets about 350MB/s.
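
Basically this on the client (the file name and ramdisk mount point are just examples):

# copy one large file from the NFS mount into the tmpfs ramdisk and watch the rate
rsync --progress /opt/NAS-SHARE/bigfile.mkv /mnt/ramdisk/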

All reports on the server seem OK.
Anyone know where to start looking?
Mount options?

/TE
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
18x 4TB SATA Enterprise Seagate disks in RAIDZ3
There's your first problem... an 18-wide RAIDZ VDEV.

Maximum recommended VDEV width is 12 for RAIDZ.

Depending on what you are looking for in terms of performance, you might want to consider moving to mirrors (https://www.truenas.com/community/threads/the-path-to-success-for-block-storage.81165/) if your objective is IOPS.

If you just want throughput, you could think about a couple of VDEVs, or maybe even 3x 6-wide in RAIDZ2.
 

draggy88

Dabbler
Joined
Jun 19, 2015
Messages
19
Well, when selecting 18 disks, TrueNAS itself defaulted to RAIDZ3.
Remember, I'm only benchmarking READ performance, not write.
So if I create a stripe-only pool, I should get vastly better performance read-wise?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
An 18-wide stripe will certainly be different... I'll leave it for you to report on the results.
 

draggy88

Dabbler
Joined
Jun 19, 2015
Messages
19
OK, so I have 36x 4TB drives to use. What is the recommended layout to get one big pool to use as an NFS share? I'm open to suggestions.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
RAIDZ2 VDEVs (6 or 8 wide) might be the best compromise between a little redundancy and good throughput (for one client).

If you're aiming to have many clients, you'll need to think about IOPS a bit more, so maybe more (and narrower) VDEVs.
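
If you were doing it from the CLI instead of the GUI (pool name and device names below are just placeholders), two 6-wide RAIDZ2 VDEVs in one pool looks roughly like this:

# sketch only -- each "raidz2 ..." group becomes its own VDEV in the pool
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 \
    raidz2 da6 da7 da8 da9 da10 da11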
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Can I create like 6x RAIDZ2 VDEVs and stripe them on top in TrueNAS?
If you add a VDEV (later or at pool creation time), the VDEVs are striped together.
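
From the CLI that's just adding another VDEV to the same pool, e.g. (device names again placeholders):

# add one more 6-wide RAIDZ2 VDEV; new writes are striped across all VDEVs,
# existing data is not rebalanced
zpool add tank raidz2 da12 da13 da14 da15 da16 da17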
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
350MBytes/sec is pretty respectable for a RAIDZ3.

Also, you have to remember that ZFS doesn't aggressively prefetch, and the only way to get really fast speeds is for the data to be coming in as fast as it is going out the ethernet. You will get stellar performance on ARC data, of course, because there is no meaningful latency involved. However, for a single-stream copy of something like an ISO, you are always going to be somewhat constrained by the limitations of having to traverse the network, ZFS itself, the I/O devices, etc. It is generally unrealistic to expect speeds much faster than this unless the data has been prefetched.

ZFS generally needs parallel I/O to get greater speeds, and that is because it is like widening a road from a single lane to two lanes. It doesn't mean that one car can go twice as fast, but it does mean that two cars can both go the speed limit. Otherwise, the latency in the system tends to dominate the maximum read speed. Ironically, you can probably write faster than you can read.
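
If you want to see what parallelism buys you, try a few sequential readers at once from the client (file names are just examples) and watch the aggregate:

# several concurrent sequential read streams against the NFS mount
for f in file1.iso file2.iso file3.iso file4.iso; do
    dd if=/opt/NAS-SHARE/"$f" of=/dev/null bs=1M &
done
wait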
 

draggy88

Dabbler
Joined
Jun 19, 2015
Messages
19
@jgreco very good points!
I just imagined that a single copy to RAM on the client side would be able to "align" all 18 drives in the pool, to give at least more than 2 drives' worth of max performance.
The data is spread over 18 - 3 drives, so all 15 are reading bits of the "iso".
And since a single Seagate Enterprise 4TB drive can easily do 150MB/s sequentially, I thought it would be more.
 

draggy88

Dabbler
Joined
Jun 19, 2015
Messages
19
When I look at the dashboard, I see about 0% CPU usage, and I have about 110GB of unused memory for cache. Is there a way to see the I/O bottleneck? Like iowait or iotop?
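On the TrueNAS side, would the closest equivalents be something like this?

zpool iostat -v 1    # per-VDEV / per-disk throughput, refreshed every second
gstat -p             # per-disk busy %, since TrueNAS 12 is FreeBSD-based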
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't interpret what you're seeing as being a bottleneck, really. There isn't one magic thing getting in your way (the classic definition for bottleneck).

If I can be blunt, you're imagining that ZFS is going to magically understand that your intention is to read the entire file, and therefore it could "align" the 18 drives and read-ahead aggressively enough to beat 350MBytes/sec. In fact, that's the only reason you're GETTING 350MBytes/sec -- ZFS does do some prefetching. Just not enough to sustain the speeds you envision.

If you try copying the same file a few times, ARC is likely to pick up the contents and cache it, and when it does, you will find yourself smashing the network as fast as the NAS can go.
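
A quick way to check that from the client (the file name is just an example; run the cache drop as root so the second read actually goes over the wire):

dd if=/opt/NAS-SHARE/bigfile.iso of=/dev/null bs=1M    # cold run: data comes off the disks
echo 3 > /proc/sys/vm/drop_caches                      # drop the client's own page cache
dd if=/opt/NAS-SHARE/bigfile.iso of=/dev/null bs=1M    # warm run: should be served from the NAS's ARC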

The other thing you can do is to see if there are tunables that will help you out. I am quite simply not up to date on this, so all I can really do is point you at

https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html

where you can look around to see if there is something that would cause more aggressive prefetching to happen.
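
On TrueNAS 12 (FreeBSD-based) you can at least see what the knobs are called. I think the file-level switch is vfs.zfs.prefetch.disable on OpenZFS 2.0, but check for yourself:

sysctl -a | grep -i prefetch    # list prefetch-related sysctls; 0 on the disable knob means prefetch is on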
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The default file-level prefetch in ZFS will actually become quite aggressive over time if it detects sequential I/O. From my understanding of the code, a hit on a previously prefetched block causes it to increase the size of the next prefetch.

Client requests block 1, ZFS gambles to grab block 2.
Client requests block 2, ZFS adds weight and grabs blocks 3 + 4.
Client requests block 3, ZFS adds weight again and grabs 5 + 6 + 7.
Client requests block 4, ZFS N+1's again and says "give me four more."

Now, because of copy-on-write, those blocks might be spread across the physical disk LBAs, which limits throughput. This is probably why the speeds are capped at 350MB/s in an 18-wide Z3. But if ZFS is at least trying to "stay ahead of the reads", the heads can be on the way to the right locations.

As far as tuning goes, there's a binary on/off for file-level prefetch as a whole, and some settings for the vdev-level cache: "for a read below size X, actually read size Y of data, up to a total amount per vdev of Z".
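
If you want to poke at those, the parameter names (from memory, so verify against your build) are roughly as follows:

sysctl -a | grep -Ei 'prefetch|vdev.cache'   # find the exact names on your build
# roughly: X = zfs_vdev_cache_max, Y = 2^zfs_vdev_cache_bshift bytes,
# Z = zfs_vdev_cache_size (defaults to 0, i.e. the vdev cache is disabled)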
 

draggy88

Dabbler
Joined
Jun 19, 2015
Messages
19
Thanks for the good explanations!
I have several of these servers in my lab now, and I'll create a standard RHEL 8.3 NFS server from an 18-disk mdraid 6 and compare ;)
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
I don't see how read-ahead would or could increase performance of sequential reads. If reads from the drives are faster than the network, then great, but then you've already hit that bottleneck. And if they're slower, it obviously won't manage the "ahead" part of the read. At best it will read what is asked for, and at worst it will be busy reading stuff you don't need and give lower performance (that is my experience with old RAID implementations).

What it could do is help with the latency of requests that are part of a slower overall read but need the data *now* in bursts. Of course, that comes at the expense of other IO unless there is IO to spare.

Not really sure what ZFS does, but it could be a "might as well read while the head is there" mentality, and that would only help with small stuff, right? Like reading small files that were written at the same time in bulk.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Read-ahead is an optimization that allows things to be ready more quickly.

If you had no read-ahead, every single disk block read would have to go through a set of things similar to the list of latency-inducing things that I describe in the SLOG sticky, such as:

Laaaaaaaaatency. Low is better.

The SLOG is all about latency. The "sync" part of "sync write" means that the client is requesting that the current data block be confirmed as written to disk before the write() system call returns to the client. Without sync writes, a client is free to just stack up a bunch of write requests and then they can send over a slowish channel, and they arrive when they can. Look at the layers:

Client initiates a write syscall
Client filesystem processes request
Filesystem hands this off to the network stack as NFS or iSCSI
Network stack hands this packet off to network silicon
Silicon transmits to switch
Switch transmits to NAS network silicon
NAS network silicon throws an interrupt
NAS network stack processes packet
Kernel identifies this as a NFS or iSCSI request and passes to appropriate kernel thread
Kernel thread passes request off to ZFS
ZFS sees "sync request", sees an available SLOG device
ZFS pushes the request to the SAS device driver
Device driver pushes to LSI SAS silicon
LSI SAS chipset serializes the request and passes it over the SAS topology
SAS or SATA SSD deserializes the request
SSD controller processes the request and queues for commit to flash
SSD controller confirms request
SSD serializes the response and passes it back over the SAS topology
LSI SAS chipset receives the response and throws an interrupt
SAS device driver gets the acknowledgment and passes it up to ZFS
ZFS passes acknowledgement back to kernel NFS/iSCSI thread
NFS/iSCSI thread generates an acknowledgement packet and passes it to the network silicon
NAS network silicon transmits to switch
Switch transmits to client network silicon
Client network silicon throws an interrupt
Client network stack receives acknowledgement packet and hands it off to filesystem
Filesystem says "yay, finally, what took you so long" and releases the syscall, allowing the client program to move on.

That's what happens for EACH sync write request. So on a NAS there's not a hell of a lot you can do to make this all better.

If you had to be doing each of those things for every block read, your disk reads would be as slow as sync writes without a SLOG. Each read transaction has to traverse that entire stack, sequentially, in a tragic game of one-block-at-a-time ping pong. There is a huge amount of latency involved for each transaction.

Now of course you can say "but we don't read a sector at a time" and you'd be right, this is just trying to illustrate the potential latency issues.

What read-ahead does, though, is create a much shorter list of latency-inducing events for ZFS talking to the client, and then another loop going back and forth from ZFS to the disks. These operate asynchronously ("in parallel"), so as long as the data the client wants is already in ARC, you are basically going across the network and asking for reads from ZFS ARC, which is about as fast as you can get for NAS. But ZFS has to correctly understand what it should be reading ahead; otherwise things come to a screeching halt and it has to go through the full chain.

Now the reality is actually a bit more complicated, but that's the general idea. If you can have a system smart enough to be reading ahead correctly, you get much closer to your theoretical I/O capacity.
 

draggy88

Dabbler
Joined
Jun 19, 2015
Messages
19
So with so much free memory, how can I "force" read-ahead and use the memory for it? Or is it automatic?
I don't expect to get more than 850MB/s, as this is a 2x 10GbE LAGG interface; LACP with one session wouldn't give me more than 900-ish MB/s.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The vdev cache has been disabled for, I think, almost as long as FreeNAS has been with iXsystems... was that ZFS v28?
 