Huge drop in read performance between single-file vs. folder transfers

Status
Not open for further replies.

Ace22

Cadet
Joined
Sep 13, 2017
Messages
2
Hi folks,

Problem:
When I drag a folder containing multiple subfolders/files from (Z: on the FreeNAS file server) to (D: or C: on the PC workstation) (some small, but mostly large video files broken down into 4GB increments), I get transfer speeds of only 20-50 MB/s. I am getting >1 GB/s in all other transfer situations.

- Is this what I should be seeing?
- Is there a flaw in my RAID/pool design based on my performance needs?
- Have I missed some settings or performance tweaks?
- Is this a hardware issue?
- Where is my bottleneck?

Any help or insight would be greatly appreciated.
If extra info is required - I won't have access to the server until mid next week, but I will retrieve any requested info/screenshots etc. then!
Thanks so much everyone!


Background info and system specs:

This is my first FreeNAS setup - I have spent many hours reading the manual and forums. Despite this, have I butchered the setup?

This is a storage server for a production company working with 8K video files. The requirement is high-speed transfer to one primary editing machine via the Mellanox connection.
I used this performance tuning guide for the Mellanox 10Gb connection. Down the line we'll add 10Gb Ethernet to other machines on the network, but we'll worry about that when the time comes.

Transferring between (C: ) and (D: ) on the editing rig, I can achieve speeds of 1.7 GB/s - regardless of single file or entire folder.
Transferring from the editing rig (D: ) to the FreeNAS storage server (Z: ), regardless of single file or entire folder, I get variable speeds peaking around 1 GB/s. (I get frequent error messages that, when accepted/ignored, transfer the file anyway - I can live with that for now and will deal with it separately; it's not the reason I am posting here.)
When transferring a large single file from the storage server (Z: ) to editing (D: or C: ), I get a sustained transfer speed of 1.2 GB/s.
To summarize this:
- C: can read/write over 1GB/s
- D: can read/write over 1GB/s
- Z: can write at 1GB/s (performance fluctuates, I do get errors, but overall speeds are good enough)
- Mellanox network can sustain 1.2 GB/s transfers in both directions

*Note: I am maxing out the 64GB of RAM about 30 seconds into a transfer; I have another 128GB arriving in a few days.

Storage Server Specs:
FreeNAS 9.10 or 9.11 (can't remember off the top of my head, but I used the latest stable available 3 months ago)
Supermicro X9 board
E5-2650 v2
24-bay chassis w/ backplane
4x 16GB = 64GB RDIMM 1066MHz ECC
LSI 9211-8i HBA
(Z: ) 16x HGST Ultrastar 3TB HDD *All 16 drives are in a single RAIDZ2, deduplication disabled*
Mellanox ConnectX-3 running at 10Gb

Editing Rig Specs:
X99 workstation motherboard
E5-2690 v4
64GB ECC RAM
(C: ) 1x Samsung 960 Pro NVMe
(D: ) 6x 10TB Seagate Enterprise He (Windows RAID 0 - yes, I know, no redundancy here; 3 copies of the data are kept so failure here is OK - I need max speed for editing 8K files) *max block size of 64k
 
Last edited:

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Hard to tell which side has the issue, as the Windows client is obviously managing the copying. Do you have any modifications in your SMB config on the FreeNAS box? For example, I have the following (Services > SMB > Auxiliary Parameters):

Code:
veto files = /Thumbs.db/.DS_Store/
delete veto files = yes
ea support = no
store dos attributes = no
map archive = no
map hidden = no
map readonly = no
map system = no


I got those from a thread here when I had a problem with slow directory listings on my CIFS shares. Much faster this way. Also, no hostname lookups. I don't think this has anything to do with your issue, but there may be some SMB tuning needed?
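(For the hostname lookup part, in case it is not already off on your box, the corresponding auxiliary parameter would look something like this - just a sketch, not something the OP necessarily needs:)

Code:
# skip reverse-DNS lookups of client hostnames when they connect
hostname lookups = no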

I'm no CIFS expert, so sorry I don't have more to offer in the way of help.
 
Last edited:

Ace22

Cadet
Joined
Sep 13, 2017
Messages
2
Thanks Toadman,

I'll take a look at the SMB tuning when I get back to the machine; I just set up a basic Windows share.

My current paranoia is that by selecting RAIDZ2 I kneecapped the potential for IOPS, and that I should have used mirrored vdevs - but it's a little crazy that transfers would go from so fast to soooo slow because of this, and only when reading?

I just don't have a true grasp on or experience with ZFS, so I'm hoping some other people have come across this scenario. Here is a passage I came across on the subject:
Normal performance
It’s easy to think that a gigantic RAIDZ vdev would outperform a pool of mirror vdevs, for the same reason it’s got a greater storage efficiency. “Well when I read or write the data, it comes off of / goes onto more drives at once, so it’s got to be faster!” Sorry, doesn’t work that way. You might see results that look kinda like that if you’re doing a single read or write of a lot of data at once while absolutely no other activity is going on, if the RAIDZ is completely unfragmented… but the moment you start throwing in other simultaneous reads or writes, fragmentation on the vdev, etc then you start looking for random access IOPS. But don’t listen to me, listen to one of the core ZFS developers, Matthew Ahrens: “For best performance on random IOPS, use a small number of disks in each RAID-Z group. E.g, 3-wide RAIDZ1, 6-wide RAIDZ2, or 9-wide RAIDZ3 (all of which use ⅓ of total storage for parity, in the ideal case of using large blocks). This is because RAID-Z spreads each logical block across all the devices (similar to RAID-3, in contrast with RAID-4/5/6). For even better performance, consider using mirroring.“

Please read that last bit extra hard: For even better performance, consider using mirroring. He’s not kidding. Just like RAID10 has long been acknowledged the best performing conventional RAID topology, a pool of mirror vdevs is by far the best performing ZFS topology.
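(For reference, a pool of mirror vdevs like the passage describes would be laid out roughly like this from the command line - purely a sketch, with "tank" and da0-da15 as placeholder names; FreeNAS would normally build this through the Volume Manager GUI:)

Code:
# hypothetical 8x 2-way mirror layout using the 16 existing disks
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5 \
  mirror da6 da7 \
  mirror da8 da9 \
  mirror da10 da11 \
  mirror da12 da13 \
  mirror da14 da15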
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
(Z: ) 16x HGST Ultrastar 3TB HDD *All 16 drives are in a single RAIDZ2, deduplication disabled*
That is the problem, if it is configured the way you say. You need to break that into more than one vdev for max performance. The vdevs can all be in one pool (one drive letter), but it can't be one big vdev if you want speed.
It is a 24-bay chassis; is it possible to buy more drives? If you could fill it and make 4 vdevs, all RAIDZ2, it would give you more potential simultaneous data paths, because of the way ZFS handles transfers at the software level. Each vdev (virtual device) is in effect a separate data path to the storage. The more vdevs, the more simultaneous paths and the faster the data can flow. That is a massive oversimplification, but I think it gets the point across.

What you would have would look like this:
Code:
  pool: Emily
state: ONLINE
  scan: scrub repaired 0 in 3h37m with 0 errors on Wed Aug 16 09:37:25 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	Emily										   ONLINE	   0	 0	 0
	  raidz2-0									  ONLINE	   0	 0	 0
		gptid/1c14fa53-2dcc-11e7-9529-002590aecc79  ONLINE	   0	 0	 0
		gptid/bbf7a1c8-73ee-11e7-81aa-002590aecc79  ONLINE	   0	 0	 0
		gptid/4a1279d1-6cc9-11e7-a216-002590aecc79  ONLINE	   0	 0	 0
		gptid/f375061e-6c41-11e7-b5a3-002590aecc79  ONLINE	   0	 0	 0
		gptid/c04e8d82-3d0f-11e7-b972-002590aecc79  ONLINE	   0	 0	 0
		gptid/e542985c-6c82-11e7-b5a3-002590aecc79  ONLINE	   0	 0	 0
	  raidz2-1									  ONLINE	   0	 0	 0
		gptid/ad56bf4c-7708-11e7-9f8e-002590aecc79  ONLINE	   0	 0	 0
		gptid/85963c58-6c9c-11e7-a216-002590aecc79  ONLINE	   0	 0	 0
		gptid/296866d9-6c26-11e7-b5a3-002590aecc79  ONLINE	   0	 0	 0
		gptid/c7aa351d-03fa-11e7-82c3-002590aecc79  ONLINE	   0	 0	 0
		gptid/74ed6a85-6c3f-11e7-b5a3-002590aecc79  ONLINE	   0	 0	 0
		gptid/ba7eaccb-6cc9-11e7-a216-002590aecc79  ONLINE	   0	 0	 0
	  raidz2-2									  ONLINE	   0	 0	 0
		gptid/959cd1b0-6694-11e6-867e-002590af8b19  ONLINE	   0	 0	 0
		gptid/6c70d9a6-6cb0-11e6-b454-002590af8b19  ONLINE	   0	 0	 0
		gptid/97462e3c-6694-11e6-867e-002590af8b19  ONLINE	   0	 0	 0
		gptid/bd959d2e-7896-11e6-b5ab-002590af8b19  ONLINE	   0	 0	 0
		gptid/98e163cd-6694-11e6-867e-002590af8b19  ONLINE	   0	 0	 0
		gptid/99c7cd32-6694-11e6-867e-002590af8b19  ONLINE	   0	 0	 0
	  raidz2-3									  ONLINE	   0	 0	 0
		gptid/16d1736e-8b2f-11e7-92b6-002590af8b19  ONLINE	   0	 0	 0
		gptid/cd7a14e5-8c44-11e7-92b6-002590af8b19  ONLINE	   0	 0	 0
		gptid/1fe96c3f-8dec-11e7-a5a5-002590af8b19  ONLINE	   0	 0	 0
		gptid/6458c3c0-8c77-11e7-92b6-002590af8b19  ONLINE	   0	 0	 0
		gptid/aca060b0-8d0e-11e7-a5a5-002590af8b19  ONLINE	   0	 0	 0
		gptid/01c07df6-8ae8-11e7-9652-002590af8b19  ONLINE	   0	 0	 0

errors: No known data errors


Make sense?
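(For reference, a 24-drive layout like that would be created roughly like this from the command line - purely a sketch, with "tank" and da0-da23 as placeholder names; in practice the FreeNAS Volume Manager GUI would do this for you:)

Code:
# hypothetical 4x 6-wide RAIDZ2 layout across 24 disks
zpool create tank \
  raidz2 da0  da1  da2  da3  da4  da5  \
  raidz2 da6  da7  da8  da9  da10 da11 \
  raidz2 da12 da13 da14 da15 da16 da17 \
  raidz2 da18 da19 da20 da21 da22 da23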
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS: Your usable capacity would only increase a little, but the potential speed would almost double, based on rough estimates.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Please read that last bit extra hard: For even better performance, consider using mirroring. He’s not kidding. Just like RAID10 has long been acknowledged the best performing conventional RAID topology, a pool of mirror vdevs is by far the best performing ZFS topology.
The reason is not because it is mirrors. The reason is because it has more vdevs. If you have 6 drives in a single RAIDZ2 vdev, you have only 1 vdev, and if you have the same six drives in mirrors, you have 3 vdevs. The greater the number of vdevs, the greater the performance. The cost associated is what really makes the difference. If you want 42TB of storage using mirrors, you have to buy 84TB worth of disks.
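(To put rough numbers on that for the OP's 3TB drives - raw figures only, ignoring ZFS overhead and TB/TiB differences:)

Code:
16-wide RAIDZ2 (current 16 drives):  14 data disks x 3TB = ~42TB usable, 1 vdev
4x 6-wide RAIDZ2 (24 drives):        16 data disks x 3TB = ~48TB usable, 4 vdevs
12x 2-way mirrors (24 drives):       12 data disks x 3TB = ~36TB usable, 12 vdevs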
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
It still doesn't make sense. The OP states, "When transferring a large single file from the storage server (Z: ) to editing (D: or C: ), I get a sustained transfer speed of 1.2 GB/s." This makes sense if Z: is a large RAIDZ2, as the data is read from multiple disks.

It's the writing that is confusing. The OP also states, "Z: can write at 1GB/s." But this statement is made in the same paragraph as the statement above. If the share is writing at 1 GB/s, then I don't see why there is an issue if the data is "some small, but mostly large video files broken down into 4GB increments." That's almost all sequential transfers.

So the read speed makes sense. If the transfer is from the workstation to the server (which is a write to the server), then it will only be as fast as the slowest disk in the vdev, right? 20-50 MB/s seems slow, even for 5400 rpm disks. I don't understand why the net transfers are so slow if they are mostly sequential transfers of 4GB files. Even a single disk should maintain 60 MB/s or better.

All that said, a single 16-disk RAIDZ2 is not the setup I'd have gone for. At least two 8-disk vdevs. Yes, a bit more space lost to redundancy, but an IOPS improvement of 2x or more.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
It still doesn't make sense. The OP states, "When transferring a large single file from the storage server (Z: ) to editing (D: or C: ), I get a sustained transfer speed of 1.2 GB/s." This makes sense if Z: is a large RAIDZ2, as the data is read from multiple disks.

It's the writing that is confusing. The OP also states, "Z: can write at 1GB/s." But this statement is made in the same paragraph as the statement above. If the share is writing at 1 GB/s, then I don't see why there is an issue if the data is "some small, but mostly large video files broken down into 4GB increments." That's almost all sequential transfers.

So the read speed makes sense. If the transfer is from the workstation to the server (which is a write to the server), then it will only be as fast as the slowest disk in the vdev, right? 20-50 MB/s seems slow, even for 5400 rpm disks. I don't understand why the net transfers are so slow if they are mostly sequential transfers of 4GB files. Even a single disk should maintain 60 MB/s or better.

All that said, a single 16-disk RAIDZ2 is not the setup I'd have gone for. At least two 8-disk vdevs. Yes, a bit more space lost to redundancy, but an IOPS improvement of 2x or more.
Admittedly, my math may be all wrong, and I was making some assumptions. Here is what I was thinking: if each drive was able to sustain 150 MB/s transfer and you had 4 groups of 6 drives in RAIDZ2, it should give a sustained performance to disk of 1028.57 MB/s. The formula that I use to figure that is pretty accurate; at least the results align pretty closely with the servers I manage at work.

Putting the numbers the OP has provided into the formula, he should only be getting a sustained performance to disk of 685.71 MB/s, so I figured the slowdown he was experiencing was due to the ARC being filled and needing to pause the transaction to flush to disk. I am making some assumptions here, so it could be all wrong, but he is having problems and I thought this sounded plausible.

It could also be down to some network tuning settings. That would be worth looking at.
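(One way to rule the network in or out is an iperf run between the workstation and the server, which takes the disks out of the picture entirely - a sketch, with a placeholder server IP; the Windows side needs an iperf build installed:)

Code:
# on the FreeNAS box
iperf -s

# on the Windows workstation (10.0.0.10 is a placeholder for the server's address)
iperf -c 10.0.0.10 -t 30 -P 4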
 
Last edited:

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
All 16 drives are in a single RAIDZ2

Too wide.

I'd suggest, at a minimum, breaking that into two 8-way RAIDZ2 vdevs.

This will double your random read/write I/O.

RAIDZ2 is good for sequential performance.
 

Pezo

Explorer
Joined
Jan 17, 2015
Messages
60
Here is what I was thinking: if each drive was able to sustain 150 MB/s transfer and you had 4 groups of 6 drives in RAIDZ2, it should give a sustained performance to disk of 1028.57 MB/s.
How did you get that number?
I would've thought 4 drives of data per vdev times 4 vdevs should give about 2.4 GB/s.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
How did you get that number?
I would've thought 4 drives of data per vdev times 4 vdevs should give about 2.4GB/s
There are so many different ways to figure this that every person who does it could get a different answer, and if there is a definitive "right" way that will always give the one true answer, I don't know what it is.
I had to make a lot of assumptions.
One was the data rate of the individual drives. He told us they are "HGST Ultrastar 3TB HDD" but I didn't look that up to see what the actual figures are; I made a guess that it would be around 150 MB/s.
Conventional wisdom with regard to vdevs is that each one gives roughly the performance of its slowest disk, so four vdevs gives roughly the speed of four disks - except there is more to it than that.

I am confused as to how you got the number you did. That is extremely high. It is not as simple as taking the 150MB/s and multiplying it by four then multiplying that by 4 vdevs. Have you actually seen that in practice?
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Putting the numbers the OP has provided into the formula, he should only be getting a sustained performance to disk of 685.71 MB/s, so I figured the slowdown he was experiencing was due to the ARC being filled and needing to pause the transaction to flush to disk. I am making some assumptions here, so it could be all wrong, but he is having problems and I thought this sounded plausible.

Exactly. I agree with you that the number should be high. So let's assume the 685 MB/s is correct: the system can do that speed at minimum, and faster if he's writing into a speedy SLOG. And that should be sustainable even if the system "pauses" to empty a txg (transaction group) to disk, as it will empty at that sustained rate. I.e., on average, the system can sustain 685 MB/s (or better, as he has shown). Which is why it's odd he's down below 50 MB/s. All this assumes large sequential transfers, which the OP says is his workload.

I guess in addition to the network tuning, I would look specifically at the transfers and what's managing them. Windows Explorer? Maybe there is an issue there. It's very odd if the OP can transfer at >1 GB/s on very large single files and not on directories of very large single files.
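(One way to take Explorer out of the loop would be to copy the same directory with robocopy and compare - a sketch with placeholder paths, from the Windows side:)

Code:
rem Z:\project and D:\project are placeholder paths
rem /E copies subdirectories, /MT:8 uses 8 copy threads
robocopy Z:\project D:\project /E /MT:8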

If there were MANY SMALL FILES (obviously increasing the IOPS requirement, which would be bad for a 16-disk RAIDZ2 vdev), it would easily explain the behavior. Maybe the OP can give us an example of the files/sizes in the directories he's trying to move and what the transfer rate was for that example?
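(From the FreeNAS shell, something like this would show how many small files are actually hiding in one of those directories - a sketch, with a placeholder dataset path:)

Code:
# count files under 1MB vs. total files in the directory (placeholder path)
find /mnt/tank/videos -type f -size -1M | wc -l
find /mnt/tank/videos -type f | wc -l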

(Were I the OP, I would still reconfigure the server regardless of solving the problem with certainty. Two 8-disk RAIDZ2 vdevs, four 4-disk RAIDZ2 vdevs, or eight 2-disk mirrors. It totally depends on the space requirements and how much redundancy is wanted.)
 
Last edited:

Pezo

Explorer
Joined
Jan 17, 2015
Messages
60
It is not as simple as taking the 150MB/s and multiplying it by four then multiplying that by 4 vdevs.
That's exactly what I did :D
Obviously I don't know much about vdevs and performance, but your number seems awfully slow - that's why I asked.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Which is why it's odd he's down below 50 MB/s.
I figured it dropped to that low speed when it was doing many small files. I have seen that happen. A single large file will transfer fast, where the same quantity of data in a large number of small files will transfer slowly.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Totally agree re: small files. But he says he's not doing many small files which is why it's an odd problem. Maybe he has more small files than he thinks. :)
 