
Write Performance Issues on a mid-level ZFS setup

Joined
Jun 19, 2013
Messages
4
Hi!

I am setting up a new shared storage solution at the business I work at. We are primarily a full-service post production facility for television and film, as well as VFX. So our main application is serving and writing media files. The majority of these files are image sequences (dpx and tiff) with some .mov files (ProRes and DNxHD mostly) mixed in. Because these are editing/coloring/VFX applications, they all demand a high amount of data in bursts as well as throughput consistent enough to support realtime playback of image sequences and .mov files.

Our facility is fully wired with 10G optical cables, at least to the machines that will need the bandwidth. I will focus on my personal work machine that I have been testing with and the actual server itself. However the majority of our facility is OS X with a few more windows machines creeping their way in (due to the non-expandability of the new mac pro...separate topic).

Because of this I would either like to highly optimize AFP or NFS.

Our shared storage is a 4U 24 hot swap bay supermicro case with the following internals:
  • Chipset - SUPERMICRO MBD-X9DR3-LN4F+-O
  • CPU'S - 2X -Intel Xeon E5-2620 Sandy Bridge-EP 2.0GHz (2.5GHz Turbo Boost) 15MB L3 Cache LGA 2011 95W Six-Core Server Processor
  • HBA'S - 3X -LSI LSI00301 (9207-8i) PCI-Express 3.0 x8 Low Profile SATA / SAS Host Controller Card
  • HDD - 20X - Seagate Constellation ES 2TB 7200RPM 6Gb/s SAS 64MB Cache Drives
  • RAM - 48GB for test period, production box will have order for 128GB
  • Networking - 2X - Myricom 10G-PCIE2-8b2-25 Lanai Z8ES Chipset based 2 port nic
  • I also have at my disposal 2x Seagate 600 Pro Series 240GB 2.5" SATA III MLC Enterprise SSD's that I ordered in case it was needed
The machine I am running these speed tests from is the following:
  • Mac Pro 2.93 Ghz 2x6core Xeon
  • 96GB Ram
  • Myricom 10G-PCIE2-8b2-25 Lanai Z8ES Chipset based 2 port nic
  • OS X 10.8.3
For the zpool setup I have tried the following all using FreeNAS 8.3.1 p2:
  • RAID 10-like setup with mirror vdevs striped
  • RAID 50-like setup with 4-disk RAIDZ vdevs striped
I mention this because both of these setups had negligible performance differences in my tests, so I am led to believe that this may be due to a different problem altogether.

I will mention before I go any further that I am only using one port of one 10G NIC on the server at the moment, and that the traffic by and large is going through a Cisco enterprise fiber switch with a fabric extender. The engineer who set up our switch informed me that it is only enabled for 1500 MTU instead of 9000 MTU. Because of this I have set the NIC to 1500 on both sides. I did, however, try directly connecting the two machines and setting 9000 MTU on both, with minimal gains in my testing.

I have gone through all of the proper tuning on both ends with the Myricom cards as found here: https://www.myricom.com/software/myri10ge/392-how-do-i-troubleshoot-slow-myri10ge-or-mx-10g-performance.html#freebsd

My benchmark, for better or worse, is Blackmagic Disk Speed Test. A lot of engineers will instantly scoff when I say this is my benchmarking tool, but it is what our facility uses to compare different systems. In some ways it is appropriate: we are a post facility, and it does help gauge the latency and available bandwidth needed for realtime playback.

My results are as follows:
  • For AFP I can get on average 220-250MB/s write and 600-670MB/s read
  • For NFS I can get on average 140-160MB/s write and 150-160MB/s read (using 24 threads)
  • For CIFS I can get on average 150-180MB/s write and 200-210MB/s read
The read on AFP is right around the area we would be very happy with. The rest of the speeds I am coming across (mostly write speeds) are a bit lacking.

I have done massive amounts of research, read all of the manuals, peeked at the evil tuning guide (haven't tried any of those just yet) and am really just coming up short.

I have read some of the moderators on this forum saying that they are able to get 1GB/s write speeds on their home systems, and I would be enormously happy just to get 450-500MB/s write. Heck, as it stands, the AFP read already blows me away on one 10G connection! I would just like to know what I could do to squeeze out any more write performance. We are looking to aggregate those 4 10G ports in the near future (however, I am trying to get them to tackle rolling out 9000 MTU first, and both need to be supported on the switch).

I do apologize for the novel I have just written, but I have read countless threads where people did not provide enough information and then, 15 posts later, finally got to the root of the problem. So I decided to err on the side of too much detail and hope for the best.

Thanks in advance for your time and for any advice you are willing to provide!
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
12,153
You've written one of the better questions to show up here in awhile; unfortunately it's also harder to address, because you haven't made any super-obvious mistakes.

I'm going to make some commentary that is designed to be a starting point for your own investigation rather than a comprehensive answer; I don't have a comprehensive answer.

Most people here are not using 10G; the normal problems we address involve people having trouble maxing out 1 or maybe 2 1GbE links. I am probably one of the few people here with any 10GbE. It may not be what you want to hear, but checking with iXsystems about a support contract might be a wise investment.

I was disappointed to see you had three 9207-8i HBA's; that seems like it'd be an optimal situation for ZFS, and the more-usual selection of a high density RAID card is one of the problems I like to highlight. Nevertheless, I have a question for you: run "camcontrol devlist -v" and report the driver being used to attach, like "mps" or "mpt". I expect it is "mps" which is good. Now from that same dev list, you'll see a list of all your drives like "da2", "da3", etc. Please do a parallel dd on all of them. From the console shell prompt, something like

csh# foreach i (da2 da3 da4 <etc>)
foreach? dd if=/dev/${i} of=/dev/null bs=262144 &
foreach? end

which will spawn off some parallel reads of all your disks. Now if you run "iostat da2 da3 da4 1" (note the 1!) it'll show you raw transfer speeds. Let it run awhile and establish an average (need not be scientific, eyeball is fine).

Now stop all the dd's and repeat test on a single drive. Should hopefully be the same. That means you probably don't have any serious I/O contention problems. It also gives you an idea of what your hardware cannot exceed. But be aware that ZFS is incredibly piggy, the tax we pay for awesome capabilities is significant. If you figure your array capable of 150MB/sec/drive (3000MB/s aggregate for 20), it still wouldn't shock me to find ZFS limiting you to a fraction of that.

So next go to your pool and do a "dd if=/dev/zero of=testfile bs=1048576" and then let that run for 10 minutes, hit ^C, and see how fast she writes. Then you can reverse the test with "dd if=testfile of=/dev/null bs=1048576" for read.

You've now established potential and actual numbers. We can discuss those a little further once you have them. Expect to be disappointed that your potential is substantially greater than actual. That's just a ZFS thing. And of course your protocol numbers are less than actual numbers on the console. Tuning can reclaim some portion of the various taxes in the various layers, but it may require some work.

So I'm going to move on to my primary item of concern: those E5-2620's. I would rather have fewer fast cores for a high performance fileserver than more slow cores. I have an E5-2609 in the box on the bench (2.4GHz/4C/10M/no threads) and it is pretty butt-draggy compared to the E3-1230's (3.2GHz, 3.6 turbo/4C/8M/threads) we kind of love around here .... and you've managed to find an E5 that's actually slower than the cheapest POS 2011 I could scrounge up for bench use. Now, the awful bad news is that my needs and yours totally diverge: I'm playing the waiting game for the new E5 v2's to come out and hoping to score a 10- or 12-core for maybe $2K, because I'm all about the virtualization. The E5's are not strong on core speeds, you're mostly lucky if you can get north of 3GHz, but you can get 8 cores of it. The E3 low ends at 3GHz, with the top being 3.7-turbo-to-4.1. That's like twice your CPU core speed. Eugh!

So, if money were no object, what I'd suggest is to pull the dual 2620's and replace them with a single E5-2643, which is probably the only E5 that is meaningfully competitive with the E3's on a clock-speed basis. I'm guessing - and honestly it IS only a guess - that you would realize greater benefit from being able to run a relatively small number of tasks at 50% greater speed.
 
Joined
Jun 19, 2013
Messages
4
Thanks so much for your detailed reply and suggestions!

With camcontrol devlist -v all three HBA's are using the mps driver.

For the parallel reads here is the result:

Code:
     tty             da1              da2              da3              da4              da5              da6              da7              da8              da9             da10             da11             da13             da14             da15             da17             da18             da19             da20             da21             da22             cpu
 tin  tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   112 127.38  46  5.73  127.37  47  5.79  127.35  44  5.43  127.39  47  5.85  127.23  46  5.76  127.18  45  5.54  127.26  46  5.76  127.21  45  5.53  127.41  44  5.53  127.38  46  5.66  127.40  45  5.57  127.34  44  5.52  127.35  46  5.67  127.36  46  5.71  127.37  46  5.73  127.35  45  5.59  127.34  44  5.41  127.36  45  5.65  127.36  44  5.46  127.37  46  5.68   0  0  0  0 99
   0  1120 127.36 1158 144.04  127.33 1104 137.30  127.23 1132 140.67  127.27 1187 147.54  128.00 1233 154.14  128.00 1229 153.64  128.00 1198 149.76  128.00 1187 148.39  128.00 1166 145.77  128.00 1193 149.14  128.00 1107 138.40  128.00 1146 143.27  127.24 1135 141.05  127.24 1147 142.55  127.37 1183 147.16  127.02 1141 141.56  127.21 1103 137.05  127.21 1104 137.18  126.97 1082 134.19  127.35 1142 142.04   2  0  8  2 88
   0   427 127.22 1109 137.77  127.20 1090 135.40  127.24 1134 140.89  127.24 1148 142.64  127.57 1147 142.88  127.56 1136 141.51  127.55 1092 136.01  127.43 1093 136.02  127.57 1155 143.88  127.46 1147 142.76  127.53 1049 130.64  127.44 1113 138.51  127.67 1140 142.13  127.69 1184 147.62  127.70 1224 152.62  128.00 1199 149.86  127.68 1151 143.50  127.67 1114 138.88  128.00 1121 140.12  127.68 1168 145.62   0  0  7  2 91
   0   427 127.58 1175 146.37  127.58 1178 146.75  127.29 1051 130.65  127.40 1231 153.12  127.78 1119 139.62  127.89 1120 139.87  127.90 1221 152.48  127.88 1050 131.13  128.00 1233 154.10  127.90 1209 150.98  128.00 1213 151.60  127.89 1163 145.24  127.24 1140 141.64  127.23 1132 140.64  127.36 1160 144.26  127.36 1157 143.88  127.32 1087 135.14  127.32 1087 135.14  127.54 1086 135.26  127.54 1087 135.38   0  0  6  2 92
   0   427 127.79 1208 150.73  127.80 1253 156.35  127.53 1053 131.14  127.59 1200 149.49  126.40 1075 132.69  126.21 1098 135.32  126.69 1130 139.79  126.28 931 114.82  126.55 1108 136.92  126.34 1039 128.19  126.20 1095 134.94  126.08 1024 126.07  126.17 1080 133.06  126.09 1101 135.57  126.15 1067 131.44  126.25 1127 138.94  127.55 1108 138.00  127.54 1087 135.38  127.78 1123 140.12  127.78 1102 137.50   0  0  6  2 92
   0   425 127.56 1135 141.39  127.56 1126 140.27  127.54 1072 133.54  127.54 1083 134.91  127.01 1114 138.18  127.00 1102 136.68  126.90 1125 139.43  126.68 936 115.74  126.82 1043 129.20  126.90 1010 125.21  126.99 1090 135.19  126.92 1022 126.71  127.36 1146 142.53  127.47 1158 144.15  127.52 1260 156.87  127.46 1131 140.78  128.00 1203 150.36  128.00 1145 143.12  128.00 1179 147.36  128.00 1130 141.25   1  0  5  2 92
   0   426 128.00 1236 154.49  128.00 1236 154.49  128.00 1164 145.50  128.00 1208 150.99  128.00 1244 155.49  128.00 1251 156.36  128.00 1267 158.36  128.00 1112 139.00  128.00 1166 145.75  128.00 1225 153.11  128.00 1120 140.00  128.00 1140 142.50  128.00 1186 148.24  128.00 1187 148.37  128.00 1243 155.36  128.00 1186 148.24  128.00 1210 151.24  128.00 1187 148.37  128.00 1212 151.49  128.00 1160 145.00   0  0  6  2 92



It's showing (if I'm reading it correctly) around 150MB/s read from each drive?

Testing dd with iostat of one drive provides just around the same results (150MB/s)

Inside my pool, running the write dd for around 8 minutes provided this result:

Code:
dd if=/dev/zero of=testfile bs=1048576
^C449135+0 records in
449134+0 records out
470951133184 bytes transferred in 465.301403 secs (1012142087 bytes/sec)


Which is roughly 965MB/s write.

During this process "Reporting" in the web portal shows around 30G of Memory wired, and CPU hitting around 88.37%

As for the read:

Code:
dd if=testfile of=/dev/null bs=1048576
449135+0 records in
449135+0 records out
470952181760 bytes transferred in 304.274736 secs (1547786016 bytes/sec)


Which is roughly 1476 MB/s

Unfortunately, the hardware, especially the CPUs, is likely staying with this system. On the bright side, the numbers I tested in the pool are very acceptable for our purposes.

10GigE at its theoretical maximum can push/pull 1280 MB/s. I know I should only expect 60-70% of that in reality, and that is roughly what I'm getting with the read of my AFP share. However, I don't know how I'm losing so much speed on the write with every protocol.
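
For reference, that ceiling can be sanity-checked in the shell (using binary megabytes, which is how the 1280 figure is derived; the 65% efficiency factor below is just the rough rule of thumb, not a measured value):

```shell
# 10 Gbit/s line rate expressed in binary MB/s, then a ~65% efficiency estimate
raw=$((10 * 1024 * 1024 * 1024 / 8 / 1048576))
echo "$raw"                  # raw line rate in MB/s: 1280
echo $((raw * 65 / 100))     # rough real-world expectation: 832
```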

It's a bit of a relief to run these tests and see that I don't have an inherent hardware problem, though, so thank you for that!

Should I now be looking into tuning the nic's? Or tuning the AFP protocol itself?

Thanks again for the input!
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
12,153
That's not bad for pool speeds, somewhat better than I was expecting (was guessing maybe 2/3-3/4 of that). But you notice the big huge difference between the 3000MB/sec the individual drives are capable of when aggregated and the pool speeds. Some of that is unavoidable ZFS overhead. You might be able to squeeze somewhat more pool speed with some tuning.

But it appears that the largest percentage drop here is protocol-based, which is both good and bad news. Samba is notorious for having performance that tracks pretty closely with your CPU speed, and basically you probably need to read up on tuning both the clients and Samba to see what changes would have the most impact. Samba is widely used, so you need not focus on just FreeNAS-related resources. For netatalk/AFP, I don't really have any significant experience with performance tuning. For NFS, I'll note that FreeBSD 9 uses a different NFS implementation than FreeBSD 8. I don't know which works better (yet).
 

cyberjock

Moderator
Joined
Mar 25, 2012
Messages
19,148
Stuff I've seen that helps 10Gb LAN ports (I pass these along, but I haven't tested them myself, as I've never had a 10Gb LAN card at my disposal):

1. Enable autotune(2 reboots are necessary for the settings to be created and applied if I remember correctly).

2. Sysctls:

net.inet.tcp.delayed_ack=0 - This disables delayed ACKs (sometimes confused with Nagle's algorithm, which is a separate sender-side mechanism). Delayed ACKs may be on by default, but there isn't much use for them at ultra high speeds like what you have.

3. Tunables:

kern.ipc.nmbclusters=262144 - This increases the mbuf clusters available and can help increase NIC throughput. You could also try doubling that value if this helps but you want even more speed; the amount of RAM it uses is negligible on your system. There was a thread where someone's 10Gb NIC wouldn't work at all without this setting.

4. If you are using CIFS, try the default settings with no tweaks first. More often than not tweaks don't help (and can often hurt) if you aren't used to tweaking FreeNAS/FreeBSD.

5. Check that you are using the 64 bit version of FreeNAS. It would be a shame to make such a noob mistake as installing the 32 bit version on that hardware. :p

6. When I did testing with jumbo packets I found they really don't add much value on local LANs, except in very specific situations such as database queries where the payload is always slightly larger than the MTU, requiring 2 packets per query. So much networking hardware and software doesn't play right with a non-default MTU that I don't bother with it even when I'm certain everything should be okay. Things backfire later when an update breaks jumbo packet support and you can't figure out why your transfer speeds suddenly tanked. Just not worth the troubleshooting time later, IMO.

7. If none of that works and you want to do some one-on-one troubleshooting send me a PM and I'll see if I can take a look.
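
For items 2 and 3 above, here is a sketch of what the equivalent plain-FreeBSD config files would look like (on FreeNAS these same values go into the GUI's Tunables and Sysctls pages rather than into files directly):

```shell
# /boot/loader.conf - loader tunables, applied at boot
kern.ipc.nmbclusters="262144"

# /etc/sysctl.conf - runtime sysctls, applied on every boot
net.inet.tcp.delayed_ack=0
```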
 
Joined
Jun 19, 2013
Messages
4
So I was working on performance issues again on Friday, and by the end of the day was giving up on AFP/Netatalk pretty much altogether. No matter what I tweaked, I was still only maxing out around 280MB/s write and 600-680MB/s read.

The read is great, but the write still wasn't cutting it. I am settling on NFS for OS X and CIFS for Windows. NFS gets me a clean 400MB/s write and 400-480MB/s read.

Meanwhile, CIFS on Windows is doing super well, with around 600-650MB/s write and 450-550MB/s read!

I do wonder if the new version of NFS would provide greater speeds.

Cyberjock:

Thanks for your 10Gbe tips! They are actually very close to what I had been experimenting with.

I read over this blog post: https://calomel.org/freebsd_network_tuning.html

Which does very similar things to the autotune. I was placing the settings into the Tunables and Sysctls pages respectively, as listed on the blog, instead of modifying /boot/loader.conf and /etc/sysctl.conf directly.

The one point you made about nmbclusters I think I played with on my client machine, but I do seem to have left that tunable out on the server. That could certainly make a big difference, thank you!

I am in fact using a 64 bit version of FreeNAS.

My new questions stem from curiosity of NFS since I've done about enough amateur TCP tuning to get minimal gains for AFP at this point.

I've read a lot about issues with NFS over TCP and how it will cause a lot of issues with performance and reliability. Does anyone know if NFS on FreeNAS uses UDP by default?

I would love some tips for tuning NFS on FreeNAS if anyone has some experience.
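
For experimenting from the OS X side in the meantime, mount options can force the transport and transfer sizes explicitly. A hedged sketch (the `server:/mnt/tank` path and the size values are placeholders; the option names are the standard NFS mount options):

```shell
# On the OS X client: mount over TCP with larger read/write sizes (values illustrative)
sudo mount -t nfs -o tcp,rsize=65536,wsize=65536 server:/mnt/tank /Volumes/tank

# Inspect the options (including transport) of currently mounted NFS filesystems
nfsstat -m
```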

Thanks again!
 
Joined
Jul 14, 2012
Messages
9
So I was working on performance issues again on Friday and by the end of the day was giving up on AFP/Netatalk pretty much all-together. No matter what I tweaked I was still only maxing out around 280MB/s Write and 600-680MB/s read.
fwiw, you may check with truss/ktrace/dtrace whether Netatalk is issuing disk IO in very small chunks of 8KB. This ought to be at least 256k, but afair there was a bug in early 3.0 releases that caused this. I can easily saturate 10 Gbit with latest Netatalk, given that the disks can cope with reads and writes (using SSDs in a stripe on a test system), on Solaris though, not FreeBSD. You also have to consider that the limiting factor may be the client, so before measuring network performance and that protocol layer, I'd do some tests with netperf [1].

-r

[1]
Code:
netperf -H SERVER_IP -fM -t TCP_STREAM -l 60
netperf -H SERVER_IP -fM -t TCP_RR -- -r 32,1048576 -s 2m -S 2m
netperf -H SERVER_IP -t TCP_SENDFILE -F /tank/bigfile -cC -- -s 1m -S 1m
 
Joined
Jul 14, 2012
Messages
9
fwiw, you may check with truss/ktrace/dtrace whether Netatalk is issuing disk IO in very small chunks of 8KB. This ought to be at least 256k, but afair there was a bug in early 3.0 releases that caused this.
D'oh! Afair FreeNAS still uses Netatalk 2.2 (for whatever reason), so that's definitely doing IO in the write case in chunks of only 8KB which is the reason for the bad performance. The problem is solved in Netatalk 3.
 
Joined
Jun 19, 2013
Messages
4
D'oh! Afair FreeNAS still uses Netatalk 2.2 (for whatever reason), so that's definitely doing IO in the write case in chunks of only 8KB which is the reason for the bad performance. The problem is solved in Netatalk 3.
Thank you for the tip!

I am going to try to compile AFP 3.0 tomorrow at work to see if it provides better results. I do realize, unfortunately, that this will mean I won't be able to administer AFP from the browser anymore, because all of the config files changed in 3.0. Oh well, back to nano.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
12,153
Doing I/O in 8KB chunks will incur higher syscall overhead, but ZFS does aggregate writes into a transaction group before committing them to the pool, unless you're trying to do sync writes, which doesn't appear to be the case here.

# dd if=/dev/zero of=testfile bs=8192

gets me 247MB/sec while

# dd if=/dev/zero of=testfile bs=262144

gets me 275MB/sec. This should give you some idea as to the impact of the smaller write size as it affects the system and ZFS.
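
If you want to repeat that comparison yourself, a bounded version (64 MB total so it finishes quickly; the /tmp path is an assumption, substitute a path on your pool) looks like:

```shell
# Write 64 MB once in 8 KB blocks, then once in 256 KB blocks; dd prints
# its own transfer-rate summary for each run so the two can be compared.
dd if=/dev/zero of=/tmp/ddtest bs=8192 count=8192
small=$(( $(wc -c < /tmp/ddtest) ))     # bytes written with the small block size
dd if=/dev/zero of=/tmp/ddtest bs=262144 count=256
big=$(( $(wc -c < /tmp/ddtest) ))       # bytes written with the large block size
rm /tmp/ddtest
```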


Created for your issue:

https://support.freenas.org/ticket/2357

We make some minimal use of AFP here, so I could only make the most basic of comments. Please feel free to take and expand that ticket if there are further things that ought to be included.
 
Joined
Jul 14, 2012
Messages
9
Well, if Netatalk were only writing 8KB of data to the filesystem in a loop, it wouldn't be that bad, as you have shown. But that's not the case: before Netatalk can write the data to the filesystem it must receive it from the network, so what it does per 8KB chunk is:
* select(network fd) until data is available
* read 8KB
* stat(filesystem fd)
* write 8 KB
* some bookkeeping involving a few more syscalls
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
12,153
That's fine, but that's not the problem you previously described. You described "issuing disk IO in very small chunks of 8KB," not "the server's I/O event loop is processing 8KB chunks of client data synchronously/one at a time."

On a small memory system, there can be compelling reasons to keep a small buffer size. Unfortunately, naive code typically locks that in. Since there can be design complications on both the client-facing and system-facing side, it is better to have some flexibility, especially where you have a filesystem with a substantially larger block size. Me, I've always liked write coalescing with writev, except that I've found most systems have kind of a smallish MAXIOV (256, 1024, etc). But if you write the code right, it allows your program to support a single 8K buffer on a system with scarce resources, or a much larger buffer on resource-rich systems, all with a little adjustment of the number of iovec's you hand off...
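
As an aside, that per-call iovec ceiling is queryable on most POSIX systems:

```shell
# IOV_MAX is the per-call limit on the number of iovecs writev() accepts;
# POSIX guarantees at least 16, and 1024 is a common value.
getconf IOV_MAX
```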

"btdt" ...

So I'm curious what the purpose of the stat is in the loop you describe. I mean, I could see utimes() for curiously anal-retentive code that didn't understand that the lack of atomicity didn't actually promise the result the coder sought, but my insufficiently caffeinated and half-asleep brain can't puzzle out the stat.
 

aufalien

FreeNAS Experienced
Joined
Jul 25, 2013
Messages
337
Thought I'd add to this simply to get people's attention: this is a FANTASTIC post. Would love to see it stickied, if you will.
 