Slow write speeds on a ZFS setup and a gigabit network


childersc

Cadet
Joined
Jun 29, 2012
Messages
9
Ok,

I am not sure where my bottleneck is. Any help would be greatly appreciated.

These are the drives I have:

ST31000528AS - 1 TB
ST31000528AS - 1 TB
WDC WD15EADS-00P8B0 - 1.5 TB
WDC WD15EADS-00P8B0 - 1.5 TB
Corsair CSSD-F40GB2 - 40 GB

This is the processor information


[root@storage01] ~# sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
hw.machine: amd64
hw.model: Intel(R) Xeon(R) CPU X3440 @ 2.53GHz
hw.ncpu: 8
hw.machine_arch: amd64
[root@storage01] ~#

All my boxes are connected through a Cisco 3750G-24 switch. The network traffic between the storage box and the other boxes is all internal. Each machine has two gigabit NICs, not bonded.

Here is a network speed test between both machines.

Storage server listening

[root@storage01] ~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 4] local 10.0.0.2 port 5001 connected with 10.0.0.11 port 33263
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec

Node Transmitting


[root@node01 ~]# iperf -c 10.0.0.2
------------------------------------------------------------
Client connecting to 10.0.0.2, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.11 port 33263 connected with 10.0.0.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.10 GBytes 942 Mbits/sec
[root@node01 ~]#

---------------------------------------------------------------------------------
Node Listening

[root@node01 ~]# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.0.0.11 port 5001 connected with 10.0.0.2 port 54556
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.1 sec 1.10 GBytes 936 Mbits/sec

Storage Server transmitting


As you can see, the network speeds check out.

The disks are set up in a RAIDZ. I am only using the two 1 TB drives and one of the 1.5 TB drives; the other 1.5 TB drive is connected but is now having problems.

The 40 GB SSD is set up as my ZIL device.

The server also has 4 GB of registered ECC RAM.
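
For reference, the equivalent pool layout from the command line would look roughly like this (I actually built the pool through the FreeNAS GUI, and the pool name and device names here are just for illustration):

Code:
zpool create tank raidz ada0 ada1 ada2   # 2x 1TB + 1x 1.5TB in a single RAIDZ vdev
zpool add tank log ada3                  # the 40GB Corsair SSD as a separate log (ZIL) device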


Here are some dd tests.

DD Writing to the NAS
[root@node01 san]# time dd if=/dev/zero of=/san/test.file bs=1MB count=100
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 16.5313 s, 6.0 MB/s

real 0m16.534s
user 0m0.000s
sys 0m0.128s


DD Reading from the NAS writing to the NODE

[root@node01 san]# dd if=/san/test.file of=/dev/null bs=1MB
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 0.853853 s, 117 MB/s

[root@node01 san]# dd if=/san/test.file of=/root/testfile bs=1MB
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 0.0817032 s, 1.2 GB/s
[root@node01 san]#
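
For completeness, a variant of the write test that pushes the data out of the client's page cache before dd reports its rate is probably worth running too (conv=fdatasync is a GNU dd option; the size is just what I happened to pick):

Code:
time dd if=/dev/zero of=/san/test.file bs=1M count=1000 conv=fdatasync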
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
protosd,

Everything I have read says I should use a ZIL device separate from the array for faster writes; unfortunately, that is not proving to be the case here.

This machine isn't in production, so changes can be made. I just want to speed up my writes somehow. I have also tried UFS, which had slow writes as well.
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
There is a bug/feature with this version of ZFS that will cause you to lose your pool if your ZIL goes offline or gets detached. A few people have discovered this the hard way. With your hardware, ZIL isn't going to make much if any difference. More RAM is what you need instead of ZIL.

There are a couple of tuning parameters, something like txg.timeout or write.limit.override, that I think might help. You'll need to search the forums; those variable names aren't exact, they're just off the top of my head.
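
The exact names should be easy to spot from the shell with something like this (what gets exposed varies between versions, so treat this as a starting point):

Code:
sysctl -a | grep 'vfs.zfs.txg'   # on 8.x this should list things like txg.timeout and txg.write_limit_override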

You didn't mention which version of FreeNAS you're using. If you try 8.2 Beta 4, you could enable Autotune just to see what values it comes up with, though I'm not sure it touches the variables I mentioned.
 

survive

Behold the Wumpus
Moderator
Joined
May 28, 2011
Messages
875
Hi childersc,

A couple of thoughts.....

Get the dodgy drive replaced. There's no telling exactly what sort of havoc it's causing even if it's not actually part of the pool... if it's not in the pool, disconnect it!

Get more memory. ZFS loves RAM, it craves RAM, and it will work much better if you give it enough. The X3440 is a Lynnfield proc, so I would think it takes registered DDR3, which is surprisingly cheap these days.

For what it's worth, I don't think what you are seeing right now is all you can expect out of this box. For grins, why not delete the existing pool, make a new pool using just the SSD, and see what sort of speed you get?

-Will
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
survive,

I will give that a shot. I was thinking it may be a drive-related issue.

As for the failing HDD, I already pulled it; unfortunately, the slow writes were happening even before that drive failed. I will update in a bit, once I have created a new pool containing only the SSD, to see if there is a performance difference.

As for memory, I plan to take this box up to its supported 16 GB of ECC, but for now I just want to do some testing. I don't think memory is causing the write-speed issue, though, because even without ZFS in place I still saw slow writes with UFS.

Oh, this is the system information for the other user who requested it.

FreeNAS Build: FreeNAS-8.0.4-RELEASE-p3-x64 (11703)
Platform: Intel(R) Xeon(R) CPU X3440 @ 2.53GHz
Memory: 4073MB
System Time: Mon Jul 9 00:09:28 2012
Uptime: 12:09AM up 4:35, 1 user
Load Average: 0.00, 0.00, 0.00
OS Version: FreeBSD 8.2-RELEASE-p9
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
Ok,

So I created the ZFS pool containing only the SSD drive. Here are the results.

[root@node01 home]# time dd if=/dev/zero of=/san/test.file bs=1MB count=100
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 0.949327 s, 105 MB/s

real 0m0.977s
user 0m0.000s
sys 0m0.132s
[root@node01 home]# time dd if=/dev/zero of=/san/test.file bs=1MB count=1000
1000+0 records in
1000+0 records out
1000000000 bytes (1.0 GB) copied, 9.62531 s, 104 MB/s

real 0m9.649s
user 0m0.001s
sys 0m1.122s


Also, here is my dmesg output for all the attached drives.

da0 at umass-sim0 bus 0 scbus6 target 0 lun 0
da0: < 1100> Removable Direct Access SCSI-4 device
da0: 40.000MB/s transfers
da0: 3810MB (7802880 512 byte sectors: 255H 63S/T 485C)
ada0: <ST31000528AS CC37> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <ST31000528AS CC37> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada2 at ahcich3 bus 0 scbus3 target 0 lun 0
ada2: <WDC WD15EADS-00P8B0 01.00A01> ATA-8 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich4 bus 0 scbus4 target 0 lun 0
ada3: <Corsair CSSD-F40GB2 1.1> ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 38164MB (78161328 512 byte sectors: 16H 63S/T 16383C)



Clearly, with the SSD as the only drive in the pool, the write speeds are fine. Obviously, using the SSD as my only drive is not feasible. Is there something wrong with my three HDDs that I am missing?
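
In case it matters, I also plan to glance at the SMART counters on the spinning drives; something along these lines should surface the usual trouble attributes (attribute names vary a bit by vendor):

Code:
smartctl -a /dev/ada0 | egrep -i 'Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC'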
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Check with iostat (or the newer interactive "gstat") to see how busy your disks are during writes. There's lots of room for ZFS performance unhappiness due to various factors. Adding more memory can cause problems too, though those seem to be addressable via tuning. See my bug report #1531 for an example.
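
Something like this, run while the client is writing, will show whether the disks are pegged (device names assumed from your dmesg):

Code:
gstat -I 1s                      # interactive per-device %busy, refreshed every second
iostat -x -w 1 ada0 ada1 ada2    # extended per-disk stats for the pool members, 1-second samples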
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
Jgreco,

Can you link me to that?

I will check iostat again tomorrow; I forgot to post that too. None of these systems are in production, and the only time anything shows up in iostat is when I try to write to the server from node01.
 

survive

Behold the Wumpus
Moderator
Joined
May 28, 2011
Messages
875
Hi childersc,

What's the controller?

Is it just me or is it odd that the drives are coming up as "PIO"?

If you are using the onboard SATA ports, be sure they are running in "AHCI" mode.

-Will
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
survive,

They are all connected via the onboard SATA ports. This is a Supermicro X8SIL-F motherboard. I did have the Adaptec controller enabled; I have disabled that and put the onboard ports into AHCI mode. However, the drives are still showing "PIO" in dmesg.

The controller is reporting as

ahci0 <Intel 5 Series/3400 Series AHCI SATA controller>
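
To double-check what the drives actually negotiated, I'll query them with camcontrol (part of the base system); from what I can tell, the "PIO 8192bytes" in those dmesg lines is just the maximum PIO transfer size and the drives are really running UDMA6 at 300MB/s, but I could be wrong:

Code:
camcontrol devlist       # shows which controller channel each drive is attached to
camcontrol identify ada0 # the header lines repeat the negotiated speed and DMA mode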
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
jgreco,

Were you able to resolve your issue, or is it something you're still running into as well?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
My analysis of it suggests that some ZFS implementation choices result in Unexpected Behaviours(tm), and that FreeNAS, as a downstream consumer of the design, isn't directly at fault, but is also unable to actually correct the issue, because it's designed to work that way.

You have a "transaction group" (txg) in ZFS that's basically a bunch of stuff that gets written out all at once; in FreeNAS, the size of a transaction group is decided by available memory (actually something based off L2ARC size IIRC) and is limited to a 30 second "build-up" time. Problem is, if you have a machine with a good amount of memory, like 32GB, and slowish disks, like my 2005-era Barracuda 400GB's, and you're using a slow strategy like RAIDZ2, you have a perfect tsunami of factors where even sequential writes can cause incredibly large txg's to be written, and they write fairly slowly, so your system hangs ... in my case, for minutes at a time.
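
To put illustrative numbers on it (these are not measurements from my box): if a txg is allowed to buffer, say, 4GB of dirty data and the RAIDZ2 vdev can only flush it at around 40MB/sec, that's on the order of 100 seconds during which new writers are stalled.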

Now, really, the correct solution to this is for the system to plan its I/O better on my behalf. So I include some related messages and thoughts.

http://osdir.com/ml/freebsd-questions/2010-03/msg01144.html

"Zfs tries to learn the write bandwidth so that it can tune the TXG size up to 5 seconds (max) worth of writes. If you have both large memory and fast storage, quite a huge amount of data can be written in 5 seconds."

Well, as a longtime storage guy, I can tell you that that *could* work, but trying "to learn the write bandwidth" is a heinously difficult problem. For example, if you write sequential data to a disk, it will go about as fast as the disk is capable of; let's say 100MB/sec. If you then switch to writing random blocks, your typical hard disk drive is maybe capable of 100 IOPS at 512 bytes per sector, which is roughly 50KB/sec in the pathological case. So any learning algorithm that doesn't take data locality into account is only going to work sometimes, and maybe not as expected.

It gets worse. The problem here was understood at some level by both Solaris and FreeBSD developers, but the twiddleable picked to "fix" it is on the wrong side of the equation:

http://osdir.com/ml/freebsd-questions/2010-03/msg01146.html

write_limit_override specifies a maximum amount of data. But that's really not likely to be the right tunable for many workloads, due to the problem I outlined above. What you probably want is something along the lines of an upper limit on the number of zones being written to the disk; writing many blocks to a single zone (blocks with strong locality to each other) should be much faster than writing to many zones. And that really needs to be figured out on a per-component-device basis, which is messy.

http://mail.opensolaris.org/pipermail/zfs-discuss/2011-June/049102.html

You can read more about the implementation concept here:

https://blogs.oracle.com/roch/entry/the_new_zfs_write_throttle

Anyways, you can largely "fix" (but not actually FIX) the problem through tuning, but you'll be tuning for a particular workload and it seems that it is always possible to come up with workloads that negatively impact a particular configuration. Just going for the "crazy random writes all over the place" is a workload that seems to be most effective at bringing out latencies in any HDD based ZFS config I've tried. In fairness, any other filesystem has trouble with these too, because it is the slowness of the underlying devices that causes a problem, but ZFS is particularly ugly due to the "let's lock up the application doing the writing until the txg is flushed" behaviour, because this causes iSCSI initiators to get Really Ticked Off, disconnect, and then be unable to reconnect. Bleah.

So the tradeoff you can make is to lower the 30 seconds down to 5 (the new default anyway) or less, except that this results in decreased performance. And/or you can set a lower write_limit_override, which also results in decreased performance.
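
On FreeNAS those knobs end up as tunables; the shape of it is roughly this (names as they exist in the 8.x-era ZFS code, values purely illustrative, and some of them can also be poked live with sysctl):

Code:
# /boot/loader.conf (or the FreeNAS "Tunables" screen)
vfs.zfs.txg.timeout="5"                       # flush a txg after at most 5 seconds
vfs.zfs.txg.write_limit_override="268435456"  # cap each txg at roughly 256MB of dirty data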

Decreased performance isn't always a bad thing, by the way, it's just a little frustrating. :smile:
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I would test all the drives directly first. Assuming the spinning drives are coming up the same as in the previous post, run the following dd commands, from the [thread=981]performance sticky[/thread], in three SSH sessions, one command per session, at the same time.

[size=+1]This will destroy data on your disks.[/size] Only run it on disks not in an array.
Code:
dd if=/dev/zero of=/dev/ada0 bs=2048k count=50k
dd if=/dev/zero of=/dev/ada1 bs=2048k count=50k
dd if=/dev/zero of=/dev/ada2 bs=2048k count=50k


Note the times, then follow up by running the test against each drive singly, e.g.
Code:
dd if=/dev/zero of=/dev/ada0 bs=2048k count=50k
No need to test the SSD, as you already know it performs properly.
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
Well,

This is interesting.

One drive finished; the other two drives are not done...


[root@storage01] ~# dd if=/dev/zero of=/dev/ada2 bs=2048k count=50k

51200+0 records in
51200+0 records out
107374182400 bytes transferred in 11575.956841 secs (9275620 bytes/sec)
[root@storage01] ~#
[root@storage01] ~#


How long is this specific one supposed to run?
 

childersc

Cadet
Joined
Jun 29, 2012
Messages
9
OK, with the 1.5 TB drive as the only drive in the pool (as a test), I get the following results on my node via NFS:


[root@node01 home]# time dd if=/dev/zero of=/san/test.file bs=1MB count=100
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 0.898948 s, 111 MB/s

real 0m0.924s
user 0m0.000s
sys 0m0.090s
[root@node01 home]# time dd if=/dev/zero of=/san/test.file bs=1MB count=1000
1000+0 records in
1000+0 records out
1000000000 bytes (1.0 GB) copied, 523.778 s, 1.9 MB/s

real 8m43.803s
user 0m0.000s
sys 0m0.757s
[root@node01 home]# time dd if=/dev/zero of=/san/test.file bs=1MB count=100
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 0.899186 s, 111 MB/s

real 0m1.108s
user 0m0.000s
sys 0m0.165s
[root@node01 home]#
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
How long is this specific one supposed to run?
It takes a bit less than 10 minutes running it directly on my drive. If I run it on both of my drives at the same time it takes... a bit less than 10 minutes.

Did you run the tests as I suggested? You were supposed to run the three tests concurrently, assuming you still have three spinning drives, which tests how the system uses all the drives at once; a raidz array writes to all of them at the same time.

You were then supposed to run the test against each drive individually. That would show whether there is an issue with a particular drive, and whether there is a config issue when using all the drives at the same time.

Have you done that? :confused: If so, what were the results?

We aren't interested in NFS tests yet, as those throw many more variables into the mix. Also, you should read through the [post=3947]first post[/post] of the performance sticky; your 100 MB and even your 1 GB test files are too small to be testing with.
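
With only 4GB of RAM you want test files several times that size, run locally on the pool, something like this (the mount point is whatever your dataset actually uses):

Code:
dd if=/dev/zero of=/mnt/tank/ddtest bs=2048k count=10k   # roughly 20GB sequential write
dd if=/mnt/tank/ddtest of=/dev/null bs=2048k             # read it back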

Are you running the latest BIOS revision on the X8SIL-F?
 