Poor NFS / VMware writes even with mirrored SSD ZIL

Status
Not open for further replies.

dwright1542

Cadet
Joined
May 29, 2013
Messages
5
I've got a Supermicro server, quad Xeon, 16GB RAM, running 8 x 7200 RPM 1TB enterprise SATA drives, with 10GigE to an ESX 4.1 box.

Disabling sync writes gives me 400meg+ writes.

I've got dual Samsung 830s, mirrored for the ZIL.

I'm getting 10meg writes (even large sequential) with the ZIL enabled and sync on. Without the SSDs I get 1meg.

I've been reading all day and can't figure out what went wrong. I even tried the no_write_throttle tunable.


Any ideas?


...more info: dd is showing 70k at 4k blocks, and 250k at 1024k blocks, so it's not the SSDs. iperf shows 8gig or so across the LAN.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Even your answer was ambiguous. Now listen, I'm not trying to pick on you, I'm sure YOU know what you mean, but to get useful help, you need to help other people know what you mean. "MB/sec" is supposed to mean "megabytes per second" but is frequently corrupted to mean "megabits per second". "MBit/s" and "mbit/s" (or /sec), and "MByte/s" and "mbyte/s" (or /sec) clearly and most importantly unambiguously convey the right thing.

Because this isn't helpful:

dd is showing 70k at 4k blocks, and 250k at 1024k blocks

because I have to sit here and reverse-engineer your numbers to see if I can figure out what you actually saw. 70 KBytes/sec with 4096-byte blocks doesn't make a ton of sense, and was that reading or writing or what?

Now, the comment that I _can_ make is this:

Look at what's happening. Run "iostat ${dev} 1" on your SLOG device under load and check the "KB/t" size, probably ~36KBytes. Also see how many per second, "tps". That gives you a realistic idea of the sort of load the device is under. You can then use dd to establish what the practical maximum tps could be; you won't get that in the real world, of course, but it is often less than you'd think.
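As a rough sketch of what I mean (assuming ada1 is the SLOG SSD and that it holds nothing you care about, since writing to the raw device is destructive):

# watch the SLOG under real NFS load; KB/t is the average transfer size,
# tps is transactions per second, MB/s is the resulting throughput
iostat ada1 1

# rough ceiling for ZIL-sized writes: push ~32KByte blocks at the raw device
# and see what rate it will actually sustain
dd if=/dev/zero of=/dev/ada1 bs=32k count=10000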

As a point of comparison, our poor little N36L here is eating backups from ESXi via NFS. I had it running "sync=disabled" and it was sucking it down at ~400-500Mbit/s, which has dropped to around ~220Mbit/s with an Intel 320 MLC acting as SLOG, underprovisioned to allow it a better chance at maintaining speeds. Still, the addition of the SLOG is adding latency to the whole process. Not really unexpected.

If you really want to be able to accelerate NFS, you have to reduce latency. SLC is better for that. Eliminating the SATA/SAS bus is better for that.

Does the Samsung 830 actually have a supercapacitor? That just struck me...
 

dwright1542

Cadet
Joined
May 29, 2013
Messages
5
"MB/sec" is supposed to mean "megabytes per second" but is frequently corrupted to mean "megabits per second". [...] Run "iostat ${dev} 1" on your SLOG device under load and check the "KB/t" size, probably ~36KBytes. Also see how many per second, "tps". [...] Does the Samsung 830 actually have a supercapacitor? That just struck me...


No supercap. This is just a test for now, for moving to ZFS instead of a hardware battery-backed RAID setup; we thought the SSDs would actually be faster. And I'm not surprised that my numbers make no sense...you're right...terrible info.

OK, so while doing a write test I ran "iostat ada1 1":

       tty             ada1              cpu
 tin  tout   KB/t  tps  MB/s  us ni sy in id
   0    44  26.12  432 11.01   2  0  8  0 89
   0   132  25.56  458 11.43   2  0  3  1 95
   0    44  26.13  445 11.35   1  0  3  1 94
   0    44  26.36  450 11.57   1  0  4  1 95
   0    44  26.02  428 10.87   1  0  3  1 95
   0    44  25.71  391  9.81   1  0  9  2 88
   0    44  25.91  445 11.25   1  0  4  1 94

[root@freenas ~]# dd if=/dev/zero of=/dev/ada1 bs=4k count=20k
20480+0 records in
20480+0 records out
83886080 bytes transferred in 1.179267 secs (71134090 bytes/sec)

[root@freenas ~]# dd if=/dev/zero of=/dev/ada1 bs=1024k count=20k
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 93.898528 secs (228702590 bytes/sec)
[root@freenas ~]#
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Ok, NOW that's making more sense. Really, my guess is that you're bumping up against latency and the practical number of IOPS you can sustain through the device, possibly combined with bad pool design, which we haven't even looked at. Are you doing RAIDZ or mirroring, for example? 4096-byte sectors?
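If you want to check both quickly, something like this is the idea (pool name "tank" is just a placeholder; on FreeNAS you may need to point zdb at the system's zpool.cache with -U):

# mirror vs. RAIDZ layout of the data vdevs
zpool status tank

# ashift=12 means the pool was created for 4096-byte sectors, ashift=9 means 512-byte
zdb -C tank | grep ashift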

But back to the SSD for a bit.

Is it underprovisioned? If not, you really ought to do that. You can't fix it easily, you've got to use the Samsung Magician tool to securely wipe the drive, which zeroes it out and returns all the pages to zero. This lets the drive controller know that all the pages are actually unallocated. Then carve out a small partition to use for SLOG. This won't make a difference now, but it will help to maintain speeds later as you take advantage of the controller's wear-leveling and error protection.
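Roughly, the post-wipe part looks like this from the command line (device names, the 16G size, and the pool name "tank" are only examples):

# label a small partition on each freshly erased SSD; leave the rest unallocated
# so the controller can use it for wear leveling
gpart create -s gpt ada1
gpart add -b 2048 -t freebsd-zfs -s 16G -l slog0 ada1   # -b 2048 starts at a 1MiB boundary
gpart create -s gpt ada2
gpart add -b 2048 -t freebsd-zfs -s 16G -l slog1 ada2

# attach the pair as a mirrored log device
zpool add tank log mirror gpt/slog0 gpt/slog1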

And if you don't care enough about data protection to get the right kind of SSD for the job (i.e. something with power protection), have you considered that perhaps your risk tolerance is such that you could just live with sync=disabled and a good set of backups?
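If you do go that way, it's a one-liner per pool or dataset (pool name again a placeholder):

# accept the risk of losing the last few seconds of writes on power loss
zfs set sync=disabled tank
zfs get sync tank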

You can also cheat and gain ZIL acceleration in other ways. I've got an LSI2208 RAID controller with BBU on this box I've been playing with. It has two 500GB Seagate Momentus XT's mirrored in RAID1. Using a small partition on that turns out to be a fairly low-latency SLOG, except that its throughput is quickly limited to the speed of the Momentus drives (the 1GB RAID cache soaks up the first few seconds at massive speed though).

But fundamentally you have to remember that the process has to go:

ESXi NFS client request -> TCP -> Network -> TCP -> NFS server -> ZFS -> ZIL -> write request to physical device -> acknowledge -> ZFS returns -> NFS server acknowledges -> TCP/Network/TCP returns ack -> NFS client is happy and moves on

so there are lots of layers in there and even with ideal everything, speeds are often substantially lower than what you might wish for.
 

dwright1542

Cadet
Joined
May 29, 2013
Messages
5
Are you doing RAIDZ or mirroring, for example? 4096-byte sectors? [...] Is it underprovisioned? [...] And if you don't care enough about data protection to get the right kind of SSD for the job (i.e. something with power protection), have you considered that perhaps your risk tolerance is such that you could just live with sync=disabled and a good set of backups?

Thanks for your input, it's much appreciated.

It IS underprovisioned: 256GB SSD provisioned for a 20GB ZIL.

They are 4096-byte sectors.

Mirrored RAID, and with sync=disabled the pool can write 300MBytes/sec.

Still, 10MBytes/sec sequential through a drive that can do 200MBytes/sec? That's a heck of a penalty.

I sure would be getting a supercap-based SSD; however, that's as pricey as a good NVRAM-backed RAID controller, especially since the test was against an LSI with CacheCade. This really was a proof of concept, so I'm not heartbroken that it doesn't work. It just seems like I missed something basic in the setup for it to perform so abysmally.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I agree that the 10MBytes/sec seems too low, but it isn't clear what is going wrong for you.

What sort of local speeds do you get if you set sync=always and try a local dd write?
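Something along these lines is what I'm thinking of (assuming the pool is called "tank" and mounted at /mnt/tank; put sync back to standard afterwards):

# force every write through the ZIL/SLOG, then write locally so NFS and the
# network are out of the picture
zfs set sync=always tank
dd if=/dev/zero of=/mnt/tank/synctest bs=128k count=8192

# clean up
zfs set sync=standard tank
rm /mnt/tank/synctest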
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
...you've got to use the Samsung Magician tool to securely wipe the drive, which zeroes it out and returns all the pages to zero.

Technically, for both SLC and MLC, if the flash memory is erased, it's full of 1's.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Quiet, David Bowman. That little detail is, IIRC, usually only visible to the controller.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Quiet, David Bowman. That little detail is, IIRC, usually only visible to the controller.

Hmm. I'll have to test this. I have a spare Intel SSD lying around!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So I did a secure internal erase of a spare Intel 120GB G2 SSD. And then did a sector read. The drive is full of 0's. So no fun!
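(For anyone who wants to repeat it, something along these lines will dump the first few sectors of the raw device so you can see whether they read back as 0x00 or 0xff; the device name here is just an example:

dd if=/dev/ada3 bs=512 count=8 | hexdump -C
)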
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I've seen something similar when I compared a STEC SSD drive with a STEC ZeusRAM in my setup. When testing from a Linux VM hosted on an NFS share (sync=standard) I can get 100MBytes/s with the ZeusRAM (which is sorta disappointing considering the hardware involved, but acceptable for my production system since it can deliver the IOPS I need), but when I replaced the ZeusRAM with the SSD drive the performance went through the floor. In fact I got better performance with no ZIL device at all.

I've been meaning to do a more thorough investigation of it all and try some other configurations (such as putting the ZIL device on a dedicated controller), but I don't believe it's an ESXi or NFS issue. Also, I did notice in raw benchmarking of the drives that they all perform much worse when you move away from the nice 4K block sizes the manufacturers all use for the data sheets they post.
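(As an example of the sort of in-guest test I mean, forcing more-or-less synchronous writes from the Linux VM looks something like this; the path and sizes are arbitrary:

# large sequential sync-ish writes
dd if=/dev/zero of=/root/ztest bs=1M count=1024 oflag=dsync

# small-block version, which is much harder on the ZIL device
dd if=/dev/zero of=/root/ztest bs=4k count=20000 oflag=dsync
)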
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Also, I did notice in raw benchmarking of the drives that they all perform much worse when you move away from the nice 4K block sizes the manufacturers all use for the data sheets they post.

That's consistent with what I've seen.
 