Ok, so, now for some fun.
Got a BBU for this LSI2208. Hooked up a pair of 500GB Momentus XT's in RAID1. Turned on writeback. These drives in RAID1 are capable of around 100MB/sec.
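For anybody wanting to flip the array to write-back from within FreeBSD rather than from the controller setup utility, mfiutil can do it; something along these lines ought to work (from memory, so check the man page for the exact syntax on your version):
Code:
# enable controller caching on the RAID1 volume and set the write policy to write-back
mfiutil cache mfid1 enable
mfiutil cache mfid1 write-back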
Code:
File size set to 9216000 KB
Record Size 4 KB
Command line used: iozone -f /dev/mfid1 -s 9000m -r 4k -i 0 -i 1 -i 2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                          random   random     bkwd   record   stride
          KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
     9216000      4    87720    87529    87401    90037      749      883
iozone test complete.
Great, except for random reads/writes, where the controller's BBU-backed cache is too small relative to the 9GB test file to be effective; the random ops were running at around 180 tps. For sequential ops, though, it was running >20K tps.
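(For anyone keeping score, those tps numbers are just the KB/sec figures divided by the 4KB record size; a quick back-of-the-envelope check, not iozone output:)
Code:
# my own arithmetic, not part of the benchmark
echo "random read:      $(( 749 / 4 )) ops/sec"      # ~187
echo "sequential write: $(( 87720 / 4 )) ops/sec"    # ~21930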
Code:
File size set to 921600 KB
Record Size 4 KB
Command line used: iozone -f /dev/mfid1 -s 900m -r 4k -i 0 -i 1 -i 2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                          random   random     bkwd   record   stride
          KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
      921600      4    91248    87168    65833    90215    69391     6376
iozone test complete.
It works better with a larger percentage of the data fitting in cache; here it was running around 1600 tps on the random writes.
Code:
File size set to 204800 KB
Record Size 4 KB
Command line used: iozone -f /dev/mfid1 -s 200m -r 4k -i 0 -i 1 -i 2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                          random   random     bkwd   record   stride
          KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
      204800      4    97924    97344    81277    74404    82129    90283
iozone test complete.
But working primarily out of cache? Way cool speed.
Now the thing is, that also seems to be about the max speed the controller can handle from a single process, latency and all that. The controller seems to peak out around 72,000 tps if I flood it with cache work.
Code:
[root@freenas] ~# dd if=/dev/zero of=/dev/mfid1 bs=4096
^C4019701+0 records in
4019700+0 records out
16464691200 bytes transferred in 182.622291 secs (90157073 bytes/sec)
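Something along these lines will flood it with multiple concurrent 4KB writers, which is roughly what I mean by "cache work" (the offsets and process count here are just for illustration, not the exact commands I used):
Code:
# illustration only: four 4KB writers hammering different regions of the volume at once
for i in 0 1 2 3; do
    dd if=/dev/zero of=/dev/mfid1 bs=4096 seek=$(( i * 25000000 )) count=1000000 &
done
wait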
A SLOG device should mostly be writing sequentially, so maybe the poor random performance isn't a major issue (and small bursts of random writes perform very well anyway). I would expect that for many uses, SLOG writes at a rate of 21,000 or 22,000 per second would be pretty acceptable. And that does seem to be a controller-dictated limit rather than a drive limit... with a larger block size we just bang up against the speed of the underlying drives instead (we were already doing 90MB/sec). Trying these tests with the SSD RAID1 also peaks out around 22,000 tps at 4KB, or 18,000 tps at 16KB.
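(The 16KB numbers are just the same kind of iozone run with a bigger record size, e.g. the following against mfid1, or against whatever device node the SSD RAID1 shows up as:)
Code:
iozone -f /dev/mfid1 -s 900m -r 16k -i 0 -i 1 -i 2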
So.
Assuming generally sequential ZIL writes, it appears that using a hard disk behind a write-back cache could get you a substantial number of ZIL ops per second, possibly pushing close to the sequential capacity of a moderately fast hard drive.
I haven't looked at whether the ZFS write path would allow enough concurrency here; if not, then there's probably a practical cap (in my case ~21K IOPS) with this technique. Otherwise, it seems likely that an SSD would let you push harder and get past that point.
So I guess now I have to go cram some disks in this and make a pool and do some tests.
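For reference, the general idea is just an ordinary pool with the BBU-backed mirror hung off it as the log device; the data disk names and pool name here are placeholders:
Code:
# placeholder disk names; mfid1 is the BBU-backed RAID1 volume
zpool create tank raidz2 da0 da1 da2 da3 da4 da5
zpool add tank log mfid1
zpool status tank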
And... I'm stunned. Wow.
Code:
tty mfid1 cpu
tin tout KB/t tps MB/s us ni sy in id
0 44 0.00 0 0.00 0 0 0 0 100
0 133 0.00 0 0.00 0 0 0 0 100
0 45 0.00 0 0.00 0 0 0 0 100
0 45 35.93 755 26.50 0 0 6 2 92
0 44 35.90 2975 104.30 0 0 20 7 73
0 46 35.94 3084 108.24 0 0 19 8 72
0 46 35.93 3060 107.35 0 0 20 9 71
0 46 35.93 3041 106.70 0 0 20 10 70
0 46 35.93 2956 103.71 0 0 25 7 68
0 46 35.93 3017 105.87 0 0 25 6 69
0 46 35.96 3140 110.28 0 0 23 8 68
0 46 35.92 3103 108.84 0 0 22 9 70
0 46 35.94 3116 109.37 0 0 21 11 68
0 46 35.96 2981 104.69 0 0 26 9 65
0 46 35.94 3124 109.62 0 0 21 10 69
0 46 35.91 2723 95.50 0 0 18 8 75
0 45 35.92 2446 85.83 0 0 15 7 78
0 45 35.92 2384 83.61 0 0 14 8 78
0 45 35.90 2252 78.95 0 0 22 8 70
0 45 35.93 2358 82.74 0 0 19 9 72
0 45 35.96 2410 84.61 0 0 17 6 77
tty mfid1 cpu
tin tout KB/t tps MB/s us ni sy in id
0 45 35.93 2342 82.17 0 0 16 7 78
0 133 35.95 2430 85.31 0 0 15 6 78
tty mfid1 cpu
tin tout KB/t tps MB/s us ni sy in id
0 46 35.93 2076 72.84 0 0 18 6 76
0 133 35.94 2226 78.12 0 0 15 7 78
0 45 35.97 2013 70.70 0 0 13 5 82
0 45 35.94 2661 93.41 0 0 17 7 76
0 45 35.93 2215 77.73 0 0 16 5 78
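(The output above is just iostat watching the device once a second, something like the following, though don't hold me to the exact flags:)
Code:
iostat -w 1 mfid1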
That's the SLOG device. On the ESXi host, I'm ssh'd in and I did
Code:
/vmfs/volumes/256d35e6-3b3b14a0 # dd if=/dev/zero of=file2 bs=1048576 count=2048
2048+0 records in
2048+0 records out
/vmfs/volumes/256d35e6-3b3b14a0 #
Now if you look at the iostat output, the write starts off really aggressively at 100-110MB/sec, but then you can see the write-back cache fill up, and it drops to 70-85MB/sec after about 10 seconds.
That's ... awesome. It would be unrealistically expensive to go out and buy a BBU RAID controller just for this purpose, but if you can get a system board that already integrates one for a modest price differential over a different server board, this is something to really think about.