RAIDz Suggested Number of Drives For Maximum Performance

Status
Not open for further replies.

bollar

Patron
Joined
Oct 28, 2012
Messages
411
I've been trying to understand why there are recommended numbers of drives for ZFS RAIDz vdevs. I found this thread: What number of drives are allowed in a RAIDZ config? which suggests this is because ZFS writes 128 KiB blocks, and drive counts other than the recommended 2^n + parity slow the system down because it can't write evenly across the disks. This appears to be confirmed in this thread: Weird raidz1 and raidz2 performance with 4 drives - any explanation?

One of the things I hope to test with my benchmark project is the impact of this at larger array sizes, but in the meantime, here is the matrix for all RAIDz levels. Good array sizes (2^n data disks) are marked with an asterisk.

            Level
Drives     1       2       3
   3      64*    128*     n/a
   4      43      64*    128*
   5      32*     43      64*
   6      26      32*     43
   7      21      26      32*
   8      18      21      26
   9      16*     18      21
  10      14      16*     18
  11      13      14      16*
  12      12      13      14
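
If I'm reading the math right, each value above is just the 128 KiB record divided across the data disks (total drives minus parity). A minimal shell sketch, on my own assumptions, that reproduces the numbers:

Code:
for drives in 3 4 5 6 7 8 9 10 11 12; do
  for parity in 1 2 3; do
    data=$((drives - parity))
    if [ "$data" -ge 1 ]; then
      # a full 128 KiB record spread evenly across the data disks
      awk -v n="$drives" -v p="$parity" -v d="$data" \
        'BEGIN { printf "%2d drives, RAIDZ%d: %5.1f KiB per data disk\n", n, p, 128 / d }'
    fi
  done
done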
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Okay, having looked at this and the links several times, I can't quite figure out ... the number in the boxes is supposed to be ... ideal block size?

:confused: Feeling particularly dumb... :confused:
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yep, because the data being written is an even divisor of the 128 KiB stripe and, more importantly, falls on the 4 KiB sector boundary. It was less important when hard drives had 512-byte sectors.
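
To make that concrete (my own sketch, not anything from the linked threads): a quick check of whether the per-data-disk share of a full 128 KiB record lands on a 4 KiB sector boundary, for a few assumed data-disk counts:

Code:
for data_disks in 2 3 4 5 6 8; do
  awk -v d="$data_disks" 'BEGIN {
    per = 128 / d
    printf "%d data disks: %6.2f KiB per disk -> %s\n", d, per, (per % 4 == 0 ? "4K aligned" : "not 4K aligned")
  }'
done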
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
I think it would be interesting to see if a "zfs set recordsize=" could improve performance in some of these cases, especially RAIDz1 and four drives, which has to be a fairly common configuration.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Are you wanting to make it smaller or larger? I'm really grasping at straws as to how it would affect performance at all.

Quote from http://www.princeton.edu/~unix/Solaris/troubleshoot/zfs.html

The recordsize parameter can be tuned on ZFS filesystems. When it is changed, it only affects new files. zfs set recordsize=size tuning can help where large files (like database files) are accessed via small, random reads and writes. The default is 128KB; it can be set to any power of two between 512B and 128KB. Where the database uses a fixed block or record size, the recordsize should be set to match. This should only be done for the filesystems actually containing heavily-used database files.

In general, recordsize should be reduced when iostat regularly shows a throughput near the maximum for the I/O channel. As with any tuning, make a minimal change to a working system, monitor it for long enough to understand the impact of the change, and repeat the process if the improvement was not good enough or reverse it if the effects were bad.

For most of us, we're using files that are bigger than 128KB, so shrinking that sounds like a bad idea. This sounds like it would be good for a database situation where you KNOW the data blocks are ALWAYS 4KB or some other known size and you can adjust ZFS to match. But for any files bigger than 128KB it seems like you'd want it to be as big as possible. Hence the default is 128KB.
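
As a concrete (purely hypothetical) example of the database case from the quote, assuming a pool named tank with a dataset tank/db whose database does fixed 16 KiB I/O, the tuning would look like this; remember it only affects files written after the change:

Code:
# dataset name and 16K page size are assumptions; match them to your own setup
zfs set recordsize=16K tank/db
zfs get recordsize tank/db   # verify the property took effect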
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Are you wanting to make it smaller or larger? I'm really grasping at straws as to how it would affect performance at all.

I don't know the answer, but I'm curious to know if changes to recordsize might improve write performance in those known suboptimal configurations, such as RAIDz1 and four drives. The answer is probably not, but I hadn't found a reference that said so definitively.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't know the answer, but I'm curious to know if changes to recordsize might improve write performance in those known suboptimal configurations, such as RAIDz1 and four drives. The answer is probably not, but I hadn't found a reference that said so definitively.

I would say no. That doesn't change the stripesize. It changes the recordsize.

Even if it did what you think it does, what would you change it to? It still has to be a power of 2 between 512B and 128KB. The fact that you have an odd number of data disks (three, in a four-drive RAIDZ1) means you'll never get an even multiple of 4k for your sectors.
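
A quick arithmetic sketch of that point (my own numbers, assuming a 4-drive RAIDZ1, i.e. three data disks, and 4 KiB sectors): none of the larger power-of-two recordsizes split into a whole number of 4 KiB sectors per disk:

Code:
for rs_kib in 4 8 16 32 64 128; do
  awk -v r="$rs_kib" 'BEGIN {
    per = r / 3
    printf "recordsize %3d KiB -> %6.2f KiB per data disk (%s)\n", r, per,
           (per % 4 == 0 ? "aligned" : "misaligned")
  }'
done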
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, maybe I misunderstood recordsize. But check out this thread. In particular post #2.

RAID-Z is somewhat odd; it is really more like RAID3 than RAID5. To avoid confusion, let me explain how I understand this to work:

Traditional RAID
In traditional RAIDs we know the stripesize; normally 128KiB. Depending on the stripe width (the number of actual striped data disks), the 'full stripe block' would be <data_disks> * <stripesize>. In RAID5 the value of this full stripe block is very important:

1) if we write exactly the amount of data of this full stripe block, the RAID5 engine can do this at very high speeds, theoretically the same as RAID0 minus the parity disks.

2) if we write any other value that is not a multiple of the full stripe block, then we would have to do a read+xor+write procedure, which is very slow.

Traditional RAID5 engines with write-back essentially build up a queue (buffer) of I/O requests and scan for full stripe blocks which can be written efficiently; and will use slower read+xor+write for any smaller or leftover I/O.

RAID-Z
RAID-Z is vastly different. It will do ALL I/O in ONE phase; thus no read+xor+write will ever happen. How does it do this? It changes the stripe size so that each write request fits in a full stripe block. The 'recordsize' in ZFS is like this full stripe block. As far as I know, you cannot set it higher than 128KiB, which is a shame really.

So what happens? For sequential I/O the request sizes will be 128KiB (maximum) and thus 128KiB will be written to the vdev. The 128KiB then gets spread over all the disks. 128 / 3 for a 4-disk RAID-Z would produce an odd value: 42.5/43.0KiB. Both are misaligned at the end offset on 4K-sector disks, requiring THEM to do a read whole sector + calc new ECC + write whole sector. Thus this behavior is devastating to performance on 4K-sector drives with 512-byte emulation; each single write request issued to the vdev will cause it to perform 512-byte sector emulation.

Of course, that's how he understands it, so he could be mistaken. If we REALLY wanted to see a performance increase with huge arrays holding only ginormous files, having a stripe bigger than 128KB might be very helpful. It really is a shame it's limited to 128KB. But at the end of the day you are still screwed, with your 4k sectors never being mathematically capable of aligning with an odd number of data disks like 3.
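
For what it's worth, here is a small sketch of the "full stripe block" idea from the quoted post, for the traditional RAID5 case; the data-disk count, stripesize, and write sizes are assumptions for illustration only:

Code:
data_disks=3
stripesize_kib=128
full_stripe_kib=$((data_disks * stripesize_kib))   # 384 KiB
for write_kib in 128 384 768 1000; do
  if [ $((write_kib % full_stripe_kib)) -eq 0 ]; then
    echo "${write_kib} KiB write: multiple of the full stripe block (fast path)"
  else
    echo "${write_kib} KiB write: partial stripe, needs read+xor+write (slow path)"
  fi
done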
 