
Some differences between RAIDZ and mirrors, and why we use mirrors for block storage

jgreco (Resident Grinch, Moderator)
#1
ZFS is a complicated, powerful system. Unfortunately, it isn't actually magic, and there's a lot of opportunity for disappointment if you don't understand what's going on.

RAIDZ (including Z2 and Z3) is good for storing large sequential files. ZFS will allocate long, contiguous stretches of disk for large blocks of data, compress them, and store the parity efficiently. Used this way, RAIDZ makes very good use of the available raw disk space. However, RAIDZ is not good at storing small blocks of data. To illustrate what I mean, consider storing an 8K block of data on RAIDZ3: you store the 8K data block, then three additional parity blocks... not efficient. Further, from an IOPS perspective, a RAIDZ vdev tends to exhibit the IOPS behaviour of a single component disk (and the slowest one, at that).
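A rough sketch of that allocation math, assuming 4K (ashift=12) sectors and a block small enough to fit in a single RAIDZ stripe row; this is illustrative back-of-envelope arithmetic, not exact ZFS accounting:

```python
import math

def raidz_alloc_bytes(block_bytes, nparity, sector=4096):
    """Approximate on-disk footprint of one small block on RAIDZ.

    Assumes the block fits in one stripe row, so it needs `nparity`
    parity sectors, and the allocation is padded up to a multiple of
    (nparity + 1) sectors, as RAIDZ does to avoid unusable gaps.
    """
    data_sectors = math.ceil(block_bytes / sector)
    total = data_sectors + nparity
    total += (-total) % (nparity + 1)  # allocation padding
    return total * sector

# An 8K block on RAIDZ3: 2 data + 3 parity sectors, padded to 8 sectors.
print(raidz_alloc_bytes(8192, nparity=3))  # 32768 bytes on disk for 8192 of data
# The same 8K block on a 2-way mirror simply costs two copies:
print(8192 * 2)                            # 16384
```

So for small blocks the RAIDZ3 overhead can be worse than mirroring, which is the point being made above.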

So we see a lot of people coming into the forums trying to store VM data on their 12-disk-wide RAIDZ2 and wondering why their 12-disk, 30 TB array sucks for performance. It's exhibiting the speed of a single disk.

The solution to this is mirrors. Mirrors aren't as good at making use of the raw disk space (you only end up with 1/2 or 1/3 of it), but in return for the greater resource commitment you get much better performance. First, mirrors do not consume a variable amount of space for parity. Second, you're likely to have more vdevs. The 12-drive system we were just talking about becomes 4 three-way mirrors or 6 two-way mirrors, which is 4x or 6x the number of vdevs. This translates directly to greatly enhanced performance!
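The vdev arithmetic above can be sketched out; the per-disk IOPS figure is an assumption (a rough number for a 7200 rpm HDD), not a measurement:

```python
DISKS = 12
DISK_IOPS = 250  # assumed random IOPS for one 7200 rpm HDD

# A single RAIDZ vdev delivers roughly one component disk's IOPS.
raidz2_iops = 1 * DISK_IOPS

# The same 12 disks as mirrors yield more vdevs,
# and pool IOPS scale with the number of vdevs.
two_way_vdevs = DISKS // 2    # 6 vdevs of 2-way mirrors
three_way_vdevs = DISKS // 3  # 4 vdevs of 3-way mirrors

print(raidz2_iops)                    # ~250
print(two_way_vdevs * DISK_IOPS)      # ~1500
print(three_way_vdevs * DISK_IOPS)    # ~1000
```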

Another substantial performance enhancement with ZFS is to maintain low pool occupancy rates.

For RAIDZ style file storage, it's commonly thought that performance will suffer once you pass the 80% mark, but this isn't quite right. It's a combination of fragmentation and occupancy that causes performance to suffer.

For mirrors, this is also true, but because the data being stored is often VM disk files or database files, it becomes more complicated. Because ZFS is a copy-on-write filesystem, rewriting a block in a VM disk file allocates a new block somewhere else and leaves a hole where the old block was, once that block is freed (after any snapshots are released, etc.). When writing new data, ZFS likes to allocate contiguous regions of disk to write its transaction groups. An interesting side effect of this is that if you are rewriting VM disk blocks 1, 5000, 22222, and 876543, these may actually be written as sequentially allocated blocks when ZFS dumps that transaction group to disk. A normal disk array would have to do four seeks to do those writes, but ZFS *may* be able to write them sequentially. Taken to its logical conclusion, when ZFS has massive amounts of free space to work with, it can potentially be five or ten times faster at performing writes than a conventional disk array. The downside? ZFS will suffer if it lacks that free space.

If you want really fast VM writes, keep your occupancy rates low. As low as 10-25% if possible. Going past 50% may eventually lead to very poor performance as fragmentation grows with age and rewrites.

None of this helps with reads, of course, which over time become highly fragmented. ZFS typically mitigates this with gobs of ARC and L2ARC, which allow it to serve up the most frequently accessed data from the cache.
 

MtK (FreeNAS Experienced)
#2
Nice summary!
 

jgreco
#4
Hi

Is storage for VM's (or databases) not typically better served by SSD?
Sure, but if you have a VM environment with 30TB of active VM data, that can be handled by an array of 24 6TB HDD's in mirror (72TB pool at ~50% occupancy). Cost: around $4800. Now the thing is, at ~50% occupancy you might only be getting ~2000 write IOPS out of it, but if that's okay in your environment, ...

Compare that to SSD. Even assuming you used consumer-grade SSD like the 850 Evo, to get a pool with ~80% occupancy of 30TB in RAIDZ2 (aside from needing to know exactly how to do the blocksize so that you don't lose excessive space to parity), we're talking a 40TB pool, or 44TB of raw SSD, or 22 2TB SSD's, or around $15000. And that's a risky proposition, because you're still only creating a single RAIDZ2 vdev, so you will be fighting interesting performance gotchas.

Going full-on SSD mirrors, you'd need 80TB of raw SSD, or 40 2TB SSD's, or around $26000. Now of course that will happily chew up anything you want to throw at it and will do so at a massive number of IOPS. But it's 5x more expensive than the HDD solution.
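The cost figures in this post reduce to simple arithmetic (the drive prices are the post's 2016-era figures, quoted totals rather than anything current):

```python
# HDD option: 24 x 6TB drives in 2-way mirrors
hdd_n, hdd_tb, hdd_price = 24, 6, 200
hdd_pool_tb = hdd_n * hdd_tb // 2   # 72 TB usable pool
hdd_cost = hdd_n * hdd_price        # $4800 total

# SSD RAIDZ2 option: 22 x 2TB SSDs for a ~40 TB pool
ssd_tb = 2
ssd_z2_raw = 22 * ssd_tb            # 44 TB raw

# SSD mirror option: 40 x 2TB SSDs
ssd_m_raw = 40 * ssd_tb             # 80 TB raw
ssd_m_cost = 26000                  # quoted total

print(hdd_pool_tb, hdd_cost)        # 72 TB for $4800
print(ssd_m_cost / hdd_cost)        # ~5.4x the HDD cost
```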
 

AlainD (FreeNAS Experienced)
#5
Sure, but if you have a VM environment with 30TB of active VM data, that can be handled by an array of 24 6TB HDD's in mirror (72TB pool at ~50% occupancy). ...
Ok, that size is clearly out of SSD reach, except when it's worth the high price.
 

jgreco
#6
Ok, that size is clearly out of SSD reach, except when it's worth the high price.
Well, sure. But the thing to recognize about a lot of VM environments is that they consist of huge amounts of data at rest: if you look at the average Windows or FreeBSD or Linux install, about 80% of it is rarely if ever touched. So in a VM environment, there's a hell of a lot of value in separating the active applications and data on your VM's onto a separate virtual disk (backed by SSD), with the base OS on a larger virtual disk backed by conventional HDD. Done that way, you get the benefits of SSD for the stuff that really matters, and HDD will continue to be relevant for some time because of it. Hypervisor management systems like vSphere already have support for storage profiles baked in, letting you put the don't-care stuff in the cheap seats.
 

JMC (Newbie)
#8
What if you're not interested in VM's?

If your primary purpose is plain old storage for a fileserver, with a lot of mixed file sizes, and the transport is CIFS/NFS and ownCloud?

I've tried to read a lot about this raidz vs mirrors debate, and of course you want performance, but my primary concern is the safety of my data, and I will sacrifice both performance and storage size (in that order) to get it.

I've read posts with conflicting viewpoints, and I am unsure whether mirrors provide the same protection against vdev failure as, say, raidz2 or 3?

Resilvering is my most pressing concern, and of course that is where I view mirrors as having the greater advantage (less time spent, lower chance of total vdev failure?).

Edit1: Article that is tempting me on the argument for mirrors:
http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/

My fear is this: once you have a failed disk in a 3TB x 12 drive setup (36TB/18TB), if the next disk to go (before you're optimal again) is the opposite disk in that mirror, the whole pool is effectively lost?

Now you're looking at replacing both drives, starting from scratch, and restoring from backup. The last time I had to restore a significant amount from backup it was 300GB, and that took about 8-9 hours; 18TB would of course be worse...

With the same disks in a 2 x 6 raidz2 setup, I can lose any 2 disks in a single vdev and survive?

Sure, resilvering takes longer (a lot as I understand it), but at my bandwidth speed, so does restoring from backup.

As an aside, I saw a YouTube video where Intel presented a new way of distributing the parity, greatly speeding up resilvering on raidz; I'll go hunt for the link.

edit2: found it: https://www.youtube.com/watch?v=MxKohtFSB4M&index=7&list=PLaUVvul17xSedlXipesHxfzDm74lXj0ab

Thank you for posting this, and sorry for the long reply, I'm new here, and have yet to get a feel for the community.
 

Mirfster (Doesn't know what he's talking about)
#9
I've read posts with conflicting viewpoints, and I am unsure whether mirrors provide the same protection against vdev failure as, say, raidz2 or 3?
That would depend on how your mirrors are set up. By default it is a 2-way mirror, so it is less redundant than RaidZ2. However, you can do 3-way mirrors (heck, maybe even 4-way or more). The downside is how much actual usable space you will be getting...
My fear is this: once you have a failed disk in a 3TB x 12 drive setup (36TB/18TB), if the next disk to go (before you're optimal again) is the opposite disk in that mirror, the whole pool is effectively lost?
Yes, but that is more along the lines of: if any vDev that is part of a pool is lost, then the entire pool is lost.
With the same disks in a 2 x 6 raidz2 setup, I can lose any 2 disks in a single vdev and survive?
Yes, RaidZ2 can withstand two drives being lost (but then you are at the "edge of a cliff" if you don't get that addressed). So say you set up 3-way mirrors; in that sense you have similar redundancy to RaidZ2, however you are using 2 of the drives for redundancy and only getting one drive's worth of usable space...
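That redundancy-vs-space tradeoff can be tabulated as a small sketch (per-vdev numbers only; the helper names here are just for illustration):

```python
def mirror_stats(way):
    """Per-vdev numbers for an N-way mirror."""
    return {"drives": way,
            "failures_tolerated": way - 1,
            "usable_fraction": 1 / way}

def raidz_stats(width, parity):
    """Per-vdev numbers for a RAIDZ-p vdev of the given width."""
    return {"drives": width,
            "failures_tolerated": parity,
            "usable_fraction": (width - parity) / width}

# A 3-way mirror tolerates the same two failures as RaidZ2,
# but a 6-wide RaidZ2 yields 2/3 usable space vs the mirror's 1/3.
print(mirror_stats(3))
print(raidz_stats(6, 2))
```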

There is always a balance between speed, redundancy, and space that you need to decide fits best for your design and use-case. I would suggest reading "Slideshow explaining VDev, zpool, ZIL and L2ARC for noobs!", which should help you gain a better understanding.
 

Mirfster
#10
What if you're not interested in VM's?

If your primary purpose is plain old storage for a fileserver, with a lot of mixed file sizes, and the transport is CIFS/NFS and ownCloud?
Then for the most part RaidZ2 or RaidZ3 would be fine.
 

jgreco
#11
What if you're not interested in VM's?
For a smaller number of large files, RAIDZ2/3 is the clear winner.

If your primary purpose is plain old storage for a fileserver, with a lot of mixed file sizes, and the transport is CIFS/NFS and ownCloud?
For mixed file storage, it's a little less clear. Mirrors will work fine for the application but are more expensive. RAIDZ2/3 does less well with a huge number of smaller files.

I've tried to read a lot about this raidz vs mirrors debate, and of course you want performance, but my primary concern is the safety of my data, and I will sacrifice both performance and storage size (in that order) to get it.
It isn't even that you "want" performance. A filer that performs at a snail's pace because the demands placed on it are too great is useless. If we were to create a hypothetical single-vdev RAIDZ2 filer with 24 8TB disks, we'd end up with a 176TB pool that could only support maybe a handful of moderate-activity VM's, despite the apparently massive IOPS capacity of the raw pool devices underneath. The vdev restriction gets you.
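To put numbers on that hypothetical (a sketch; "TB" here is raw formatted drive capacity before ZFS overhead, and ~250 IOPS per HDD is an assumption):

```python
disks, size_tb, parity = 24, 8, 2

# One wide RAIDZ2 vdev: lots of space...
usable_tb = (disks - parity) * size_tb   # 176 TB usable

# ...but the pool has only a single vdev, so it still delivers
# roughly one disk's worth of IOPS regardless of disk count.
vdevs = 1
pool_iops = vdevs * 250

print(usable_tb, pool_iops)
```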

I've read posts with conflicting viewpoints, and I am unsure whether mirrors provide the same protection against vdev failure as, say, raidz2 or 3?
You're welcome to reference them here where I will happily disembowel them if appropriate.

So here's the thing. When most people hear "mirrors" what they're seeing in their head is a two-way mirror. And, yes, if you lose any single disk in a pool made of two-way mirror vdevs, you've lost redundancy. That is, there is some portion of your data in the pool that cannot be recovered if there are any failures (read errors, etc) in the other half of that mirror.

That might even matter a tiny little bit if it were the only option. However, it isn't. You can also go to three-way mirrors. That is, three disks with exactly the same content. Really paranoid, RAIDZ3 style? You can go to four-way mirrors. Lose any two disks without compromising redundancy. Beyond? Yes you can. Ridiculous amounts of redundancy if you truly want.

I think anything beyond four is probably stupid except perhaps in very specialized cases where you need immense amounts of read IOPS and cannot handle that with ARC/L2ARC.

The VM filer here had a simple design goal that losing a disk should not compromise redundancy, so we went with eight three-way mirror vdevs and two spare drives. If we lose a disk, a rebuild immediately starts onto a spare. This is probably the sensible way to balance things out.

Resilvering is my most pressing concern, and of course that is where I view mirrors as having the greater advantage (less time spent, lower chance of total vdev failure?).

Edit1: Article that is tempting me on the argument for mirrors:
http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/

My fear is this: once you have a failed disk in a 3TB x 12 drive setup (36TB/18TB), if the next disk to go (before you're optimal again) is the opposite disk in that mirror, the whole pool is effectively lost?
Yes. Or even if you merely have an unrecoverable read error, it could mean data is lost or damaged.

Now you're looking at replacing both drives, starting from scratch, and restoring from backup. The last time I had to restore a significant amount from backup it was 300GB, and that took about 8-9 hours; 18TB would of course be worse...

With the same disks in a 2 x 6 raidz2 setup, I can lose any 2 disks in a single vdev and survive?
In theory. However, once you've lost two disks in a single vdev, you've compromised redundancy, and you are highly reliant on perfect operation of the remaining disks for a rebuild. It might be better to think of RAIDZ2 as "safely able to tolerate a single drive loss." This is a pragmatic application of the "RAID5 dies in 2009" issue. You want to avoid compromising redundancy, because any checksum error or other read issue means ZFS needs to reconstruct that block. I'd propose there's a reasonable argument that you cannot allow things to degrade to the point where you have no redundancy. That implies RAIDZ1 is unacceptable, and that RAIDZ2 is the minimum acceptable tier for RAIDZ. Likewise with mirrors: two-way mirrors lose redundancy upon the loss of a drive, so three-way mirrors are the minimum acceptable tier for mirrors.

So if your data truly matters to you and you really don't want to risk losing it, I would strongly encourage you to adopt an "avoid compromising redundancy" strategy, which means RAIDZ2 or three-way mirrors as the minimum redundancy tiers.

Sure, resilvering takes longer (a lot as I understand it), but at my bandwidth speed, so does restoring from backup.

As an aside, I saw a YouTube video where Intel presented a new way of distributing the parity, greatly speeding up resilvering on raidz; I'll go hunt for the link.

edit2: found it: https://www.youtube.com/watch?v=MxKohtFSB4M&index=7&list=PLaUVvul17xSedlXipesHxfzDm74lXj0ab
For those of us who are audio-challenged on our browsing VM's: https://drive.google.com/file/d/0B4BF1vnv6p0-bGVrQXg3dmY1Uzg/view?safe=strict&pref=2&pli=1

Thank you for posting this, and sorry for the long reply, I'm new here, and have yet to get a feel for the community.
There are a lot of great folks around here who will be happy to set you straight on (or debate) any given topic. However, be aware that most things that can be said already have been, so do feel free to take advantage of the forum search feature.
 

JMC
#12
Thank you both for your reply, it is taken to heart (especially the disemboweling bit ;) )...

Mirfster, I had already read most of your recommended reading, Cyberjock's and jgreco's entries, a couple of times.
I hadn't seen your last 2 entries, and I look forward to consuming them later tonight!

I've also loaded the 9.10 manual that gpsguy has in his sig onto my Kindle, and it is going to be my new bedtime story, so yes, trying hard, but admittedly playing catchup!

Jgreco, I wouldn't dream of allowing a state to deteriorate to the point of requiring perfect operation, but I have had a bit of brow sweat with RAID 6 losing a "secondary" drive during the rebuild after replacing the first, so yes, I agree completely: parity +1 is a must.

And it is indeed my simple ignorance of thinking only of 2-way mirrors; as you and Mirfster correctly deduce, 3-way mirrors would bring the peace of mind I require, but as a hobbyist-level user, the expense sets in quick.

Might I inquire: I have spare disks physically lying next to the server, sealed up. The thinking is they'll see zero use, and minimal temperature change, until needed.
Would you recommend adding them as actual spares, so the system can engage them when needed instead?
(I figure this post from 2014 is still true: https://forums.freenas.org/index.ph...-automatic-replace-bad-hdd.23198/#post-140189) - Mind you, I have easy access to the server.

And finally, I'm happy to be set straight, but nowhere near being able to debate anything yet :)
 

Mirfster
#13
Might I inquire: I have spare disks physically lying next to the server, sealed up. The thinking is they'll see zero use, and minimal temperature change, until needed.
Would you recommend adding them as actual spares, so the system can engage them when needed instead?
That would depend upon what level of redundancy and "ease of access" you have, IMHO. Normally I would not use spares with a RaidZ2/RaidZ3 vDev configuration; to me that is akin to a waste of perfectly good drive space. However, if the server is located off-site, or somewhere not easily accessible physically, then a spare (or two) would be a safe consideration. Could also be nice if you are going on a long vacation.

Now, I also have another system that holds 12 drives and hosts VMs. That system is currently running 5 x 2-way mirrors (for IOPS) and two spares. This configuration fits my comfort zone, but YMMV.
 

jgreco
#14
Jgreco, I wouldn't dream of allowing a state to deteriorate to the point of requiring perfect operation,
And yet many people operate in that mode by default, their PC's blissfully storing data and trusting its regurgitation without so much as a second drive.

but I have had a bit of brow sweat with RAID 6 losing a "secondary" drive during the rebuild after replacing the first, so yes, I agree completely: parity +1 is a must.
I think it is a healthy state of mind. Plus, resources these days are *so* damn cheap.

And it is indeed my simple ignorance of thinking only of 2-way mirrors; as you and Mirfster correctly deduce, 3-way mirrors would bring the peace of mind I require, but as a hobbyist-level user, the expense sets in quick.
It's all relative. It wasn't that long ago that we were paying $1/GB for HDD space. As someone who does this from a business user perspective, I *could* justify doing pretty much whatever the heck I wanted to, but fundamentally I don't like signing off on anything that feels like wasting money. So our mirrored VM filer uses $90 2TB 2.5" laptop drives instead of the $360 2TB 2.5" "enterprise" drives: one quarter the cost, and guessing about half the performance. The 3.5" drives are even cheaper of course, but have more heat and density issues.

Might I inquire: I have spare disks physically lying next to the server, sealed up. The thinking is they'll see zero use, and minimal temperature change, until needed.
Would you recommend adding them as actual spares, so the system can engage them when needed instead?
For mirrors, or RAIDZ?

For RAIDZ, it probably makes more sense to have them already running in the RAIDZ with wider parity (3 instead of 2) in most cases. For mirrors, since you can detach a drive and shuffle as needed, there are advantages to keeping the disk "in pool" there too.

A drive in pool as spare probably makes more sense than a drive sitting on a shelf, except for the wear and tear issue.

(I figure this post from 2014 is still true: https://forums.freenas.org/index.ph...-automatic-replace-bad-hdd.23198/#post-140189) - Mind you, I have easy access to the server.

And finally, I'm happy to be set straight, but nowhere near being able to debate anything yet :)
No; zfsd will happily replace a clearly failed drive from an available configured spare. It does not do so for every case where you might want to replace a drive. It *might* also do it for some classes of problems where, as an admin, you might prefer to hold back a bit and wait to see what develops.
 

MrToddsFriends (Documentation Browser)
#17
Might I inquire: I have spare disks physically lying next to the server, sealed up. The thinking is they'll see zero use, and minimal temperature change, until needed.
Keep in mind that burning in unused disks takes time to complete. As an estimate, a single read pass using solnet-array-test-v2.sh takes about 7 hours on a WD Red 3TB.
https://forums.freenas.org/index.ph...esting-your-freenas-system.17750/#post-148773

A single SMART long test takes about 7 hours on such a disk as well; a badblocks run with a single pattern (write and read) takes about 14 hours.
https://forums.freenas.org/index.php?threads/how-to-hard-drive-burn-in-testing.21451/
https://forums.freenas.org/index.ph...eview-and-burn-in-question.21015/#post-121569

So plan on several days for hard disk burn-in, depending on disk size, the number of tests you want to run, and your sleep-wake rhythm. This might be an argument in favor of hot spares, which have already undergone the burn-in procedure.
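As a sanity check on those estimates, the quoted times correspond to one full sequential pass over the disk; the ~120 MB/s figure is an assumed average throughput for a 3TB Red-class drive, not a published spec:

```python
def full_pass_hours(capacity_tb, mb_per_s=120):
    """Hours for one sequential read (or write) pass over a whole disk."""
    seconds = capacity_tb * 1e12 / (mb_per_s * 1e6)
    return seconds / 3600

# One read pass (or a SMART long test) on a 3TB disk:
print(round(full_pass_hours(3), 1))       # ~6.9 hours
# A single-pattern badblocks run is a write pass plus a read pass:
print(round(2 * full_pass_hours(3), 1))   # ~13.9 hours
```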
 

JMC
#18
Keep in mind that burning in unused disks will take its time to complete. ... This might be an argument pro hot spares, which already underwent the burn-in procedure.
That's actually a really good point, thank you.
 

jgreco
#19
Do you have any concerns about lack of ERC/TLER on those?
Sure.

If so, do you have a mitigation strategy?
Yes. Aggressively monitor for problems, and if a drive develops one, take corrective action on it with the five pound sledge while its replacement resilvers.

I just can't justify paying 4x the price for TLER. Besides, there's a terabyte of L2ARC on the box, which virtually guarantees whatever is being read from the pool isn't super important.
 
#20
I am trying to wrap my head around the config that will get me the best speed with redundancy on some existing gear. I think this is the first line of every forlorn post that gets responded to with "you can't get there from here"...

I have 9 2TB 7200rpm SATA III HGST drives on a 3Ware RAID controller (a controller that I should not be using), which I have told to leave the disks as JBOD. I will be replacing the controller with two 8-port IT-mode LSI's, since my backplane holds 16 disks. I thought that I was going to set up 4 2-way mirrors with a spare. I thought that would give me 4 vdevs and improve my throughput...

Am I supposed to be creating 4 2-way mirrors and then striping across them? Will that give me 4 vdevs? That does not seem to be a gui-able config.
 