RAIDZ1 vs. RAID 5 && UREs

Status
Not open for further replies.

djdwosk97

Patron
Joined
Jun 12, 2015
Messages
382
So I've been trying to get a grasp on this for a while and I haven't really gotten any decently satisfactory answer. I've read most of the articles I could find on the issue, but still....I'm not happy with where I'm at right now.


So, a three-part question...

First off, what exactly is a URE -- I get what it is, but I haven't been able to figure out why running into a URE during a rebuild would result in an array failing to rebuild. Why wouldn't a URE just cause a single file/sector-worth of files to become corrupt? Does scrubbing help reduce the chances of hitting a URE?

Secondly, let's assume I'm running a five-drive RAIDZ2 array, one drive failed and has been replaced with a blank drive. The process of rebuilding the array begins and drive #4 runs into a URE. Since it's RAIDZ2 there can be a second drive failure and thus the array continues to rebuild. At this point would that drive be non-functional or would it still remain functional once it's passed that bad sector? So if you were to run into another URE (this time on drive #3) would the array fail to rebuild (because drive #4 is bad) or would it continue to rebuild (because drive #4 is now fine since it's passed the bad sector)?

Lastly, is there a difference between RAIDZ1/RAID5 (or Z2/6)?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I haven't been able to figure out why running into a URE during a rebuild would result in an array failing to rebuild. Why wouldn't a URE just cause a single file/sector-worth of files to become corrupt?
If it occurs in data, you might only lose data. If it occurs in metadata, you might lose the pool.
Does scrubbing help reduce the chances of hitting a URE?
Yes, in the sense that regular scrubs are supposed to find and fix corruption while all drives are still online, instead of waiting until a drive has failed, at which point you have reduced redundancy.
would that drive be non-functional or would it still remain functional
Depends on the nature of the failure. A single read failure should not cause the drive to immediately go offline.
is there a difference between RAIDZ1/RAID5 (or Z2/6)?
Yes, the ones with Zs in them are ZFS configurations. ZFS is quite different from other RAID systems.
 
Joined
Apr 9, 2015
Messages
1,258
https://en.wikipedia.org/wiki/RAID#URE

An unrecoverable read error should not kick a drive out of the vdev/pool. The system will actually read the data from multiple places and proceed, though it should throw an alert about the URE, which will be fixed with a scrub. The next error will do the same thing even though it is on a different drive. Higher levels of RAIDZ will have better chances of surviving UREs as vdevs and pools increase in size. https://en.wikipedia.org/wiki/RAID#Increasing_rebuild_time_and_failure_probability

Yes. RAID5 is typically a hardware implementation while RAIDZ1 is software-based. They have the same level of redundancy but some differences, most importantly:

RAID-Z avoids the RAID-5 write hole by distributing logical blocks among disks whereas RAID-5 aggregates unrelated blocks into fixed-width stripes protected by a parity block. This actually means that RAID-Z is far more similar to RAID-3 where blocks are carved up and distributed among the disks; whereas RAID-5 puts a single block on a single disk, RAID-Z and RAID-3 must access all disks to read a single block thus reducing the effective IOPS.
https://blogs.oracle.com/ahl/entry/what_is_raid_z
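For anyone who wants to play with the IOPS point from that quote, here's a minimal Python sketch (my own illustration, not from the linked blog). It assumes each disk delivers a fixed number of random IOPS, that RAID-Z touches every data disk in the vdev for each logical block read, and that RAID5 can satisfy a small random read from a single disk; caching, queue depth, and parity traffic are ignored.

```python
# Rough illustration of the random-read IOPS difference described above.
# Assumptions (mine, not from the blog): each disk delivers ~150 random
# IOPS, RAID-Z reads a full stripe (every data disk) per logical block,
# and RAID5 can serve a small random read from a single disk.

def raid5_random_read_iops(disks: int, per_disk_iops: float = 150.0) -> float:
    """Small random reads can be spread across all member disks."""
    return disks * per_disk_iops

def raidz_random_read_iops(disks: int, per_disk_iops: float = 150.0) -> float:
    """Every logical block read touches all data disks, so the whole
    vdev behaves roughly like a single disk for random reads."""
    return per_disk_iops

if __name__ == "__main__":
    n = 5  # five-disk array/vdev, as in the examples in this thread
    print(f"RAID5,   {n} disks: ~{raid5_random_read_iops(n):.0f} random-read IOPS")
    print(f"RAID-Z1, {n} disks: ~{raidz_random_read_iops(n):.0f} random-read IOPS")
```

The exact numbers don't matter; the point is that a RAID-Z vdev scales random-read IOPS per vdev rather than per disk, which is why multiple vdevs (or mirrors) get recommended for IOPS-heavy workloads.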
 

djdwosk97

Patron
Joined
Jun 12, 2015
Messages
382
Yes, in the sense that regular scrubs are supposed to find and fix corruption while all drives are still online, instead of waiting until a drive has failed, at which point you have reduced redundancy.

Depends on the nature of the failure. A single read failure should not cause the drive to immediately go offline.

Yes, the ones with Zs in them are ZFS configurations. ZFS is quite different from other RAID systems.
Is it possible to encounter a URE on a drive that has just been scrubbed and where no errors were found (assuming no errors occurred in the interim)? Is a URE caused solely by corrupted data, or could it also be caused by a bad read?

So in RAIDZ2 (five drives, #5 failed and is being rebuilt, #4 hits a URE), drive #3 will come to the rescue and the rebuild will continue, at which point drive #4 is back in the array and is "fully functional" (and the bit of corrupted data that caused the URE would be fixed by ZFS)? So if any drive were to encounter another URE, it would be as if the first URE was never encountered? But if it was RAIDZ1, would the array have failed to rebuild, or would the drive still not have been dropped from the array? This is where most of my confusion comes from: from what I've read, RAID5 would hit a URE and then just fail to rebuild, so is RAIDZ1 different, or is that not the case with RAID5 either?

I understand that much, I was more curious to know if there is a functional difference between the two (as in, would encountering a URE during a RAID5 rebuild be different than encountering a URE during a RAIDZ1 rebuild)?

https://en.wikipedia.org/wiki/RAID#URE

An unrecoverable read error should not kick a drive out of the vdev/pool. The system will actually read the data from multiple places and proceed, though it should throw an alert about the URE, which will be fixed with a scrub. The next error will do the same thing even though it is on a different drive. Higher levels of RAIDZ will have better chances of surviving UREs as vdevs and pools increase in size. https://en.wikipedia.org/wiki/RAID#Increasing_rebuild_time_and_failure_probability

Yes. RAID5 is typically a hardware implementation while RAIDZ1 is software-based. They have the same level of redundancy but some differences, most importantly:

https://blogs.oracle.com/ahl/entry/what_is_raid_z
So (in a RaidZ2 array where one drive has failed and has been replaced by a blank disk) as long as you don't hit a URE on two disks at the same time/when dealing with the same data then you shouldn't have to worry about an array failing to rebuild (due to a URE)?

I understand that much, I was more curious to know if there is a functional difference between the two (as in, would encountering a URE during a RAID5 rebuild be different than encountering a URE during a RAIDZ1 rebuild)?
 
Joined
Apr 9, 2015
Messages
1,258
Could you encounter a URE after a scrub? Sure. Is it likely? That depends; a scrub actively reads all the data and attempts to make any repairs needed. There are other things at play that could cause a URE, but it is much less likely to happen than on a RAID5, from my understanding. On a standard RAID5 setup you can scan the disks with something like chkdsk or fsck. A URE can be caused by a lot of things, from a bad block to a firmware glitch in the drive to a file error; it's just a general term.

https://docs.oracle.com/cd/E18752_01/html/819-5461/gbbwa.html http://prefetch.net/blog/index.php/...ture-to-verify-the-integrity-of-your-storage/

RAID5 and RAIDZ1 encountering a URE during a rebuild would result in at least some data loss and could result in the loss of the pool. Since a URE is much more likely to happen on drives that are very large, we increase the number of drives used for redundancy. With ZFS pools we can have HUGE amounts of data available, so if you were to have, say, 50 x 6TB drives in a pool, you would likely want something like five vdevs with ten drives in each one in RAIDZ3. The reason for more redundancy is that any single vdev failing will cause the pool to fail. With a setup of that nature you would have around 210TB of space available. http://wintelguy.com/raidcalc.pl With that amount of data you are pretty well guaranteed to hit a URE at some point during its lifetime. http://www.raidtips.com/raid5-ure.aspx
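To put rough numbers on that layout, here's a back-of-the-envelope Python sketch (mine, not taken from either calculator link). It counts raw terabytes only, ignoring ZFS overhead, and assumes a constant, independent URE rate of one error per 10^14 bits read, a commonly quoted spec for consumer drives (enterprise drives are often rated at one per 10^15).

```python
# Back-of-the-envelope numbers for the 50 x 6TB example above.
# Assumptions (mine): raw TB only, no ZFS/filesystem overhead, and a
# constant, independent URE rate (default 1 error per 1e14 bits read).

def raidz_usable_tb(vdevs: int, disks_per_vdev: int, parity: int, drive_tb: float) -> float:
    """Usable capacity: (data disks per vdev) x drive size x number of vdevs."""
    return vdevs * (disks_per_vdev - parity) * drive_tb

def prob_at_least_one_ure(tb_read: float, ure_rate_per_bit: float = 1e-14) -> float:
    """Chance of hitting at least one URE while reading tb_read terabytes."""
    bits_read = tb_read * 1e12 * 8
    return 1 - (1 - ure_rate_per_bit) ** bits_read

if __name__ == "__main__":
    usable = raidz_usable_tb(vdevs=5, disks_per_vdev=10, parity=3, drive_tb=6)
    print(f"Usable space: ~{usable:.0f} TB")  # ~210 TB, matching the figure above
    print(f"P(>=1 URE) reading one 6 TB drive: {prob_at_least_one_ure(6):.0%}")
    print(f"P(>=1 URE) reading a 42 TB vdev:   {prob_at_least_one_ure(42):.0%}")
```

Which is exactly why the extra parity levels matter at these sizes.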

A raidZ2 would be much less likely to have problems rebuilding due to a URE and a raidZ3 would be even less likely. The more redundancy also helps to protect from a second drive failure during resilvering.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Could you encounter a URE after a scrub? Sure. Is it likely? That depends; a scrub actively reads all the data and attempts to make any repairs needed.

With all due respect to the awesome amounts of very good information provided so far, there's one other aspect to a scrub that's relevant:

A scrub or resilver only touches pool used space (data and metadata). It does not do anything with the unused portions of the disk.

This is different from RAID5, where the RAID controller has no knowledge of what might be on the disk, and so rebuilds parity for every disk and every sector.
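A rough way to see what that means for rebuild time is the little Python sketch below (my own, under my own assumptions: purely sequential reads at 150 MB/s and no seek overhead, which flatters the ZFS case; as noted further down in the thread, a resilver on a very full pool involves a lot of seeking).

```python
# Rough illustration: a RAID5 rebuild reads every sector of the disk,
# while a ZFS resilver only walks allocated data. Assumptions (mine):
# purely sequential reads at 150 MB/s and no seek overhead, which
# flatters the ZFS case on a fragmented or very full pool.

def rebuild_hours(disk_tb: float, fraction_read: float, mb_per_s: float = 150.0) -> float:
    """Hours needed to read `fraction_read` of a `disk_tb` TB disk."""
    bytes_to_read = disk_tb * 1e12 * fraction_read
    return bytes_to_read / (mb_per_s * 1e6) / 3600

if __name__ == "__main__":
    disk = 6.0  # TB
    print(f"RAID5 rebuild (reads 100%):    ~{rebuild_hours(disk, 1.0):.1f} h")
    print(f"ZFS resilver, pool 30% full:   ~{rebuild_hours(disk, 0.3):.1f} h")
    print(f"ZFS resilver, pool 90% full:   ~{rebuild_hours(disk, 0.9):.1f} h")
```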
 

djdwosk97

Patron
Joined
Jun 12, 2015
Messages
382
1. A URE can be caused by a lot of things, from a bad block to a firmware glitch in the drive to a file error; it's just a general term.

https://docs.oracle.com/cd/E18752_01/html/819-5461/gbbwa.html http://prefetch.net/blog/index.php/...ture-to-verify-the-integrity-of-your-storage/

2. RAID5 and RAIDZ1 encountering a URE during a rebuild would result in at least some data loss and could result in the loss of the pool. Since a URE is much more likely to happen on drives that are very large, we increase the number of drives used for redundancy. With ZFS pools we can have HUGE amounts of data available, so if you were to have, say, 50 x 6TB drives in a pool, you would likely want something like five vdevs with ten drives in each one in RAIDZ3. The reason for more redundancy is that any single vdev failing will cause the pool to fail. With a setup of that nature you would have around 210TB of space available. http://wintelguy.com/raidcalc.pl With that amount of data you are pretty well guaranteed to hit a URE at some point during its lifetime. http://www.raidtips.com/raid5-ure.aspx

3. A raidZ2 would be much less likely to have problems rebuilding due to a URE and a raidZ3 would be even less likely. The more redundancy also helps to protect from a second drive failure during resilvering.
1. Okay, that's what I was after, thanks.

2/3. So I understand the benefits of using higher levels of redundancy in order to minimize risks; that's not really what I'm asking. I'm just really curious to know more about UREs on a conceptual level.

This is different from RAID5, where the RAID controller has no knowledge of what might be on the disk, and so rebuilds parity for every disk and every sector.
Does that make any difference (the filesystem being aware of the actual files on the disk) if the disk is full?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Does that make any difference (the filesystem being aware of the actual files on the disk) if the disk is full?

Well, yes, because the way ZFS validates its data is to traverse the tree, which has various performance implications, including that a resilver/scrub operation on a very full pool will be slower than a RAID5 rebuild/patrol. The operation is probably more stressy on a ZFS pool because it can involve a fair bit of seeking.
 

djdwosk97

Patron
Joined
Jun 12, 2015
Messages
382
Well, yes, because the way ZFS validates its data is to traverse the tree, which has various performance implications, including that a resilver/scrub operation on a very full pool will be slower than a RAID5 rebuild/patrol. The operation is probably more stressy on a ZFS pool because it can involve a fair bit of seeking.
That makes sense.

Okay, so back to my initial three questions....
1. So if a URE is on a data block, then any files in that block will likely be corrupt, but the array will continue to be rebuilt; but if the URE is on a metadata block, then the URE could potentially cause the array to fail to rebuild (assuming RAIDZ1). Would this also be the case in RAID5?

2. And in RAIDZ2/6 (assuming no other drives fail in the process) as long as you don't hit two UREs concurrently then a URE can never cause an array to fail to rebuild (or corrupt any data) regardless of where the URE is.

3. A few differences have been explained, but would a URE (or any other rebuild-related error) be handled differently between Z1/5?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
RAID5 will basically die if you get a URE during a rebuild, because the RAID controller has no idea whether that block is empty, or is the most important data on the array.

ZFS will lose one block of data (unless copies=2). The odds of it losing the pool or metadata are close to zero, because metadata is stored at least twice.

answer to #2: Yep.

#3. Yes, as explained above. Raid5 loses the whole array; Z1 loses one block of data in a file, and will tell you which file so you can restore it.

This is why scheduled scrubs are very useful; they dramatically lower the risk of anything bad happening during a resilver. ZFS also corrects corruption, which RAID can not usually do.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That makes sense.

Okay, so back to my initial three questions....
1. So if a URE is on a data block, then any files in that block will likely be corrupt, but the array will continue to be rebuilt; but if the URE is on a metadata block, then the URE could potentially cause the array to fail to rebuild (assuming RAIDZ1). Would this also be the case in RAID5?

2. And in RAIDZ2/6 (assuming no other drives fail in the process) as long as you don't hit two UREs concurrently then a URE can never cause an array to fail to rebuild (or corrupt any data) regardless of where the URE is.

3. A few differences have been explained, but would a URE (or any other rebuild-related error) be handled differently between Z1/5?

#1. For hardware RAID it may cause the RAID5 rebuild to fail. It may not. It depends on factors such as how long it takes the disk to time out on the error, how long the RAID controller will wait before timing out the disk, and what the RAID controller's firmware does when an error is encountered. There is no "yes/no" answer to this question.

For ZFS, it's supposed to simply accept the data it received since you have no more redundancy. If that means that the data received leads to a kernel panic... your resilver just failed catastrophically. At this point, there's the possibility that you may not even be able to mount the zpool again (aka the data is now irretrievable).

#2. As long as you never have 3 UREs during a particular "stripe" read, then all will be fine. Yes, that means that in theory you can have corruption scattered all over all of the disks, but as long as there is always enough redundancy to recover for each "stripe" then everything will be okay.
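To put a rough number on that per-stripe point, here's a minimal Python sketch (my own, with made-up inputs): a RAIDZ2 vdev that is resilvering one replaced disk can still tolerate one read error per stripe among the surviving disks, so a stripe is only lost if two or more of them error on the same stripe. The per-sector error probability below is a crude figure derived from a 1-per-10^14-bits URE spec.

```python
# Sketch of the "enough redundancy per stripe" idea: during a RAIDZ2
# resilver with one disk already replaced, each stripe survives as long
# as at most one of the surviving disks returns a read error for it.
# Assumptions (mine): independent errors, crude per-sector probability.

from math import comb

def p_stripe_unrecoverable(surviving_disks: int, p_sector: float, tolerable: int = 1) -> float:
    """Probability that more than `tolerable` surviving disks error on one stripe."""
    return sum(
        comb(surviving_disks, k) * p_sector**k * (1 - p_sector) ** (surviving_disks - k)
        for k in range(tolerable + 1, surviving_disks + 1)
    )

if __name__ == "__main__":
    # e.g. a 6-disk RAIDZ2 with one disk replaced -> 5 surviving disks;
    # 1e-14 errors/bit * 4096 bytes * 8 bits as a crude per-sector figure
    p = 1e-14 * 4096 * 8
    print(f"Per-stripe loss probability: {p_stripe_unrecoverable(5, p):.3e}")
```

Per stripe the number is tiny, but a resilver reads a lot of stripes, which is the statistical point being made here.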

#3. That depends. Some SATA/SAS controllers are more prone to disconnecting disks that spend too much time in error recovery. If you're unfortunate enough to have a disk with a single bad sector that forces an error recovery cycle long enough to exceed the controller's command timeout, the disk will be disconnected from the system because of that single failed sector. One of the reasons we advocate for the LSI stuff is that this doesn't typically happen, except with *very* bad disks. All the other controllers are hit and miss on how conservative or liberal they are with dropping disks.

See why the discussion is never ending? The questions you ask aren't yes/no questions. They're more like statistical probabilities, how lucky you are, and whether you've made good choices with your FreeNAS server or not. People (especially IT people) don't seem to do particularly well with these and they try to simplify it down to a yes/no answer. You can't do that without losing very important concepts that matter to the hypothetical scenario.
 

djdwosk97

Patron
Joined
Jun 12, 2015
Messages
382
See why the discussion is never ending? The questions you ask aren't yes/no questions. They're more like statistical probabilities, how lucky you are, and whether you've made good choices with your FreeNAS server or not. People (especially IT people) don't seem to do particularly well with these and they try to simplify it down to a yes/no answer. You can't do that without losing very important concepts that matter to the hypothetical scenario.
I didn't really expect a yes or no answer, but the issue is more that I've read sources that say one thing and other sources that say the complete opposite. So it may be the case that both are true depending on what you consider truth.

Anyway, I think this thread has cleared things up.
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
The odds of it losing the pool or metadata are close to zero, because metadata is stored at least twice.

Yes, I agree. Not only is metadata stored twice, the copy is also distributed and stored on a different vdev.

Furthermore, on a typical storage pool only 2% of the data blocks are metadata. So the chance that a disk URE which is also unrecoverable by ZFS (due to insufficient redundancy) affects metadata is kinda low. I guess in most cases the result of such a "ZFS URE" is a broken user file. And for the typical home user that's really no biggy. ZFS tells you which file it lost and you can simply restore it from backup...

https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yes, I agree. Not only is metadata stored twice, the copy is also distributed and stored on a different vdev.

Furthermore, on a typical storage pool only 2% of the data blocks are metadata. So the chance that a disk URE which is also unrecoverable by ZFS (due to insufficient redundancy) affects metadata is kinda low. I guess in most cases the result of such a "ZFS URE" is a broken user file. And for the typical home user that's really no biggy. ZFS tells you which file it lost and you can simply restore it from backup...

https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape

I'm going to fill you in on two secrets:

1. Most of the users here have only 1 vdev. So the fact that a second copy might be able to be distributed and stored on a different vdev doesn't matter.. they have only 1 vdev. Only 1 of my zpools has more than 1 vdev, and that's because I deliberately did mirrors for iSCSI performance.

2. Right now I have three zpools. One is using about 10% for metadata, another is about 14%, and another is 26%. So I find that 2% claim to be just a step above hilarious (where did you get that number? I'm really curious who the heck would lowball it so badly). If memory serves me right, FreeNAS estimates your free space based on 10% (maybe 12%?) for metadata. The quantity of metadata depends on your block size. If you are using iSCSI, your block size is 8K, which generates much more metadata than your common CIFS share using a dataset with 128KB block sizes. The zpool that is 26% just so happens to be iSCSI-only.
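A crude way to see the direction of that block-size effect is the sketch below (my own, counting only the roughly 128-byte block pointer each data block needs, so it badly understates real pools, which also carry dnodes, indirect blocks, spacemaps and so on; that is why the real percentages quoted above are much higher).

```python
# Direction-only illustration of "smaller blocks -> more metadata".
# Assumption (mine): count only the ~128-byte block pointer per data
# block; real pools carry much more metadata than this (dnodes,
# indirect blocks, spacemaps, ...), so treat these as lower bounds.

BLKPTR_BYTES = 128  # on-disk size of a ZFS block pointer

def pointer_overhead_pct(recordsize_bytes: int) -> float:
    """Block-pointer bytes as a percentage of the data they describe."""
    return 100.0 * BLKPTR_BYTES / recordsize_bytes

if __name__ == "__main__":
    for rs in (8 * 1024, 128 * 1024):  # 8K iSCSI zvol vs 128K CIFS dataset
        print(f"recordsize {rs // 1024:>3}K: ~{pointer_overhead_pct(rs):.2f}% just for block pointers")
```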

As I reminded one of our regulars last week, linking to stuff on Oracle's ZFS is not a good idea because Oracle's ZFS is not the same as the OpenZFS code. There are massive, massive differences between them.

Additionally, quoting stuff from 2006 is also a bad idea because back then ZFS was only a year old. Yes. That's how old it was, and so much has changed in various ways that it's generally a *really* bad idea to use documentation that is that dated. ;)

HTH
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
Well, the 2% number for metadata is from the linked Oracle blog. It seemed kinda plausible to me, but I have no real world numbers to compare to.
May I ask how you queried your metadata rate? I would be interested in checking mine, too.

Also, are these numbers indicating the disk space of the pool used by metadata? If so, I'd assume the two or three metadata block copies, respectively, are factored in.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There's a zdb command you can run that spits out the size of metadata, file data, and free space. I'm about to go out to dinner with friends, but PM me and I'll get you the command when I get back later tonight or tomorrow.

I think it's more likely that the 2% is very, very outdated and very inaccurate for OpenZFS. Oracle's ZFS implementation has very little to do with OpenZFS's implementation. We've got lots of stuff that Oracle doesn't have... and it's all stored as "metadata". :D
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Okay, so back to my initial three questions....
1. So if a URE is on a data block, then any files in that block will likely be corrupt, but the array will continue to be rebuilt; but if the URE is on a metadata block, then the URE could potentially cause the array to fail to rebuild (assuming RAIDZ1). Would this also be the case in RAID5?

The ZFS metadata is redundant, so a URE on metadata is actually unlikely to be fatal. If ZFS cannot recover the data, then it sees a zeroed block (because the checksum failed), which can present some operational problems. That could include a crash. ZFS has no good way to recover from this since it's a "this should never happen" type of event.

With RAID5, there's no indication as to what the filesystem might be doing since RAID5 has nothing to do with it. The RAID5 controller should merely try to rebuild redundancy from the remaining disks, and any sectors it can't read may result in corrupted data. A traditional filesystem can usually cope with that because they've been designed to run on non-RAID5 disks and have things like chkdsk or fsck or whatever to help cope, but of course "data may be lost." However, a RAID5 controller might decide to do other things, such as quit a rebuild due to a lot of read errors, and in this case you get the annoyed sysadmin who has to copy the fail-y disk to a brand new disk without errors, then put it back into the array and try a rebuild again... but you're actually pretty likely to come up with a recoverable filesystem as long as important metadata hasn't been lost.

This is why ZFS proponents tend to be very in-favor of things like RAIDZ2 or RAIDZ3. Redundancy is critical to the ability of ZFS to work correctly.

2. And in RAIDZ2/6 (assuming no other drives fail in the process) as long as you don't hit two UREs concurrently then a URE can never cause an array to fail to rebuild (or corrupt any data) regardless of where the URE is.

I'll say "in theory," because in practice "s*** happens". Usually a URE is not a magic "one block went bad" event. It's usually more like "ten thousand blocks went bad because of a head crash" type thing.

3. A few differences have been explained, but would a URE (or any other rebuild-related error) be handled differently between Z1/5?

"Yes." It's like asking if there's a difference between riding a motorcycle and driving a truck, and what the differences are. They're just very different things, and the list of differences is a lot bigger than the list of superficial similarities.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm going to fill you in on two secrets:

1. Most of the users here have only 1 vdev. So the fact that a second copy might be able to be distributed and stored on a different vdev doesn't matter.. they have only 1 vdev. Only 1 of my zpools has more than 1 vdev, and that's because I deliberately did mirrors for iSCSI performance.

Worthless "secret." The loss of a vdev is the loss of the pool, and for a single-vdev pool, the second copy is stored on the single vdev. There is no scenario where your "secret" matters. :smile:
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
There's a zdb command you can run that spits out the size of metadata, file data, and free space. I'm about to go out to dinner with friends, but PM me and I'll get you the command when I get back later tonight or tomorrow.

Or just type "man zdb" from the CLI and figure out that it's probably "zdb -bb" :tongue:


I think it's more likely that the 2% is very, very outdated and very inaccurate for OpenZFS. Oracle's ZFS implementation has very little to do with OpenZFS's implementation. We've got lots of stuff that Oracle doesn't have... and it's all stored as "metadata". :D

As with almost anything ZFS, that's largely dependent on so many factors. And don't start throwing stones over towards Oracle, they've got some nice stuff we don't have.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
As with almost anything ZFS, that's largely dependent on so many factors. And don't start throwing stones over towards Oracle, they've got some nice stuff we don't have.

Hey, you're the one throwing stones. I simply stated that OpenZFS and Oracle's ZFS aren't the same. I never said one was inferior to the other..... but thanks for assuming I said that.
 