RAIDZ and Unrecoverable Bit Errors


skis4hire

The purpose of RAIDZx is to prevent the need to re-create your zpool from backup if x drives fail simultaneously.

That is to say that RAIDZx is not a standalone backup solution.
I think this perspective is important for the rest of the post, so I'll say it again:

The purpose of RAIDZx is to prevent the need to re-create your zpool from backup if x drives fail simultaneously (plus, of course, protecting data that isn't backed up yet).

To this end, there's been some discussion about what 'x' should be in RAIDZx, mostly based on the article below, which raises good points about the problems traditional RAID5 has with current high-capacity HDDs.
http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

The point is that with traditional RAID5, an Unrecoverable Bit Error (UBE) hit during a rebuild usually means the rebuild fails and the entire RAID volume is lost, hence the need for RAID6.
ZFS is different: it checksums and resilvers at the block level.
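
To put rough numbers on the article's concern (a hedged sketch of mine, not figures from the article): assume the common consumer-drive spec of one unrecoverable read error per 10^14 bits and an illustrative 12 TB of data to read during a rebuild. A few lines of Python show why a traditional RAID5 rebuild is so likely to trip over a UBE:

# Back-of-envelope sketch. Assumptions (mine, not the article's exact figures):
# consumer-drive URE spec of 1 error per 1e14 bits read, and ~12 TB of data
# that must be read to rebuild one failed disk.
URE_RATE = 1e-14                      # unrecoverable errors per bit read
bits_read = 12e12 * 8                 # 12 TB expressed in bits

expected_ures = bits_read * URE_RATE
p_at_least_one = 1 - (1 - URE_RATE) ** bits_read   # assumes independent errors

print(f"Expected UREs during rebuild: {expected_ures:.2f}")      # ~0.96
print(f"P(at least one URE):          {p_at_least_one:.0%}")     # ~62%

With traditional RAID5 that first UBE typically kills the whole rebuild; with ZFS, as discussed below, it costs you one block.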

So, my ultimate question after learning about all of this was: what exactly happens with ZFS if there is an unrecoverable bit error (UBE) during the resilver of a RAIDZ1 vdev?

As far as I can tell, the answer is: you get one corrupted block per UBE, but you keep the rest of your data and the zpool.
Assuming the bit errors are random, multiple errors corrupt multiple blocks and, by extension, multiple files.
What if the UBE is in metadata? That could potentially cause a LOT of additional data to be lost.
Well, ZFS keeps at least 2 copies of all metadata, each with its own parity data (zpool-critical metadata is stored 3x), so a UBE in metadata can fall back on another copy.
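
As a toy illustration (the numbers are mine, not measurements) of why those extra copies matter: if a single copy of a block is unrecoverable with some small probability p during a resilver, and the copies fail independently, the block itself is lost only when every copy is gone:

# Toy sketch: per-block loss probability vs. number of ditto copies.
# p is a made-up per-copy loss probability, purely for illustration.
p = 1e-6

for copies, label in [(1, "plain data block"),
                      (2, "ordinary metadata (2 ditto copies)"),
                      (3, "pool-critical metadata (3 ditto copies)")]:
    print(f"{label:40s} P(loss) ~ {p ** copies:.0e}")

The independence assumption is generous, but ZFS does try to spread ditto copies apart on disk, so the basic point stands: a UBE landing in metadata is far less likely to cost you anything than one landing in a plain data block.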

That leaves us with the following conclusion for RAIDZx:

RAIDZx protects against the simultaneous failure of x disks, and one UBE during the resilver of those x disks corrupts one block/file without losing the zpool.

If you can tolerate the corruption of one or more individual blocks or files, then this is not a problem.
ZFS should tell you which files are corrupted (zpool status -v lists them) and you can restore them from backup.
My point is that I don't see a UBE during resilver as a fatal, pool-ending event.
I'll let you draw your own conclusion on what implication this has for RAIDZx selection.

I don't want to make this too long but I'll briefly consider the chance of a 2nd disk completely failing during resilver.
I find MTBF and MTTDL basically meaningless. (See https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl, and https://www.usenix.org/legacy/event/hotstorage10/tech/full_papers/Greenan.pdf)

Instead you can look at the annual failure rate which gives a more real-world sense of your risk.
I'll use the data from Backblaze (http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/)
This shows an annual failure rate of about 4%. If you assume failures are spread uniformly over the year, that's a failure rate of ~4.5E-6 per drive-hour.
Over a 48-hour resilver window, that's a failure probability of ~2.2E-4 (0.022%) for a given drive, or about 1 failure in ~4500 resilvers.
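
For anyone who wants to check or adjust that arithmetic, here it is as a small Python sketch; the AFR, resilver window, and drive count are the assumptions to tweak for your own pool:

# Reproduces the back-of-envelope numbers above; all inputs are assumptions.
AFR = 0.04                # annual failure rate per drive (roughly the Backblaze figure)
RESILVER_HOURS = 48       # assumed resilver window
SURVIVING_DRIVES = 1      # set to (vdev width - 1) for a degraded RAIDZ1 vdev

hourly_rate = AFR / (365 * 24)                          # ~4.5e-6 per drive-hour
p_one = 1 - (1 - hourly_rate) ** RESILVER_HOURS         # ~2.2e-4 per drive per window
p_any = 1 - (1 - p_one) ** SURVIVING_DRIVES             # any surviving drive fails

print(f"Per-drive failure prob over the window: {p_one:.1e} (~1 in {1/p_one:,.0f})")
print(f"P(any of {SURVIVING_DRIVES} surviving drives fails):  {p_any:.1e}")

Strictly speaking, what matters for a degraded RAIDZ1 vdev is the chance that any of the surviving drives fails during the window, so the per-vdev risk is a few times the single-drive number above.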

For a home user with backups, this may be an acceptable risk. For a critical business application, maybe not.

Welcome thoughts, corrections, and perspectives :)
 

cyberjock

I'd say you're pretty close to spot on. The problem is exacerbated by other outside influences, though.

Take me as an example. I built my system and I knew a few things up front:
1. I wouldn't have a backup of 100% of my data.
2. I wanted to build my zpool and be able to trust it for at least 2-3 years (hopefully more).
3. I didn't want to be in a never-ending hardware upgrade loop. I wanted to build it and leave it for at least 2 years.

Keep in mind a few things:

- Corruption of the zpool metadata could result in the box crashing.
- Corruption of the zpool metadata could result in resilvering and/or scrubbing being impossible to perform.
- Corruption of the zpool metadata could result in the pool going offline and being unable to be mounted ever again.
- Any of the above 3 could happen at any time and without warning once you have zpool metadata corruption.
- All four of the above bullets have been verified by prior users' experiences, so they aren't even up for debate.
- Once a zpool is unmountable, there are no recovery options that cost less than five figures. PERIOD. (In essence, if you couldn't afford to go with all the stuff we recommend... ECC RAM, RAIDZ2, etc., then there is no chance in hell you can afford to recover your data from your unmountable zpool.)

So, looking at the above bullets, it's pretty obvious that if I want to accomplish 1-3 above, I need to do my best to avoid corruption. If my metadata did get corrupted, the only solution would be to blow away the pool and rebuild, and without a 100% backup I'd have data loss (for most people it would be pretty significant data loss, too). So from the moment my pool sees corruption I'm in this crappy state where things can go badly suddenly and without warning (resulting in more data loss), and I have no way out of the predicament without destroying my pool (and losing more data, because I have no full backup).

So you can simplify it down to: "If I don't prevent metadata corruption, I can and should expect large amounts of data loss, up to 100%."

MTBF and MTTDL don't take into account things like a small number of errors (such as UREs) resulting in total data loss (which is absolutely a possibility with ZFS). Annual failure rates are equally meaningless here because a disk isn't counted as "failed" until it is VERY broken; a handful of UREs is devastating for ZFS but doesn't meet the bar for typical AFR calculations. You also need to consider that when you put a bunch of disks together and one disk fails, the chance of another disk failing within 4 hours increases spectacularly. So you not only need to survive the one disk that just failed, you need to survive the significantly higher likelihood that another disk will fail very soon (quite possibly before you could complete the resilver, even if you started it the instant the first disk failed).

So when building a system, do I go with RAIDZ1 and rely on hope and luck to avoid ending up in this crappy lose-lose situation, or do I buy that one extra disk, go with RAIDZ2, and not only significantly decrease the chances of data loss but also significantly increase the chances that I won't actually need that 100% backup that doesn't exist to begin with?
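
A rough, assumption-laden comparison using the per-drive numbers from the post above (4% AFR, 48-hour resilver, independent failures) shows what that one extra parity disk buys during a single resilver:

# Hypothetical RAIDZ1 vs RAIDZ2 comparison during one resilver, reusing the
# earlier assumptions (4% AFR, 48-hour window, independent failures).
# Real pools see correlated failures and UREs, so the real gap is likely wider.
from math import comb

p = 1 - (1 - 0.04 / (365 * 24)) ** 48      # per-drive failure prob in the window

def p_at_least(k, n, p):
    """Probability that at least k of n drives fail, assuming independence."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

surviving = 5  # e.g. a 6-disk vdev resilvering after one failure
print(f"RAIDZ1 (no redundancy left):    P(pool loss) ~ {p_at_least(1, surviving, p):.1e}")
print(f"RAIDZ2 (one parity still left): P(pool loss) ~ {p_at_least(2, surviving, p):.1e}")

Even under these optimistic independence assumptions, the extra parity disk cuts the resilver-window loss probability by roughly three orders of magnitude, before accounting for UREs or the correlated failures mentioned above.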

To make matters worse, many people are pretty negligent with their FreeNAS server. Common problems that result from negligence are:

- The server is set up without email alerts or any proper maintenance schedule to catch failing disks, and it's never touched again until the shares disappear (usually because the pool is already lost).
- Many people don't even realize that their replication task (or rsync) stopped working weeks or months ago until it's too late. If you are using rsync for your backups (which I hate, btw), then you've probably rsynced the corrupted files right over the good copies on the backup server. Hooray for backups!
- Often, once you have corruption of the ZFS metadata, you'll find that snapshots and replication are not always usable. So your backup schedule might betray you. Yay!

By far, more than 95% of the pools I've seen people have to rebuild were RAIDZ1. Not that RAIDZ1 is 100% to blame for all of them, but usually someone going with RAIDZ1 is cutting every corner they can (reusing that spare hardware RAID controller because they have it, reusing or buying a cheap desktop because it's cheaper than server-grade, not using ECC and arguing that it's too expensive, and virtualizing FreeNAS come to mind as the most common problems). But users don't know they've cut too many corners until it blows up in their face.

RAIDZ2 and RAIDZ3 give you more time before total pool failure, so often people who went RAIDZ2 but screwed up really badly (by doing something like using a hardware RAID controller) find out they lost a disk and realize they can't do a disk replacement or complete a resilver because of their poor hardware choices. But they at least have some time to back up their data and redo things correctly, versus the sucker who went RAIDZ1 and lost it all at once.

Seems pretty straightforward that RAIDZ1 is a suicide mission for your data (and your plan). If there were a flow chart for this, it would pretty much always start with "Is your metadata corrupted? Yes." and end with "You lost most/all of your data." Overall, this is simple risk assessment, mitigation, and management. You have to balance the risk against the gains. The risks are incredibly high (the whole point of a file server is to store your data, and people often build a system that can't even do that reliably), and the gains are pretty low (if you had bought that one extra disk and gone RAIDZ2, you'd still have your data after that URE). Unfortunately, geeks seem to be horribly bad at assessing, mitigating, and managing the risks they have. They want it cheap, fast, and reliable. Then we've got to give them the bad news and tell them they have to rebuild their server and have no data left to store.
 

skis4hire

Thanks for the reply. I think we're on the same page. RAIDZ2 is the standard.
I don't want to seem like I'm recommending a lesser setup and I think your post takes care of that :)

Personally, I'm interested in ZFS for home use where my NAS will be only 1 location of several for my critical data.
So I'm OK with my setup being a bit of an experiment and understand the risk.
I have some further ideas for my critical data and I'll make a separate post about them.

Your point about building for 2-3 year+ reliability is well taken.

I am interested in the metadata corruption cases you mention, and whether those people were using ECC RAM and running regular scrubs.
From what I've read, metadata has 2 copies plus parity, and some higher-level metadata (zpool-critical) has 3 copies plus parity.
Not to say you can't still screw things up, but it would seem to require a severe event.
 

cyberjock

skis4hire said:
I am interested in the metadata corruption cases you mention, and whether those people were using ECC RAM and running regular scrubs.
From what I've read, metadata has 2 copies plus parity, and some higher-level metadata (zpool-critical) has 3 copies plus parity.
Not to say you can't still screw things up, but it would seem to require a severe event.
For many of them, yes, they were using ECC. Some weren't, but that really doesn't have a bearing on the ultimate outcome so long as the RAM isn't actually corrupting bits or being hit by cosmic rays and such; ultimately you'd never be able to prove it either way. Let's also be honest with ourselves for a minute: once you start resilvering a disk, you're going to be taxing all of the disks that hold your data. They're going to get hotter as they are worked harder and harder, and the chance of a second disk deciding it has errors aplenty goes way up. Resilvering really is the worst workload to put your disks through at the exact moment you have zero redundancy, but it's something you basically have to deal with and a risk you have to take.

Yes, some stuff has multiple copies. Not all of it, but some. The stuff that doesn't have multiple copies is where your system gets boned. Some of the critical stuff (like the ZFS labels) has 4 copies (2 at the beginning and 2 at the end of the given partition). We've seen people corrupt all 4 (just had one the other night in IRC... poor guy that I won't embarrass further by putting his name here).

Ultimately, regardless of everything else you might do right, if the data doesn't exist because of some UREs, there is going to be chaos. That chaos might be some file you don't care about being inaccessible, or it might take the whole pool down. I do not believe in playing with luck (hence I don't gamble) so I assume that if I have a single URE that thing is gonna tear my pool down while laughing at me. "Ain't nobody got time for that!"

My main pool is 10x6TB drives in a RAIDZ2. I *seriously* considered going RAIDZ3 because of the size of the disks. I debated it with myself for 2 days, read up on statistics of losing my data and decided I'd go RAIDZ2 and keep a spare on a shelf.

Ultimately, people come to ZFS because they want all that tasty bit-rot protection. But what good is it when you want it so badly that you're willing to learn a whole new OS (FreeNAS), and then you turn around and deliberately cut off your nose to spite your face by going with RAIDZ1? To be honest, if a friend wanted to go RAIDZ1, I'd tell him to stick with Windows and go with RAID5. At least if you fubar your hardware RAID on Windows, there *are* tools out there to recover your data after you find out first-hand that RAID5 is dead. So you'll think you've lost your data forever, and then likely get a sizable portion of it back with recovery tools. With ZFS, once you can't do "zpool import tank", you are screwed. You curl up in the fetal position and cry.
 

sremick

I did manage to get about 95% of his files back.

I'd be very interested in more info about what tricks you were able to use to recover files off his drives. I realize there's plenty about his situation that might not apply to others, but it could still be useful to have it documented. :)
 

Fraoch

OK, so I can't resist...

[attached image]
 