How many of you have experienced catastrophic pool failure and why?

Status
Not open for further replies.

trionic

Explorer
Joined
May 1, 2014
Messages
98
ZFS seems to have two modes: "everything's fine" (within the limits of redundancy) and "your pool's gone" (PS: "there are no recovery tools").

All is well until suddenly it's not and all your data's gone. With ZFS the risk of a catastrophic failure is low but the consequences are devastating.

Right now, on the verge of my first ZFS build, I am not comfortable with that consequence. I know, mirrors, backups etc, but some of the reasons for pool loss seem more and more to me like design flaws. For example: loss of a VDEV = pool loss; corrupted system metadata = pool loss. ZFS does not degrade gracefully.
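
To make the "loss of a VDEV = pool loss" rule concrete, here is a minimal Python sketch. It is a toy model, not anything ZFS itself runs: the `Vdev` class, the `pool_survives` function and the example layout are hypothetical names made up for illustration. It assumes only the standard redundancy levels: a stripe tolerates no failed disks, an n-way mirror tolerates n-1, and raidz1/2/3 tolerate 1/2/3.

```python
from dataclasses import dataclass
from typing import Iterable

# Number of failed disks each raidz level can tolerate (its parity count).
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


@dataclass
class Vdev:
    """Toy model of one top-level vdev (hypothetical, for illustration only)."""
    kind: str        # "stripe", "mirror", "raidz1", "raidz2" or "raidz3"
    disks: int       # number of disks in the vdev
    failed: int = 0  # number of disks currently failed

    def tolerance(self) -> int:
        """How many disks may fail before this vdev is lost."""
        if self.kind == "stripe":
            return 0
        if self.kind == "mirror":
            return self.disks - 1
        return RAIDZ_PARITY[self.kind]

    def survives(self) -> bool:
        return self.failed <= self.tolerance()


def pool_survives(vdevs: Iterable[Vdev]) -> bool:
    # Data is striped across all top-level vdevs,
    # so losing any single vdev loses the whole pool.
    return all(v.survives() for v in vdevs)


pool = [Vdev("raidz2", disks=6, failed=2), Vdev("mirror", disks=2, failed=1)]
print(pool_survives(pool))  # True: every vdev is still within its redundancy

pool[1].failed = 2          # the mirror loses its last disk
print(pool_survives(pool))  # False: one lost vdev takes the pool with it
```

In this model the healthy raidz2 vdev cannot compensate for the dead mirror, which is the behaviour the paragraph above describes.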

The purpose of this thread is not to trash ZFS but to understand the real probability of pool loss and how to mitigate against it (apart from the obvious mirrors and off-site backup).

So: how many of you have experienced such a loss and why?
 

ser_rhaegar

Patron
Joined
Feb 2, 2014
Messages
358
Virtualized, changed a bios setting and the motherboard/ESXi decided to throw lots of errors through the HBA. Pool was toast. Known issue with motherboard and pass through in ESXi. Restored from mix of hot backup (replication) and cold backup.

Haven't had an issue since I flipped the setting back. Also made the move to physical to make it easier to support hosting VMs.

I also had memory errors with a RAM upgrade on the new physical host. ECC kept the system running flawlessly. Just saw the errors in the nightly email and replaced the RAM. No more errors.

Listen to what the stickies say and you'll be fine. Don't virtualize, use ECC, don't use raidz1, don't stripe single vdevs, load up RAM.
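
Two of those recommendations (no raidz1, no striped single-disk vdevs) concern the pool layout itself, so they can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: `layout_warnings` and the `(kind, disk_count)` description of a pool are hypothetical and are not part of FreeNAS or ZFS. The other items (virtualization, ECC, RAM) cannot be judged from the layout and are not covered.

```python
from typing import List, Tuple


def layout_warnings(vdevs: List[Tuple[str, int]]) -> List[str]:
    """Return warnings for a pool described as (kind, disk_count) pairs.

    Hypothetical helper: it only encodes the two layout rules quoted above.
    """
    warnings = []
    for index, (kind, disks) in enumerate(vdevs):
        if kind == "raidz1":
            warnings.append(f"vdev {index}: raidz1 survives only one disk failure")
        elif kind == "stripe" or disks == 1:
            warnings.append(f"vdev {index}: no redundancy, one disk failure loses the pool")
    return warnings


print(layout_warnings([("raidz2", 6)]))                 # no warnings
print(layout_warnings([("raidz1", 4), ("stripe", 1)]))  # one warning per vdev
```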
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
We've discussed this to death, so I'll keep this short and sweet.

1. Your definition of "design flaws" is totally arbitrary. Saying "Loss of a VDEV = pool loss; corrupted system metadata = pool loss" makes you look silly in my opinion. Are you going to tell me that RAID6 with 3 disks failed should continue to operate? Are you going to tell me that if the file system is corrupt the pool should continue to operate unabated? That is literally the equivalent of a loss of vdevs and corrupted metadata.

2. Every single person I've seen lose data has done it because they made at least one really big mistake. Most have made multiple stupid decisions. Nobody has ever done everything right and still lost data that I have seen on this forum.

So your concern, while not totally unexpected, has no merit unless you are about to break multiple recommendations we make and then claim it as a "design flaw".
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
Well, to the contrary, ZFS has been more robust than XFS and EXT3 in my env. I'm very comfortable with the latter file systems and have had some oddball issues over the years, some of which were somewhat catastrophic, but nothing a backup and/or archive couldn't somewhat undo.

However due to the nature of how I rolled out FreeNAS (rather in a hurry), I didn't really mod much and went with defaults. Had some cascading power supply, motherboard, general power failures and UPS failures (Murphy and his Law paid a visit and hung out a while) but FreeNAS hung in there like a champ, rather impressive actually.

So if you don't try to be too clever or a wise guy and think you know something others have missed, you'll be fine and dandy.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I have a v28 pool that I can't import on FreeNAS 9. There's a bug report somewhere on it. Doesn't qualify as a loss since it is peachy on FreeNAS 8, but it is a bit disconcerting.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
To echo what has been said here, the biggest threat to your pool is arrogance. Way back when, I first decided to use FreeNAS with my ancient PIII desktop computer. Needless to say, I didn't have enough RAM and I lost my pool. Of course, at the time, I was invincible and it was clearly FreeNAS's fault for being a PoS program that couldn't run (it's DIY, after all!!!).

Now, I've come to appreciate the pros and cons of FreeNAS. It's not a DIY project for the faint of heart. You have to be ready to put in the time and the money to get the right hardware and set up the system correctly. Things like DB checks need to be added in after the fact, and you need to make sure you've configured your UPS correctly (just to name a few).

If you can come to terms with the limitations of FreeNAS, then you'll be very successful, and you'll find that FreeNAS is far more stable and reliable than anything else out there. But if you decide that you are better than the system, and start messing around with it, you're gonna have a bad time.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
jgreco: was the pool created on FreeNAS? That behavior is usually indicative of a pool that wasn't created in FreeNAS.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
To echo what has been said here, the biggest threat to your pool is arrogance. Way back when, I first decided to use FreeNAS with my ancient PIII desktop computer. Needless to say, I didn't have enough RAM and I lost my pool. Of course, at the time, I was invincible and it was clearly FreeNAS's fault for being a PoS program that couldn't run (it's DIY, after all!!!).

Now, I've come to appreciate the pros and cons of FreeNAS. It's not a DIY project for the faint of heart. You have to be ready to put in the time and the money to get the right hardware and set up the system correctly. Things like DB checks need to be added in after the fact, and you need to make sure you've configured your UPS correctly (just to name a few).

If you can come to terms with the limitations of FreeNAS, then you'll be very successful, and you'll find that FreeNAS is far more stable and reliable than anything else out there. But if you decide that you are better than the system, and start messing around with it, you're gonna have a bad time.

If I could like that more than once I would. That is 100% truth.

You have to swallow your pride, realize that everything you know is wrong, and learn how to do it right. Your data depends on you doing it right. Not the way you want, not the way you "think" it should be, not the cheapest way. The right way.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
jgreco: was the pool created on FreeNAS? That behavior is usually indicative of a pool that wasn't created in FreeNAS.


https://bugs.freenas.org/issues/4721

There's a game plan for tracking it down there but I haven't yet gotten annoyed by my inability to log into the box due to FreeBSD 8.3.0 database incompatibilities... (expected problem due to the way I forced a downgrade)
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
I am a software engineer by trade with twenty years' experience (eight years as a freelancer), and am qualified to critically analyse and assess any system's efficacy. That forms a major part of my work with the rest being re-engineering software systems and re-educating organisations about best practice.

To echo what has been said here, the biggest threat to your pool is arrogance.
No accusation of arrogance could ever be made given the hours and days I have spent reading about ZFS/FreeNAS and experimenting with both in VMs. Even after all that research I acknowledge that I am still very much a ZFS/FreeNAS newbie.

I have endeavoured to assemble the components for a robust ZFS server. I will build the machine according to best practice and subject it to the recommended battery of smoke tests. While the tests execute I will experiment further with ZFS/VirtualBox and investigate the best ways to transfer existing data from ten separate hard disks onto the ZFS server.

During the past year I have lost so much (replaceable) data from hard disk failures and mistakes. All my efforts and expense on this project are directed towards avoiding further data loss.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
I am a software engineer by trade with twenty years' experience (eight years as a freelancer), and am qualified to critically analyse and assess any system's efficacy. That forms a major part of my work with the rest being re-engineering software systems and re-educating organisations about best practice.

AKA, arrogance.

I'm not trying to belittle your experience, but you have to understand that it doesn't really count for anything when it comes to FreeNAS and ZFS.

You seem to have the right attitude towards FreeNAS. Just don't let your experience get in the way.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What Nick2253 said. 99 times out of 100, if someone starts throwing around their experience they're either demonstrating they are totally clueless or they're demonstrating their superiority complex. And about 99x out of 100 when they're demonstrating their superiority complex they are also showing they are totally incompetent while waving the "I'm a badass" flag. While I don't think his post is applicable to this thread, it's definitely the norm for the forum. ;)

Not that I've seen that behavior in your posts or anything, but more often than not that gives me the "red flag" that the person is going to end up a statistic for lost data at some point in the future. Your discussion in the other thread discussing your build seemed very productive and gives the impression you are hellbent on doing it right and aren't pulling the arrogant card with your server.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
AKA, arrogance.

I'm not trying to belittle your experience, but you have to understand that it doesn't really count for anything when it comes to FreeNAS and ZFS.

You seem to have the right attitude towards FreeNAS. Just don't let your experience get in the way.
And there it is... the predictable reply. I could have written it in advance, verbatim.

Definition of arrogance: "offensive display of superiority or self-importance; overbearing pride". Nothing in my post fulfilled that definition. An arrogant person would have disregarded all advice and existing wisdom and blasted ahead with an ill conceived and flawed build. I did the opposite, acknowledging that I knew nothing and learning about the technology as best I can from informed sources and technical documentation. ZFS is a study in itself and I have barely got started.

Inexperienced engineers are more likely to be arrogant because they assume to know everything. Their subsequent failures build character, make them less arrogant but also more capable. They develop genuine confidence and humility. That's why time-served engineers are usually fascinating and inspiring people to work with and I respect them greatly.

Experience did actually count for a lot here even though I had no knowledge of building ZFS servers. Professional experience of building software systems (failing and succeeding) brought not specific domain knowledge but instead an informed approach to tackling technical projects. My experience of previously getting it wrong of course also has had a positive influence. Few people learn best from getting it right the first time.

99 times out of 100, if someone starts throwing around their experience they're either demonstrating they are totally clueless or they're demonstrating their superiority complex. And about 99x out of 100 when they're demonstrating their superiority complex they are also showing they are totally incompetent while waving the "I'm a badass" flag. While I don't think his post is applicable to this thread, it's definitely the norm for the forum. ;)

Not that I've seen that behavior in your posts or anything, but more often than not that gives me the "red flag" that the person is going to end up a statistic for lost data at some point in the future. Your discussion in the other thread discussing your build seemed very productive and gives the impression you are hellbent on doing it right and aren't pulling the arrogant card with your server.
I am glad that you see the distinction between the two.

For years I was a moderator on a car forum and I too learnt to recognise those user stereotypes, but I also learnt to keep an open mind and let people show their true colours. Sometimes I would be pleasantly surprised.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I am glad that you see the distinction between the two.

For years I was a moderator on a car forum and I too learnt to recognise those user stereotypes, but I also learnt to keep an open mind and let people show their true colours. Sometimes I would be pleasantly surprised.

Oh yeah.. you get to like 3k posts and you can figure out who will lose data and who won't just by their first thread.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Based on how a person responded to advice it wasn't that hard for any forum regular to figure out whether they were about to screw up that car repair job.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Few people learn best from getting it right the first time.

Oh, but, something vaguely the inverse, you can get it right the first time if you take the time to learn... and you'll find that many of us here are quite interested in helping an open mind understand where the old models break down, and how to do ZFS correctly.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
No catastrophic failures yet but a few close ones..
 

9C1 Newbee

Patron
Joined
Oct 9, 2012
Messages
485
For years I was a moderator on a car forum

Alright, what kind of car? That too will give me some insight about you. LOL

Your car forum skills will help you find out who knows their ish on this forum. That, and I am going to just come out and tell you straight away: jgreco and cyberjock (and a few others) kept me on the straight and narrow building a pretty bad ass system. I think you have the right attitude. You certainly have a good skill set to be a productive member of our community. I'm glad you are here. Welcome.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Thanks for the welcome :)

The car forum was SaabCentral (you thought I was going to say VWVortex, right? ;)). I love old Saabs, particularly the pre-1994 900 Turbos. I have done almost every job you can do on a 900 (apart from sheet metal work and gearbox rebuilds), usually starting from zero knowledge. I really enjoyed paying back the knowledge and I wrote a few guides.

Eventually, though, other demands took over and I went from reading every forum post to reading none. The site was sold off by the UK owners and I was replaced as a moderator. I haven't been on the site for ages. However, if you ever need to know how to replace a clutch on a Saab 900 then I'm your man.

On any forum there are always a handful of key contributors and it was immediately clear that jgreco and cyberjock were two of those. They must have helped hundreds of people create decent ZFS platforms, saving them from rookie mistakes and consequent data loss.

I read this thread recently and was deeply impressed with the awesome technical help and commitment given to the OP from some gifted FreeNASers.
 