How many of you have experienced catastrophic pool failure and why?


ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
I'm a total FreeNAS noob but I've managed to avoid any major disasters. I set aside my "experience" with AIX and other Unix-like systems, and I try to follow everything cyberjock writes, as far as my budget allows ;)

The closest I came to losing my data was when I was transferring my old pool (RAID-Z1) to my new one and two of the 4 old drives disconnected mid-flight due to SATA cables coming loose (10 drives in an mITX box = flying spaghetti monster of SATA cables). I was certain my old pool and all the data were gone. But two replaced cables and a reboot later my data was still there, with a message going something like: "Hey! There was an error, check if your data is OK". I did a scrub to confirm, and I'm still impressed that it managed to survive that :P
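
(For anyone who hasn't run one manually: a scrub is just a couple of commands. "tank" here is an example pool name, not mine:)
Code:
zpool scrub tank        # start the scrub
zpool status -v tank    # watch progress; -v also lists any files with errors
zpool clear tank        # reset error counters once everything checks out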
 

dtemp

Dabbler
Joined
Apr 16, 2014
Messages
41
Things like DB checks need to be added in after the fact.

Can you post a link with more info on this? I have my scrubs and SMART tests scheduled and I back up my config; is this something new I should be doing?

To answer the OP's question... I'll tell you how I *almost* lost a pool. It was from being a hotshot and trying to hot-swap within my case without a backplane, meaning I had to manually disconnect and reconnect power and SATA cables. While swapping one bum drive, I knocked the SATA cable of another drive, which made it unhappy. I was running RAIDZ2 so I was fine against even two failures, but I can see how bumping the wrong wires with the wrong config could screw someone. From now on, I'm doing cold swaps.
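
(If you go the cold-swap route, it's worth offlining the disk first so the pool doesn't get surprised. A rough sketch; the gptid names are placeholders, not real ones:)
Code:
zpool offline tank gptid/OLD-DISK-ID    # take the failing disk offline
# power down, swap the drive, power back up, then:
zpool replace tank gptid/OLD-DISK-ID gptid/NEW-DISK-ID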
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
So there I am, with the whole system built by the book using fully tested enterprise-grade hardware, fully tested disks, a UPS, etc., happily copying files from one pool to another and... spontaneous reboot. System starts back up and... spontaneous reboot. Start in safe mode... kernel panic.

Remove all the disks, system starts fine. Fit disks for pool01... spontaneous reboot. Remove pool01 disks, fit pool02 disks... system starts fine. Run 'zpool import pool01'.... spontaneous reboot.

pool01, it seems, is gone.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
trionic:

So tell me why you have two pools? Are you upgrading? Where did the "old" pool come from and where is the "new" pool coming from? Considering you had a build list in May, something doesn't compute. Nobody builds a server with the expectation of buying a whole new server just a few months later.

Please explain in detail what you have done and what you are doing. I want to understand the situation in its entirety.

I also ask that you not try to use your server while I try to come up with ways to get your old pool back. If you start doing things on your own without guidance you may make this situation unrecoverable. I'll help you out as much as I possibly can, but if you start doing your own thing it can ruin the troubleshooting plan I will devise.

Thanks.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Hello cyberjock. I had the build list in May, bought everything, built the server, took loads of photos and was planning to update the build thread with hopefully useful information for anyone else embarking on this journey. The server I referred to in my post is the same one on which I received all the excellent advice.

The two pools are 6x4TB and 6x3TB. I chose 4TB and 3TB pools for ease of storage expansion later on. I plan to eventually fill the chassis with 4TB drives but right now I have quite a few 3TB drives. Rather than use resilvering to expand the pool from 3TB to 4TB drives, I decided to have one pool of entirely 4TB drives and one pool of entirely 3TB drives. I'd add further vdevs to the 4TB pool and copy data from the 3TB pool, eventually replacing all the 3TB drives (which may go into a JBOD enclosure for backups).

Some background information: after playing around with FreeNAS and NAS4Free in VMs, I proceeded to do the same on the physical server. TBH I was running NAS4Free at the time, although for the core ZFS stuff I've started using the terminal more, while saving the FreeNAS/NAS4Free GUI for setting up users, Samba and the like.

The data on pool01 had been copied from numerous NTFS drives, using ntfs-3g in read-only mode. That was done a week ago and worked very well. I was impressed with the system's stability, as it was able to handle data from each of three hard disks being copied to the two pools simultaneously. I do still have those source hard disks and so probably can reconstruct much of the data. I was only a day or so away from executing destructive badblocks tests on those source drives such that they could be reused in a new vdev. Scary.
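
(For reference, the read-only NTFS mounts were along these lines; the device and mountpoint here are examples, not my exact ones:)
Code:
ntfs-3g -o ro /dev/da8s1 /mnt/ntfs      # mount the NTFS source read-only
cp -R /mnt/ntfs/* /mnt/pool01/data/     # copy into the pool
umount /mnt/ntfs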

Both pools were created using the same version of ZFS. One NTFS drive had been mounted using ntfs-3g but that drive was not in use. A simple file copy task (cp -R /.../* .) was executing at the time of the reboot. Currently the pool01 disks are disconnected from the backplane. pool02 imported fine, is online and undergoing a scrub.

So although I was using NAS4Free at the time and can probably reconstruct the lost pool, I'd still really like to know why this problem occurred, if you're willing. Without knowing that I cannot be sure to avoid the same problem in future and actually really lose everything. If not, I will understand.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If you are using NAS4Free I'm hesitant to provide any support. Mostly because I don't know how their OS works "behind the scenes". The best and pretty much "only" recommended solution is to create the pool in FreeNAS and use it only in FreeNAS. Anything else is basically 'unsupported'. If you used the pool on NAS4Free and they have some kind of ZFS bug I won't know about it so we'd go running around in circles with no clue what we are really even looking for. FreeNAS makes certain design decisions with how it creates, maintains, and handles pools. I can't even validate that those basic design decisions are the same, let alone anything else.

So... do you remember this?
Every single person I've seen lose data has done it because they made at least one really big mistake. Most have made multiple stupid decisions. Nobody has ever done everything right and still lost data that I have seen on this forum.

I'd say you just fell into the "made a really big mistake" category. :eek: The really big mistake being mixing FreeNAS and NAS4Free.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Ahhh... actually both pools were created with NAS4Free. Sorry for not being clear. I've read a few threads on various forums about switching between BSD/ZFS-based NAS software and the advice always is "don't do it!". So I didn't :)

I'll post this on the NAS4Free forums and see if anyone can figure out what went wrong and how it can be avoided in future.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So I'll say this... if you ever used the pool on FreeNAS, the NAS4Free forum is likely to tell you the opposite of what I'm arguing. They're likely to say "well, we don't do FreeNAS and we don't know how their OS might have broken your pools... so good luck!"

Gotta decide on an OS, stick with it, and not try criss-crossing OSes with the same pool.
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
In addition to what cyberjock said, if you decide on using FreeNAS, resist the temptation of doing things from the command line if they can be done from the webGUI.

FreeNAS is designed to run as an appliance. Doing things outside the GUI can lead to problems.
 

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
I am a System, Network, OS, and ZFS Badass. :cool::p

I am NOT a FreeNAS Badass. (Paging Cyberjock, nice doberman, good doggy... :D)

I like to tinker and I have had pools fail when I did bad and/or (sometimes deliberately) stupid things with them. "Stupid" in this context means not only contravening the stickied wisdom here but also some weapons-grade stupidity of my own devising. (See "ZFS Badass" above.) However, all those failures were on test pools or with data that had a verified copy elsewhere before the "Hold my beer and watch this!" moment.

My comment (and I do have one) is that ZFS (and for the most part FreeNAS) is SUPPOSED to be an all-or-nothing affair. If the damage is recoverable then recovery should be as automatic and painless as possible: ideally you should only notice a few log entries and respond to whatever needs fixing, not have to sweat over fsck, chkdsk, etc. Pretty much anything you can think of that will result in the loss of a zpool would kill any other storage system too.

The best example is probably loss of a VDEV. I don't know of any other storage system that will keep storage online after the loss of a RAID group. Even in this case with a dead VDEV, data that was in the pool and not on that VDEV is still technically recoverable, but you're now into FORENSICS, which is a painful, skilled, and expensive exercise no matter the system. If you're truly interested in ZFS forensics you might want to start by having a look here: https://www.joyent.com/blog/zfs-forensics-recovering-files-from-a-destroyed-zpool (here be mighty dragons, you have been warned!)
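
(To give a taste of the rabbit hole before you click that link: zdb can read labels off member disks and poke at exported pools without importing them. The device and pool names below are examples, and here too be dragons:)
Code:
zdb -l /dev/da6     # dump the ZFS labels from a member disk
zdb -e -d pool01    # inspect the datasets of an exported/unimportable pool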

When a zpool is unmountable it's like when an ER doctor calls Time of Death: They did everything they could, but it's over bar the paperwork and the autopsy if there's a reason. Fortunately in the storage world we get to break out the backups. After all there are two kinds of people: Those who back up religiously and those who haven't lost enough data yet.;)
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
Right now, on the verge of my first ZFS build, I am not comfortable with that consequence. I know, mirrors, backups etc, but some of the reasons for pool loss seem more and more to me like design flaws. For example: loss of a VDEV = pool loss; corrupted system metadata = pool loss. ZFS does not degrade gracefully.

Yeah, and both of those things would also cause complete loss on any other filesystem. The terminology may be different: instead of a VDEV you'd be talking about losing enough disks in a RAID group for it not to be recoverable (= total loss), and corrupted filesystem metadata means total loss on any filesystem. No usable filesystem = no usable filesystem, no matter if it's NTFS, ZFS, or FAT.

The no-recovery-tools bit? That's simply a market thing. Tons of NTFS recovery tools exist because it's not only used in servers, but also in desktops, and in a vastly larger number of machines than ZFS. ZFS, however, is targeted at servers, where if you're operating a server you should have proper backups in place and not need recovery tools. For the idiots that don't have proper backups in place, the numbers are so much fewer than those needing to recover their NTFS or FAT volume that the demand simply hasn't warranted anybody making a ZFS recovery utility. It would technically be possible, but it would take a metric shit-ton of man-hours and the demand simply isn't there.

So really none of those things are design flaws, as they aren't part of the design; they are simply vulnerabilities in the nature of filesystems. The chance of losing an entire NTFS filesystem on desktop hardware is just as great as with ZFS; you're just going to experience selection bias here, because this forum focuses on ZFS and on troubleshooting, so if someone experiences a ZFS filesystem loss, it's likely here that they'll come. The guys that experience NTFS filesystem loss aren't coming here for you to see as much of.

We're big on the ECC and server hardware requirements because you should be running that on any server you're running, not just ZFS based ones.
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
To be clear: when I switched between FreeNAS and NAS4Free it was with totally blank hard disks ('dd if=/dev/zero of=/dev/da*'). No criss-crossing.
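
(Side note for anyone copying that approach: zeroing whole multi-TB disks takes ages. If the goal is just to make ZFS forget a disk, wiping the labels should be enough; a sketch, with an example device name:)
Code:
zpool labelclear -f /dev/da0    # erase just the ZFS label areas on the disk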

As it turns out I figured out the problem and recovered the pool. The whole thing was unremarkable and the cause obvious (in hindsight).

Now, I have read loads of pool-recovery threads on here and often the advice is to remove/add disks one by one until a failure disappears/occurs. Based on that I had already tried the following steps but without success.

I noticed that the spontaneous reboot occurred when an attempt was made to mount /dev/da6 and so I removed that drive from the chassis and rebooted. NAS4Free booted up fine.

'zpool import' showed pool01 present but degraded. I successfully imported the pool using 'zpool import -f pool01' (I know using -f is a last resort, but the data on this pool was disposable so I had nothing to lose).
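
(For anyone in a similar spot: a read-only import is a gentler first attempt than -f, since it won't write anything to the pool. Something like:)
Code:
zpool import -o readonly=on pool01    # import the pool without allowing writes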

I then put /dev/da6 into a different chassis and booted up the excellent SystemRescueCD Linux distro. 'smartctl -A /dev/da6' showed:
Code:
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 1


So the cause of the spontaneous reboots was a(nother) faulty brand-new Western Digital Red (which had initially tested fine). Disappointing that such a failure could cause an operating-system meltdown. Whether that's down to ZFS or NAS4Free I don't know.
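
(If you want to put a suspect drive through its paces yourself, the usual smartctl sequence is roughly:)
Code:
smartctl -t long /dev/da6       # start an extended offline self-test
smartctl -l selftest /dev/da6   # check the result once it finishes
smartctl -A /dev/da6            # re-read the attribute table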

There were files on the pool that I needed and so I recovered those. I could have stuffed a 4TB cold spare drive into the chassis and resilvered but by that point I had changed my plans regarding NAS4Free.

This experience has forced me to acknowledge and act on something that had been bothering me since I decided to switch from FreeNAS to NAS4Free: support in the event of catastrophe.

I hope that everyone picked up on what happened earlier in this thread. Just a few hours after I had posted news of pool01's demise, cyberjock appeared, ready to apply his expertise to recovering my data. That's the FreeNAS forums experience.

For the NAS4Free forums experience, you get pleas for help that (a) go unanswered, (b) fizzle out without a solution, or (c) get "sorry dude ur datas gone"-type responses:

From the thread "is my ZFS pool unrecoverable":
"but i don't think anyone can recover ur pool."
"ZFS pool information is damaged on ur hdds. Try google "zfs fschk"."

And the one that sent a chill down my spine:
"i doubt there is any person on this forum that can help you."

I am not bashing NAS4Free and I know that the FreeNAS vs NAS4Free comparisons have been done to death. However, I will say that I prefer the NAS4Free interface. The pool-creation steps better match the ZFS literature's description of its component parts. The UI is quick and simple.

However I just cannot ignore the vast difference in community expertise between FreeNAS and NAS4Free. I greatly appreciated the no-nonsense advice received in my build thread (and then felt guilty when I'd decided to use NAS4Free instead).

When my data is at risk from a ZFS problem then it's the likes of cyberjock and company that I need on my side (what happened to ProtoSD?). That support should be a major consideration for anyone who is evaluating the choice between these two products.

I recommend reading this epic recovery thread: Please help! Started up FreeNAS, suddenly voluime storage (ZFS) status unknown?!.
For the idiots that don't have proper backups in place
So just how are the home users here economically backing up 96TB raw / 64TB actual of data? Or maybe they don't. Invest in a mirror/tapes, or take the risk of having no backups.

The chance of losing an entire NTFS filesystem on desktop hardware is just as great
Which, on a different (prosumer) "server", I am finding out, as yet again just a few days ago a 3TB NTFS disk suddenly became corrupted. Eight Memtest86+ passes reveal no bad memory (against my expectations) and now the finger of blame points to those StarTech SATA cards... (note: this is not my ZFS server!)

We're big on the ECC and server hardware requirements because you should be running that on any server you're running, not just ZFS based ones.
Again, something I have come to appreciate just lately after a long line of corrupted NTFS disks. Is the ECC memory path constrained to i3s and Xeons?

We live and learn...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I am not bashing NAS4Free and I know that the FreeNAS vs NAS4Free comparisons have been done to death. However, I will say that I prefer the NAS4Free interface. The pool-creation steps better match the ZFS literature's description of its component parts. The UI is quick and simple.

However I just cannot ignore the vast difference in community expertise between FreeNAS and NAS4Free. I greatly appreciated the no-nonsense advice received in my build thread (and then felt guilty when I'd decided to use NAS4Free instead).

No need to feel guilty, most of us here just want to see happy users with data stored safely. We occasionally even tell people FreeNAS isn't the product for them.

When my data is at risk from a ZFS problem then it's the likes of cyberjock and company that I need on my side (what happened to ProtoSD?). That support should be a major consideration for anyone who is evaluating the choice between these two products.

Not to downplay the contributions of any community members, but relying on the free community help has some risk. ProtoSD, myself, and others have kind of come and gone for various reasons. I'm currently working a contract that has eaten most of my spare time, for example. A more compelling argument for FreeNAS might be that it has iXsystems behind it, so in the worst case scenario you could probably get a support contract from the people who are writing the stuff.

Eight Memtest86+ passes reveal no bad memory (against my expectations) and now the finger of blame points to those StarTech SATA cards...

Yup. See http://forums.freenas.org/index.php?threads/so-you-want-some-hardware-suggestions.12276/ where I discuss 99c bargain bin cards dropping bits.

Again, something I have come to appreciate just lately after a long line of corrupted NTFS disks. Is the ECC memory path constrained to i3s and Xeons?

No, but the average home user lacks the ability to do comprehensive testing of random combinations. You probably don't even have bad memory sticks sitting around. That means you won't actually know for certain how your choices stack up, and how (or even if) they log and report errors. So home users are best off not just picking random hardware and hoping for the best.

We live and learn...

Indeed.
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
You probably don't even have bad memory sticks sitting around.

Ooh, but I do; I managed to grab a DIMM with a reliable bitflip at a certain memory location from a previous job. (Not intentionally, mind you. They were doing hardware recycling and offered me a tower, so I took it since I was going to use it as my offsite backup server.)

Ended up finding the reliable bitflip during the MemTest86 stress-testing period on it. (So there's a great example of why the testing period is important.)

At the time I put it aside because I was like "Perhaps in the future I can use this to make a good example of why ECC RAM is so important." But then I later realized that I don't know how to perform such testing to produce said examples lol.

Ironically I still need to get that server on ECC. Thankfully, though, the motherboard in it (surprisingly) does actually support ECC RAM according to its BIOS, since there is a status field in there that says whether the DIMMs are or are not ECC. The production server has been more of the priority for the ECC upgrade for me, but I did just order the parts for that today, so that milestone should be reached once I get the parts.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Ooh, but I do, managed to grab a DIMM with a reliable bitflip at a certain memory location from a previous job. (Not intentionally mind you. They were doing hardware recycling and offered me a tower so I took it since I was going to use it as my offsite backup server)

Yeah, I know you work in IT, but the comment was aimed at trionic. If what you got wasn't ECC memory then it is useless for the general thing I was thinking of: identifying how a system reports errors. If it wasn't ECC then the system cannot self-detect the error, so it'd never be able to self-report it. IT people do have a tendency to collect these sorts of fun things though :)

Some lesser boards will offer ECC support but may simply correct errors that are identified without reporting them. You want errors to be logged, because it represents your opportunity to be aware of a fail-y module without an actual failure occurring. Then you proactively replace or RMA the module prior to it potentially getting worse.

One of the reasons I advocate for server-grade boards, especially from a manufacturer such as Supermicro where they specialize in that, is because ECC logging and reporting represents one of those easily-overlooked aspects of a board design. Would a board from EVGA or Biostar support ECC? And if so, log and report ECC errors? Server use isn't their focus, and the typical consumer doesn't put ECC in desktops. I don't know for sure that they don't, but I'd be skeptical, and I'd want to test! Past history indicates that Supermicro DOES support ECC, though I like to test new models anyways.

As an IT person you're probably familiar with artificially creating failures in a testing environment to see how the system handles problems prior to putting something in production, and basically you just want to take that sort of strategy with RAM and ECC error detection.
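
(A minimal sketch of what "test how it reports" can look like on a FreeBSD-based box, assuming the board surfaces corrected errors via machine-check at all; the exact log strings vary by platform:)
Code:
dmesg | grep -i mca                     # machine-check records, if any
grep -i correctable /var/log/messages   # corrected-ECC lines, if the board logs them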
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
Yeah, I know you work in IT, but the comment was aimed at trionic.

Oh yeah, I know; I just felt like popping in with that.

The thing I was thinking, more or less, would be to flip a bit at a certain point and be able to say "we slipped a bit in the RAM that was holding such-and-such information, and this is what happened."

As an IT person you're probably familiar with artificially creating failures in a testing environment to see how the system handles problems prior to putting something in production, and basically you just want to take that sort of strategy with RAM and ECC error detection.

To an extent, most of my IT experience is focused on the "after things go wrong" part though since most of my professional IT jobs have been troubleshooting positions rather than project or deployment positions. (Have had one project planning position though as well as another with some minor deployment experience)

Currently doing a server troubleshooting position, which interestingly enough my experience with running FreeNAS at home helped me get. (Not that it's at all FreeNAS related, just that it was at least some experience with running a server)
 

SLIMaxPower

Explorer
Joined
Aug 13, 2011
Messages
64
Like with everything else, always have a backup.

I have been using FreeNAS + ZFS since 8.x, on HP MicroServers (N36L & N40L). During that time 8.x had some issues with errors, but I put that down to 8.x. Since 9.x I have not had a single issue, even with power outages.

Currently running 24x7: an N36L with 8GB ECC RAM, 7x3TB WD Greens (2 pools, RAID 0) and 1x2TB Green (jails: Plex, ownCloud, SABnzbd, VPN, Debian, Sick Beard). Will be moving to 16GB ECC this week. I rsync locally, weekly, to an N40L with the same specs as the N36L apart from the jail pool.
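
(The weekly sync is nothing fancy; the hostname and paths here are examples:)
Code:
rsync -avH --delete /mnt/tank/ root@n40l:/mnt/tank/    # mirror the pool to the backup box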

Have not had one issue. Have had several power failures and no issues following. Very happy with my current low-power setup.
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Have had several power failures and no issues following.

Get a UPS. One of these days you won't be able to recover, and with RAID0 you'll be dependent on your backup.
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
2 pools raid 0

OMFG WTF is wrong with you?

I mean, you did well getting good data security with the ECC and ZFS combo, and then you toss the point of using either of those things right out the window by using RAID0.
 

SLIMaxPower

Explorer
Joined
Aug 13, 2011
Messages
64
OMFG WTF is wrong with you?

I mean, you did well getting good data security with the ECC and ZFS combo, and then you toss the point of using either of those things right out the window by using RAID0.



Maximum space and speed.

You did read that I have 2 identical systems, right?
 