Failed FreeNAS RAIDZ1 with 4 × 2 TB disks and 2 disk failures

Status
Not open for further replies.

jtunstall

Cadet
Joined
Aug 4, 2018
Messages
1
Hello, I am working on a FreeNAS RAIDZ1 ("RAID 5") pool with two failed 2 TB disks out of 4.
I was able to image (dd) one of the failed disks 100% to an identical 2 TB disk.
So now I have three of the four disks, yet the pool cannot mount: ZFS tracks members by GUID, and it still reports two missing members, because the cloned disk obviously does not belong (different GUID).
I attempted mounting the array in Ubuntu, but its response is that it "can" mount it, yet it does not.

Is there any way around this? Usually I would mount all the disks in a virtual disk utility with block size, RAID algorithm, and offset, and then extract the data.

thank you
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The GUID isn't a magical number unique to the disk, it's a random number that was generated by ZFS when the disk was added to the pool. It sounds like your copy was not performed correctly/successfully. Unless you can fix that, there's nothing that can be done.
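Whether the dd clone actually carried the ZFS label (and with it the member GUID) over can be checked with zdb, which dumps the four on-disk labels from a device. The device names below are hypothetical; this is only a diagnostic sketch:

```shell
# Dump the ZFS labels from the cloned disk (device name is hypothetical).
# A successful dd clone should show the same pool_guid and the vdev guid
# that the pool expects for that slot.
zdb -l /dev/ada1p2

# With no arguments, list the pools ZFS can see and which members
# it considers missing or faulted.
zpool import
```

If the clone's label shows a different GUID than the pool expects (or no label at all), the copy missed the relevant on-disk regions.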
 

M H

Explorer
Joined
Sep 16, 2013
Messages
98
Hope you have a backup, because the pool is likely gone. ZFS works wonderfully when set up correctly and proper precautions are taken, but unfortunately, there are ZERO recovery tools. For future reference, no one uses RAIDZ1 on modern arrays with large drives (>2 TB) anymore. You essentially guaranteed an unrecoverable read error at those sizes when trying to resilver.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
You essentially guaranteed an unrecoverable read error at those sizes when trying to resilver.
Nonsense. "Increased risk"? Yes, definitely. "Essentially guaranteed"? No, not even close, and there isn't the faintest hint that this is what's happened to OP's pool. We don't need FUD to advise people against RAIDZ1.
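The "increased risk, not guaranteed" point can be quantified with a back-of-the-envelope calculation. It assumes the commonly quoted consumer-drive spec of one unrecoverable read error per 10^14 bits read (real drives often do considerably better, so treat this as a pessimistic sketch):

```python
# Back-of-the-envelope URE probability during a RAIDZ1 resilver.
# Assumes the consumer-drive spec of 1 URE per 1e14 bits read;
# real drives frequently beat this figure by a wide margin.
URE_RATE = 1e-14  # probability of an unrecoverable read error per bit


def p_at_least_one_ure(bytes_read, rate=URE_RATE):
    """Probability of at least one URE while reading `bytes_read` bytes."""
    bits = bytes_read * 8
    return 1 - (1 - rate) ** bits


# Resilvering a 4 x 2 TB RAIDZ1 after one failure must read
# the three surviving disks in full.
surviving_bytes = 3 * 2e12
p = p_at_least_one_ure(surviving_bytes)
print(f"P(at least one URE) ~ {p:.0%}")  # roughly 38% under this spec
```

Elevated, certainly, but nowhere near a guarantee even under the pessimistic spec sheet figure.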
 

M H

Explorer
Joined
Sep 16, 2013
Messages
98
Nonsense. "Increased risk"? Yes, definitely. "Essentially guaranteed"? No, not even close, and there isn't the faintest hint that this is what's happened to OP's pool. We don't need FUD to advise people against RAIDZ1.

Why are all your arrays Z2 then? I'm not saying a failure occurred, just advising in the future that he should be using more parity disks. I lurk the forums regularly, but as soon as I post anything, I'm quickly reminded as to why everyone avoids the attitude on these forums.
 
Last edited:

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Why are all your arrays Z2 then?
Because it's good practice. "Avoid RAIDZ1" is good advice. "You essentially guaranteed an unrecoverable read error at those sizes when trying to resilver" isn't; it's FUD--yes, the risk is increased, but it isn't anything close to "essentially guaranteed". It's like the "you need ECC with ZFS because of the Scrub of Death." In both cases, it's using bad reasoning (greatly exaggerating the risk of a problem) to promote a good thing.
 

M H

Explorer
Joined
Sep 16, 2013
Messages
98
Because it's good practice. "Avoid RAIDZ1" is good advice. "You essentially guaranteed an unrecoverable read error at those sizes when trying to resilver" isn't; it's FUD--yes, the risk is increased, but it isn't anything close to "essentially guaranteed". It's like the "you need ECC with ZFS because of the Scrub of Death." In both cases, it's using bad reasoning (greatly exaggerating the risk of a problem) to promote a good thing.

I'm not trying to scare anyone. I'm simply saying that RAID5 and Z1 are dead with today's drive sizes. That's all. And again, not using ECC is a big no-no with ZFS for the reason you stated. IMO you're downplaying the risks in both cases.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
And again, not using ECC is a big no no with ZFS for the reason you stated.
If "the reason (I) stated" is the scrub of death, it's a myth.
IMO you're downplaying the risks in both cases.
I'm downplaying the FUD-level risks often presented here, that (1) you're guaranteed a read error on resilver (false), which will destroy your pool (also false), if you use RAIDZ1 with > 1-2 TB disks; and (2) the Scrub of Death(tm) will destroy your data if you use ZFS without ECC RAM. They simply aren't true. Yes, RAIDZ2 is a good practice, and strongly preferred over RAIDZ1. Yes, ECC RAM is a good idea. I routinely recommend them, and I put my money where my mouth is. But the degree of risk is grossly overstated by some folks here, and I don't generally let that go without addressing it.
 

M H

Explorer
Joined
Sep 16, 2013
Messages
98
I'm genuinely curious now. How wouldn't a stuck memory bit trash a pool during a scrub?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
And again, not using ECC is a big no no with ZFS for the reason you stated.
ZFS is no worse than any other filesystem out there when it comes to non-ECC RAM. Thing is, if you want ZFS, you probably want ECC, too. It just makes sense to have that extra reliability and correctness.

As for RAIDZ levels, RAIDZ1 is in a weird spot, because its reliability is poor and so is its performance. Simple mirrors are about as reliable as RAIDZ1, but benefit from better performance and just having a more favorable data/redundancy ratio. RAIDZ2 and 3 achieve significantly better reliability at little loss of performance and really make RAIDZ1 a poor choice in most cases.
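The relative reliability of these layouts can be illustrated with a simple, admittedly naive independent-failure model. The 3% annual per-disk failure probability below is a hypothetical figure, and the model ignores resilver time and UREs, so only the relative ordering is meaningful:

```python
from math import comb


def p_vdev_loss(n_disks, parity, p_disk):
    """P(more than `parity` disks fail in a year) for an n-disk vdev,
    assuming independent failures -- a deliberate simplification."""
    return sum(comb(n_disks, j) * p_disk**j * (1 - p_disk) ** (n_disks - j)
               for j in range(parity + 1, n_disks + 1))


p = 0.03  # hypothetical annual failure probability per disk
print(f"2-way mirror : {p_vdev_loss(2, 1, p):.2e}")
print(f"4-disk RAIDZ1: {p_vdev_loss(4, 1, p):.2e}")
print(f"6-disk RAIDZ2: {p_vdev_loss(6, 2, p):.2e}")
```

In this static model the mirror comes out somewhat ahead of the 4-disk RAIDZ1 and RAIDZ2 ahead of both; factoring in the mirror's much faster resilver and the vdevs' different capacity ratios is what makes mirrors and RAIDZ1 roughly comparable in practice.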

I'm genuinely curious now. How wouldn't a stuck memory bit trash a pool during a scrub?
After a few errors, ZFS will give up on a disk. Once it gives up on enough disks, the pool goes UNAVAIL and nothing else happens. So, some corruption is possible. Thing is, you need a "favorable" combination of memory errors and physical memory locations used by ZFS for something to actually be "corrected" into a wrong state.
Chances are, if your memory is bad enough to completely make ZFS go haywire, the system will have panicked well before any sort of I/O is accomplished.
 

M H

Explorer
Joined
Sep 16, 2013
Messages
98
ZFS is no worse than any other filesystem out there when it comes to non-ECC RAM. Thing is, if you want ZFS, you probably want ECC, too. It just makes sense to have that extra reliability and correctness.

As for RAIDZ levels, RAIDZ1 is in a weird spot, because its reliability is poor and so is its performance. Simple mirrors are about as reliable as RAIDZ1, but benefit from better performance and just having a more favorable data/redundancy ratio. RAIDZ2 and 3 achieve significantly better reliability at little loss of performance and really make RAIDZ1 a poor choice in most cases.


After a few errors, ZFS will give up on a disk. Once it gives up on enough disks, the pool goes UNAVAIL and nothing else happens. So, some corruption is possible. Thing is, you need a "favorable" combination of memory errors and physical memory locations used by ZFS for something to actually be "corrected" into a wrong state.
Chances are, if your memory is bad enough to completely make ZFS go haywire, the system will have panicked well before any sort of I/O is accomplished.
Thank you very much for this explanation. So you still have to be incredibly unlucky for the corruption to occur in a pool-trashing location.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
For most errors (single-bit errors), I'd expect transient errors that end up being compensated by parity, leading to things just kinda sorting themselves out.

Note that this is all fairly hand-wavy and wishful thinking, so ECC is still the proper way to go if you care about your data.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'm genuinely curious now. How wouldn't a stuck memory bit trash a pool during a scrub?
The link I gave lays it out in more detail, but in short, this is what would have to happen during a scrub:
  • ZFS reads good data from your disk, along with its checksum.
  • Due to bad RAM, either the data or the checksum is read incorrectly, such that they don't match.
  • ZFS goes to parity/redundancy for good data to replace the data it thinks it read incorrectly.
  • Due to bad RAM, it reads both the data and its checksum incorrectly, in such a way that the bad data and the bad checksum match. This isn't quite as astronomically unlikely as the link suggests, as SHA256 isn't used by default, but it's still extremely unlikely.
  • Since the bad data and the bad checksum match, ZFS thinks the data's good, and overwrites the good data it initially read.
It's an extraordinarily unlikely sequence of events. If you want to make it even more unlikely, you can tell ZFS to use SHA256 or another crypto-secure hash.
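The checksum step in that sequence is easy to demonstrate directly. A single flipped bit in a block changes its digest completely, so the scrub detects the mismatch; for the "scrub of death" to proceed, RAM would also have to corrupt the stored checksum into exactly the new value. SHA-256 is used here purely as the example hash:

```python
import hashlib

# A block of data ZFS is scrubbing, and its checksum as stored on disk.
block = b"some file data ZFS is scrubbing " * 100
good_sum = hashlib.sha256(block).digest()

# Simulate a stuck memory bit: flip one bit of the data after reading it.
corrupted = bytearray(block)
corrupted[0] ^= 0x01

bad_sum = hashlib.sha256(bytes(corrupted)).digest()
# The single flipped bit yields a completely different digest, so the
# mismatch is detected rather than "corrected" onto good data on disk.
print(good_sum != bad_sum)  # True
# For the bad data to be accepted, the stored checksum would have to be
# corrupted into exactly bad_sum -- roughly a 2**-256 shot for this hash.
```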

As to a URE on resilver trashing the pool, ZFS protects against that too, since it's both the filesystem and the volume manager (and therefore knows which sectors on the disk are being used for what). If there's a URE in data, and there's no redundancy or parity to fix it, the affected file is damaged, but the pool is fine. If it's in metadata, ZFS keeps at least two, and up to six, copies of all metadata, before accounting for parity or redundancy--and it's all checksummed as well, so ZFS knows if it's getting good data or bad.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If you want to make it even more unlikely, you can tell ZFS to use SHA256 or another crypto-secure hash.
On 11.0 and newer, you'd probably want skein or SHA512, since they're faster on 64-bit CPUs. SHA512 is a straightforward improvement over SHA256 and skein was one of the finalists for the SHA-3 standard, and was designed to be a bit faster on 64-bit CPUs - which it is, compared to SHA512, so it's probably the better option.
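Switching a dataset to one of these stronger checksums is a one-line property change. The pool/dataset name below is hypothetical, and skein additionally requires its feature flag to be enabled on the pool:

```shell
# Hypothetical dataset name; only newly written blocks use the new checksum.
zfs set checksum=sha512 tank/data
# Or, where the org.illumos:skein feature flag is enabled on the pool:
zfs set checksum=skein tank/data
```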
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Of course, there's a much simpler way that non-ECC RAM can trash your data, and that's by corrupting it before it's written to disk. That's by no means unique to ZFS, but ZFS isn't immune to it either.
 