Data recovery in failed drive scenario

thalin · Nov 4, 2023

Hey folks,

I am in a not great situation where I have a raidz1 pool which has a failed disk. Normally this wouldn't be a problem - just replace the disk with a spare, let the pool resilver, and move on with life. However, in this situation that process didn't work out as well as I had hoped.

Basically, I aready had hot spares and as far as I knew the spare had been added to the pool and the pool had resilvered, but somehow I ended up with the "failed" disk still online in the pool. This persisted for quite some time, but then another disk died, and the pool wouldn't import due to "I/O error" and zfs told me to destroy and recreate the pool from a backup. Of course, I do not have a backup (yes, lesson learned, that mistake won't be repeated, please do not waste everyone's time telling me I should have had a backup). I had already bought some new much larger disks, and was in the process of copying data off to the new pool, but I didn't get finished doing that before this disaster struck. :(

Anyway, I would still like my data back, so I took images of all of the drives in the failed pool and found the drive that was toast (as it wouldn't come online at all, so I could not make an image of it), and sent it off to a data recovery service hoping they could image the disk and send it back to me, whereupon I could simply mount the drive images in a VM and mount the pool and recover the data.

I got the image back and it looks correct as far as I can tell, but I still get the same I/O error when trying to import the pool. Is there any way I can recover this data or am I out of luck?

Here are a few more details from trying to zpool import:

Code:

thalin@recovery:~$ sudo zpool import
   pool: stor
     id: 14530119514345764826
  state: ONLINE
status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
    the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:

    stor                                       ONLINE
     raidz1-0                                  ONLINE
       5428dd92-88a5-11e8-9e0e-14feb5c73810    ONLINE
       59c71c4b-88a5-11e8-9e0e-14feb5c73810    ONLINE
       spare-2                                 ONLINE
         70112329-45fb-4468-b96d-eca86e60e09f  ONLINE
         38ccd909-913f-11e8-8547-14feb5c73810  ONLINE
       650a4922-88a5-11e8-9e0e-14feb5c73810    ONLINE
       6abbada0-88a5-11e8-9e0e-14feb5c73810    ONLINE
       7058c7c5-88a5-11e8-9e0e-14feb5c73810    ONLINE
     spares
       38ccd909-913f-11e8-8547-14feb5c73810
thalin@recovery:~$ sudo zpool import -f stor
cannot import 'stor': I/O error
    Destroy and re-create the pool from
    a backup source.

I am also aware of a Windows recovery tool called ReclaiMe Pro which I have an evaluation license for and am letting it chew through looking for ZFS data at the moment but it's really, really slow and I don't know that I trust it to resolve the problem either, so any pointers here at all would be very helpful.

Thanks for any help or advice anyone may be able to provide! Please let me know if you have any further questions or if there's any information I can provide which might help.

Heracles · Nov 4, 2023

thalin said:
a raidz1 pool which has a failed disk

thalin said:
I aready had hot spares and as far as I knew the spare had been added to the pool and the pool had resilvered

thalin said:
but then another disk died

So it looks like you did not configured monitoring and alert to be warned when a disk fails and the spare is engaged.

Redundancy without monitoring and reaction is as good as no redundancy. Slowly but surely, you will burn through all your redundant layers until the operations fail when next failure happen and there is no more redundancy.

As such, running Raid-Z1 without monitoring is as good as a single drive pool. Think about it : should you had a Raid-Z2 vdev instead and that one would have survive that second lost. Would you have noticed ? Would you have replaced the drives in time ? No. You would be on the edge until your next failure and would end up in the same place just a little later.

Also, a 6 drives-wide Raid-Z1 is not recommended at all. The probability to get more than 1 failed drive at a time is way too high.

And on top of that, Raid-Z1 is not recommended anymore. Once you got a failed drive, there is no more redundancy to protect the data while rebuilding. Any single error will be unrecoverable. Also, in order to rebuild a single drive, all of your 5 drives must be scanned. So you need to read 500% of your drive capacity without a single error happening because there is no more redundancy to detect and fix it. Considering that today almost all drives are bigger than 1T, Raid-Z1 is not appropriate for any drives.

So we are now up to 4 No-Go here :
No-Go #1 = Raid-Z1 (not enough redundancy)
No-Go #2 = Too many drives in Raid-Z1 (too high a probability for multiple failure)
No-Go #3 = Drives too large for Raid-Z1 (too high probability for errors during re-silvering)
No-Go #4 = Un-monitored systems / drives

And because you asked for, I will not mention No-Go number 5.

thalin · Nov 4, 2023

Thanks for your opinions, but this is entirely unhelpful commentary. I came here for advice on how to recover my data, not a 6-paragraph scolding which really boils down to only a couple of points of actual advice - have monitoring enabled and don't use raidz1. That said, I guess I will respond to both of those:

So it looks like you did not configured monitoring and alert to be warned when a disk fails and the spare is engaged.

I did in fact have monitoring enabled (hence why I had the spare resilvered into the array already). The drive that failed later did so without warning, and monitoring did not help me in any way. I got an alert that a disk failed, I turned off the machine to add a spare and resilver, and when I turned the machine back on the pool would not come back online. Monitoring did not help me here.

Also, a 6 drives-wide Raid-Z1 is not recommended at all. The probability to get more than 1 failed drive at a time is way too high.

I have not seen this documented anywhere. I built this particular pool several years ago, and as far as I knew then, raidz1 was fine and not really considered problematic. I have definitely heard more recently that drives larger than about 8TB are not good for raidz1, but this is the first time anywhere that I've seen someone say that raidz1 is unsuitable for larger than 1TB drives. The drives in question are 5TB.

Do you have any suggestions for how I can recover my data?

Ericloewe · Nov 4, 2023

Once you're in I/O error during import territory, things get rather tricky. Given the steps so far, I would not expect any luck with relatively standard procedures, but there's a non-zero chance that zdb can help. However, getting started with it is non-trivial and it's likely that your problem is not recoverable. Ideally, you'd want someone with experience using zdb to help you out...

Redcoat · Nov 4, 2023

There is a piece of recovery software for ZFS pools called Klennet - see https://www.klennet.com/zfs-recovery/ It has a free evaluation version.

As Eric did not not mention it in the post above, I suspect it may fall into the "relatively standard procedures" category to which he referred...

I have no personal experience with it - there has been the occasional +ve report here in the forums.

thalin · Nov 4, 2023

I'm trying out Klennet already, so I'll see if that finds anything and if so I'll cough up the dough to buy a license. Thanks!

Heracles · Nov 4, 2023

thalin said:
but this is entirely unhelpful commentary

From where you are, learning about what went wrong is basically the only help we can provide you.

You crossed way too many lines and you must now pay the price for that. The same way people have to when they used hardware Raid, when they present TrueNAS with a virtual drive from their hypervisor and more.

thalin · Nov 4, 2023

Heracles said:
From where you are, learning about what went wrong is basically the only help we can provide you.

You crossed way too many lines and you must now pay the price for that. The same way people have to when they used hardware Raid, when they present TrueNAS with a virtual drive from their hypervisor and more.

The thing is, you haven't actually given me any useful information about what did go wrong. I had monitoring. I had a fully healthy pool as far as I knew. Only one drive failed. That shouldn't have been enough to kill the pool.

If you could explain to me why losing one drive in the pool caused the pool to be inoperable, that might be useful!

So, the pool was (as far as I could tell) in a fully healthy state - a drive had had errors and a new spare had been resilvered to replace that drive, but I hadn't ever removed that original error drive from the pool - why did that leave the pool in a vulnerable state? If you could explain that, it would be helpful!

PhilD13 · Nov 4, 2023

What I see (I think) is the system you designed RaidZ1 with a hot spare worked as you designed it to. A drive failed and the hot spare took it's place. What did not happen is that you did not replace the failed disk which still resided in the pool as a failed disk and then detach the hot spare from use. Think of the spare as one of those car doughnut temp tires for temporary use until you can get a new real tire or in the case of drives, a new drive so the failed drive can be actually physically replaced. Since you lost a second drive in RaidZ1, ZFS thinks you lost 2 drives from a Raid Z1 pool where only one loss is acceptable for that type of pool.

winnielinnie · Nov 4, 2023

PhilD13 said:
What did not happen is that you did not replace the failed disk which still resided in the pool as a failed disk and then detach the hot spare from use.

Regardless, a RAIDZ1 vdev missing a single member should still be importable in a degraded state.

@thalin, what about using the -Fn flag to see if the pool can possibly be imported?

Ericloewe · Nov 4, 2023

Redcoat said:
There is a piece of recovery software for ZFS pools called Klennet - see https://www.klennet.com/zfs-recovery/ It has a free evaluation version.

As Eric did not not mention it in the post above, I suspect it may fall into the "relatively standard procedures" category to which he referred...

I have no personal experience with it - there has been the occasional +ve report here in the forums.

Have you heard of anyone ever who recovered one byte out of a dead-ish pool with that thing? I don't think I have...

PhilD13 said:
Since you lost a second drive in RaidZ1, ZFS thinks you lost 2 drives from a Raid Z1 pool where only one loss is acceptable for that type of pool.

Surely not, for starters ZFS would say something like "insufficient replicas", not to mention that the import preview shows no issues.

Redcoat · Nov 4, 2023

Ericloewe said:
Have you heard of anyone ever who recovered one byte out of a dead-ish pool with that thing? I don't think I have...

The simple answer is "NO" - I don't think I ever saw a detailed granular description of either a recovery or a failure. I do believe, though, that I read of both apparent successes and failures.

thalin · Nov 4, 2023

winnielinnie said:
Regardless, a RAIDZ1 vdev missing a single member should still be importable in a degraded state.

@thalin, what about using the -Fn flag to see if the pool can possibly be imported?

I will give that a shot after Klennet finishes crunching, which might take a while (probably another day or so at least, maybe 2-3).

Ericloewe · Nov 4, 2023

I just realized you didn't mention which version you're using... Recent-ish versions have important improvements and big fixes that address some of the weird edge cases when importing pools. If you haven't, definitely try the latest version of TrueNAS or something with a very up-to-date ZFS.

thalin · Nov 4, 2023

Ericloewe said:
I just realized you didn't mention which version you're using... Recent-ish versions have important improvements and big fixes that address some of the weird edge cases when importing pools. If you haven't, definitely try the latest version of TrueNAS or something with a very up-to-date ZFS.

Ah, good point. I didn't go into any details as to what I'm doing to attempt to import the pool or how I'm going about my recovery efforts - maybe y'all can go over this and let me know if there are better or more efficient ways to do this.

First of all I'm using TrueNAS Scale 22.12.3.3 on a Supermicro H12SSL-i with a 7402p processor and 256GB of RAM, along with an LSI 9305i raid card (though I probably ought to just use the motherboard SATA ports since they're just directly wired into the CPU on this board as far as I can tell). I know there's been a TrueNAS update or two since then, but I haven't gotten around to applying them yet as aside from this situation I have a lot of other stuff going on in my life that means I can't spend as much time fiddling with stuff right now as I would like.

Currently I have all the disk images as simple dd images on the larger zpool I mentioned earlier (with a snapshot from before I started any recovery efforts, just in case I screw something up). Since some of the data on the old pool has already been copied to the new pool, I think I have enough room for the disk images and the data I want to copy out of them. I have these images mounted in VMs in TrueNAS, one Windows 10 for ReclaiMe Pro/Klennet and one Debian 12.

The Debian image is what I've been using to try to mount the pool directly - but maybe I should be using a virtualized TrueNAS instead. I had thought about that, but hadn't actually given it a shot yet. I know the general recommendation is to not virtualize TrueNAS, but since this is just a recovery effort I don't think it probably matters much. I'm not going to be writing any new data to these. I don't have the disks available right now to do this bare metal anyway, so it's kind of VMs or nothing.

So, some questions:

Would it better or faster in some way to convert the images to zvols instead of raw files for use in the VMs? I know TrueNAS (at least Scale, I assume Core as well) does everything in its VM system with zvols, but it was easier at the time for me to just dd into image files at the time, and I honestly didn't think of using zvols until later anyway. Are zvols more performant or reliable for this purpose? I assume so, but it would be nice to have some outside confirmation. Is it worth the effort to convert my image files to zvols?
Is it worth trying TrueNAS for recovery instead of Debian? I am only running one recovery VM at a time, so I can't really try this until I let Klennet finish what it's doing (and if that works I'll probably report back that it does and drop this topic so it will be moot anyway).

Thanks for your help, folks!

Ericloewe · Nov 5, 2023

thalin said:
Is it worth trying TrueNAS for recovery instead of Debian? I am only running one recovery VM at a time, so I can't really try this until I let Klennet finish what it's doing (and if that works I'll probably report back that it does and drop this topic so it will be moot anyway).

Not strictly, if you use the latest version of ZFS with Debian, that should be broadly equivalent to the latest TrueNAS.

Heracles · Nov 5, 2023

thalin said:
you haven't actually given me any useful information about what did go wrong.

I gave you just 5 of them... So let looks at the rest of your setup to find more... Is your system using ECC ? What is the HD controller and how is it configured ? When was the last time you ran preventive tests (SMART, long and short ; memtest ; ...) ? How often did you scrub your pool ?

Errors are to be expected in a system and this is why it is important to design and operate it expecting and dealing with these errors. You did not.

What are the bits and bytes of the last extra error that pushed you over the cliff ? It is not visible from what you posted. In all cases, from what you posted, it looks like it is too late and the recovery of the last failed drive is the last hope (already in progress according to you).

So best thing to do now is to learn about all of these mistakes and fix ALL of them, not just the single one that ruined your pool this time. The reason is even should you find that single error, all the others I pointed you will do the same if you do not fix them as well.

chuck32 · Nov 5, 2023

PhilD13 said:
What I see (I think) is the system you designed RaidZ1 with a hot spare worked as you designed it to. A drive failed and the hot spare took it's place. What did not happen is that you did not replace the failed disk which still resided in the pool as a failed disk and then detach the hot spare from use.

I tried to find some documentation on that and I'm not sure I understand the workflow / understand the reasoning behind it:

1) Drive in pools fails
2) Hot spare kicks in (needs resilvering of the pool)
3) replaced failed drive (another round of resilvering?) so the hot spare goes back to being an available hot spare

Rather I'd expect that the hot spare replaces the failed drive and you replace / add another hot spare after removing the failed drive. In this case you would save 1 round of resilvering.

thalin said:
I did in fact have monitoring enabled (hence why I had the spare resilvered into the array already). The drive that failed later did so without warning, and monitoring did not help me in any way. I got an alert that a disk failed, I turned off the machine to add a spare and resilver, and when I turned the machine back on the pool would not come back online. Monitoring did not help me here.

This sounds like a workflow without data loss to me:

First drive fails, hot spare takes over, second drive fails (no hot spare available) and machine is taken offline and a replacement disk is added. This reads like there was always a maximum of 1 failed drive at any give time.

Excuse the unhelpful comment but I myself and maybe future readers would like to have a better understanding of the failure so it can be avoided.

Heracles · Nov 5, 2023

Error may have happened during the re-silvering process. Because the structure was Raid-Z1 and one drive was completely dead, there was no redundancy to detect and fix any error.

If an error happened during the re-silvering process (non-ECC memory, flipped bit more or less recently depending of the scrub, ...) and that error happened to be in a critical part of ZFS' structure, then it can be enough to toast the pool.

We keep telling people about how Raid-Z1 will not protect them as they think it does. I think this is just a perfect example of this.

In the same way, we keep telling people how backups are not optional. Again, we have the evidence here.

thalin · Nov 6, 2023

Hey folks, some cautiously optimistic news here - it seems like Klennet is seeing files so far. I'm still waiting on it to work through the last few of its steps, so I'll update again once it has finished processing. Unfortunately I'm moving in a week so I think I'm going to have to wait until I get to the other side to actually recover the data, but signs seem good at the moment...

@Heracles I'm sure you'll be delighted to know that my new pool is raidz-2, the machine still has ECC, and the pool is scrubbed weekly. I'll be sure to set up some SMART online tests monthly or something.

Important Announcement for the TrueNAS Community.

Data recovery in failed drive scenario

Cadet

Wizard

Cadet

Server Wrangler

MVP

Cadet

Wizard

Cadet

Patron

MVP

Server Wrangler

MVP

Cadet

Server Wrangler

Cadet

Server Wrangler

Wizard

Guru

Wizard

Cadet

Similar threads