Data recovery in failed drive scenario

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Good that you can recover at least some of your data, and good to know that your next pool will be more robust. Be sure to also do your backups and to test the restore procedure: a backup that has never been restored is not yet a functional backup.

With a more robust system, probability of failure will be reduced.
With functional backups, damage associated with an actual failure will be reduced.

With probability and damage both reduced significantly, your risk level will drop drastically, so your data will be much much safer.
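For those who want a concrete starting point, here is a minimal sketch of a snapshot-and-replicate backup plus a restore test. The pool, dataset, and host names (tank, backup, backuphost) are hypothetical; adapt them to your own setup:

  # Take a recursive snapshot and replicate it to a second machine
  zfs snapshot -r tank/data@weekly-2024-01
  zfs send -R tank/data@weekly-2024-01 | ssh backuphost zfs receive -u backup/data

  # Periodically test the restore by receiving the backup into a scratch
  # dataset and spot-checking a sample of files against the originals
  ssh backuphost zfs send backup/data@weekly-2024-01 | zfs receive tank/restore-test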
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Any single error will be unrecoverable
True of RAID, not true of ZFS. You would most likely just lose a file, as long as you have drives supporting TLER et al.
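When that happens, ZFS flags the affected files rather than aborting the rebuild; you can list them afterwards. The pool name and file path below are just an example:

  zpool status -v tank
    ...
    errors: Permanent errors have been detected in the following files:
            /mnt/tank/media/video01.mkv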
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Considering the level of understanding of those who keep insisting on using RAID-Z1, I did not consider it beneficial to start going into details like the multiple copies of some ZFS blocks that are kept by default, or the option to increase all of them with the copies=N property.

It would let them believe that despite the warning, they are protected.
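For the record, the knob in question is the copies dataset property. A quick sketch with hypothetical pool/dataset names; note that it only applies to data written after the change and is no substitute for real redundancy:

  zfs get copies tank/important      # user data defaults to 1 copy; metadata already gets extra copies
  zfs set copies=2 tank/important    # store 2 copies of every data block written from now on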
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Only with a backup (ideally multiple) are you actually protected. I don't care if you have raidz100 (if there were such a thing, and of course there isn't). Losing a pool can be mostly painless with the right (and tested) backups.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Only with a backup (ideally multiple) are you actually protected.
Backups are the last line of defense. They are absolutely essential, and I repeat it all the time, including in this thread. No single TrueNAS server, no matter how robust it is, can be more than a single point of failure. This is also why I have my signature and often refer people to it.

Still, the fact that you have one more line of defense behind you does not mean you are justified in neglecting the other things in front of it.

As an example, I once failed my annual restore test. I understood why, and the root cause dated back about 6 months before that test. As such, I had no valid backups for that period. I fixed the problem and managed to restore a new backup after that, but still, even I ended up without a proper backup for a while.

Fortunately, thanks to my multiple layers of defense, I did not have to fall back to that very last level of protection at the time. But had these backups been my only well-designed line of defense, I may well have lost some data.
 

thalin

Cadet
Joined
Nov 4, 2023
Messages
8
Hey folks,

Maybe final update here. I was able to get back all of the data I cared about using Klennet - except for all of the zvols for VMs, which Klennet thought were all bad. This makes sense, since they were probably being written to when the pool died. Luckily most of the stuff running in these VMs was also running in Docker, with the configuration stored as NFS mounts back into the datasets on the pool. I probably lost some stuff, but got back the vast majority of what I was worried about.

Anyway - I learned that I probably didn't even need to get the disk repaired: I think Klennet recovered the data without using two of the disks (including the repaired one). So that was probably a waste of money, unfortunately. I will be getting the original media back from the recovery place in the next week or two, so I will probably try to get the pool back up with the original disks and see if I can get the zvols off of them; we'll see whether it was a total waste or not. If I can, I will report back just to provide info, but if not I won't.

Summary: Klennet worked like a champ. Highly recommend. Its UI is a pain, and if you have a lot of datasets with a lot of writes you're going to have trouble finding your most recent data, but in my case, where most of my important-to-me data was in one dataset with few writes, it was an easy recovery.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Interesting data point, thanks for sharing.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Glad you got your important data back. Thanks for the report. Put your zvols on a separate mirror pool rather than on the (now safer) raidz2 pool for mass storage, and all should be much better.
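For illustration, a minimal sketch of that layout; the pool and device names (fast, ada1, ada2) are hypothetical:

  # Small mirrored pool dedicated to VM storage
  zpool create fast mirror /dev/ada1 /dev/ada2
  # 50 GB zvol for a VM disk on the mirror, leaving the raidz2 pool for bulk data
  zfs create -V 50G fast/vm-disk1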
 

Glitch01

Dabbler
Joined
Aug 24, 2022
Messages
26
Glad to hear you got your data back. I'm in the process of scanning for mine using Klennet. No other software got close to recovering the data from my deleted pools. It did take about 1.5-2 days to scan the drives after a mishap in which, following a motherboard/processor swap and a virtualization OS install, the pools were deleted and the drives striped into a ZFS RAID0 array. Unfortunately, I had no up-to-date off-site backup or clones of the data (that will be the first thing I put in place after the recovery completes). Cheers, I hope others learn from our mishaps and make duplicate off-system backups. If nothing else, we can attest that Klennet works for recovering data from deleted pools. Note: the pools were deleted, but a full disk wipe was not performed in my case.
 

Glitch01

Dabbler
Joined
Aug 24, 2022
Messages
26
A few notes I wanted to add regarding Klennet. As of January 2024, Klennet is priced at $399 USD plus applicable taxes. The software runs only on Windows, is fully operational as a demo for scanning purposes, requires internet access, and a purchase allows use of the software for one full year.

It is recommended that you scan only one pool at a time on the same system, as an auto-save of the progress occurs at the end of stage 3 of 6. Running multiple scans will overwrite the autosave file containing the metadata scan progress for a drive set and will affect file recovery for that drive set. You can run multiple scans to save time and rename the autosave file before it is overwritten by another scan, but you must then end the process for that scan, since stages 4, 5, 6 and the recovery itself reference the autosave. Once all 6 stages of the scan are complete, you can save the progress to an exported file.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
So, slightly janky but usable.
 

Glitch01

Dabbler
Joined
Aug 24, 2022
Messages
26
So, slightly janky but usable.
Unfortunately, I couldn't find any other software capable of recovering deleted ZFS pools, and I'm not skilled in the area of disk data recovery. The best option is to have up-to-date off-system copies of your files, which I'll be implementing after recovery. My last off-system copy of the data is 6 months old and no longer relevant.
 

Glitch01

Dabbler
Joined
Aug 24, 2022
Messages
26
Here are the stages of the scan:
  1. Disk scan, first pass
  2. Disk scan, second pass
  3. Disk order analysis (the autosave is written after this is done)
  4. Object set naming and encryption processing
  5. Object set analysis
  6. Checksum verification
You have the option to skip stages 4, 5, and 6, but stages 4 and 5 are where the actual data is pieced together for copying over to other media. If you skip to the final screen, you do have the option to resume the scan where it left off, but the resume takes a good amount of time (about 30 minutes) to process. I'm currently scanning 6x 14 TB 7200 rpm drives (the software gives you stats on each drive during the scan) and am on day 3, with stage 5 roughly 40% complete. The stage 6 verification seems to move along much faster than the rest of the stages in my testing.

Hope this helps and gives an idea of the time frame you'll be looking at.
 

Glitch01

Dabbler
Joined
Aug 24, 2022
Messages
26
o_O This has not been a fun experience. I recommend scanning from a fresh boot, as continuing from a previous scan can consume additional RAM still allocated by earlier scans. The system I'm scanning from has 256 GB of ECC memory. The application displays CPU and RAM usage, disk reads, and disk time. I continued scans from various testing I had done on the system, and the application consumed all the available RAM. At 78% of the stage 5 scan, the system ran out of memory and the application posted "Out of Memory". Once you acknowledge the message, the application closes and you lose any progress since the autosave created after stage 3.

If you see the RAM usage nearing or at the system's available RAM, you can skip the remaining scanning in stages 4, 5, or 6 to get to the result screen, where you can save the current scan progress. You can then load the saved progress should anything go wrong, and it will continue scanning from the saved point.
 

Glitch01

Dabbler
Joined
Aug 24, 2022
Messages
26
Update - stage 4 of the scan completed this morning, consuming 220 GB of RAM. I skipped to the end and saved the progress. Resuming the scan directly would have kept the RAM reserved, so I closed out the application and loaded the saved progress, and it resumed scanning in stage 5; closing the application freed up the reserved RAM. Stage 5 will probably take another full day; I'm thinking I'll be able to copy files tomorrow, after which I get to repeat the process two more times before being done :grin:
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ok, very janky. Extremely janky for a relatively simple case; from your description, it's pretty close to something that zpool import -D could recover from.
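For reference, a sketch of that path with a hypothetical pool name:

  zpool import -D                             # list destroyed pools that are still importable
  zpool import -D -f tank                     # attempt to import a destroyed pool named tank
  # in stubborn cases, a read-only rewind import is sometimes worth a try
  zpool import -D -f -F -o readonly=on tank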
 

Glitch01

Dabbler
Joined
Aug 24, 2022
Messages
26
Wouldn't a successful export of the pool have to be completed before an import? The pools were deleted when an OS installation created a ZFS RAID0 that striped all the drives in the system. The seeds were backed up in the config file, which was imported successfully; the disks were identified for the pools, but the pool statuses showed offline and the pools were inaccessible. If you have some reference material you could point me toward, I'd love to learn more about data recovery of ZFS datasets. Thanks
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Wouldn't a successful export of the pool have to be completed for an import?
No, but it would need to have minimal damage (i.e. only zpool destroy was run).
The pools were deleted from the creation of a ZFS raid0 through an OS installation that striped all the drives on the system. The seeds were backed up in the config file which was imported successfully; the disks were identified for the pools, but the pool statuses showed offline and were inaccessible.
The sequence of steps is unclear, but if you're interested, start a new thread and explain the situation step-by-step with as much detail as possible. You can ping me so I don't miss it.
If you have some reference material you could point me toward, I'd love to learn more about data recovery of ZFS data sets. Thanks
Unfortunately no. Much of ZFS's design revolves around not letting it get that far, which does make recovery difficult if you have to piece things back together without the whole structure.
 