jgreco
Resident Grinch
The way Proxmox "wiped" the drive
I should write a book, "101 reasons not to invent your own virtualization strategy." Sigh.
The way Proxmox "wiped" the drive
It was kinda a confusing thing to follow. So what happened was: one of my drives disappeared, so I thought it was dead. I bought a new one, took out the old, and slapped in the new. I went to format the new drive, and I accidentally formatted the wrong one. So at that point, there was only one drive that was good, and then I had the new drive and the drive I accidentally wiped. So I took out the new drive and put in the one I thought was dead, and it showed up. Now I have two good drives and one that is wiped. Hope this clears it up. So zpool import isn't lying?

You accidentally "wiped" a good drive in the RAIDZ1 vdev of a pool that was already degraded.

I'm not sure why "zpool import" is suggesting that two drives are available, when in reality you only have one available.
First...
RAIDZ1 (HEALTHY):
Drive A - good
Drive B - good
Drive C - good
Then...
RAIDZ1 (DEGRADED):
Drive A - failed
Drive B - good
Drive C - good
Then...
RAIDZ1 (DEAD):
Drive A - failed
Drive B - wiped <--- the mistake you made when wiping via Proxmox
Drive C - good
What's throwing everyone off is that "zpool import" without any flags suggests that two drives are healthy and available, which implies you can import the pool in a degraded state.
Yet this is not true.
Why is "zpool import"... lying?
My super amateur low-IQ shot in the dark: The way Proxmox "wiped" the drive perhaps just erased a portion of it at the start of the drive? Yet there still remains zpool metadata at the end of the drive, which zpool import detects?
Honestly, this is getting into low-level stuff, and I'm just shooting in the wind.
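That hunch actually lines up with how ZFS lays the labels out on disk: there are four 256 KiB copies of the vdev label, two at the front of the device and two at the very end, so a wipe that only zeroes the start of the disk leaves L2/L3 intact for "zpool import" to find. A small sketch of the offsets (the 4 TB size below is a hypothetical stand-in for your drive):

```shell
# Where ZFS keeps its four 256 KiB vdev labels on a device:
# L0/L1 at the front, L2/L3 at the back. Zeroing the start of the
# disk misses L2/L3, which "zpool import" can still read.
DISK_SIZE=4000787030016          # bytes; hypothetical "4 TB" drive
LABEL=262144                     # 256 KiB per label copy

L0=0
L1=$LABEL
L2=$((DISK_SIZE - 2 * LABEL))
L3=$((DISK_SIZE - LABEL))

echo "L0=$L0 L1=$L1 L2=$L2 L3=$L3"
# -> L0=0 L1=262144 L2=4000786505728 L3=4000786767872
```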
I would totally run it on bare metal, but it is just for my home server, and I don't want to spend money on 2 servers. It's been working flawlessly until now. :( I need this server PC to run other VMs like Home Assistant, Pi-hole, and some game servers that I host.
How long did it take to "format" or "wipe" the drive?
Instant.
That's probably why the "wiped" drive is still detected as "available" when you run "zpool import" without any flags.
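A toy demonstration of that point, on a scratch file rather than a real disk (file name and sizes here are arbitrary): an "instant" wipe that only zeroes the start of a device never touches metadata copies stored at the end.

```shell
# Toy demo on a scratch file (NOT a real disk): zeroing the first 1 MiB,
# like a quick format might, misses the copies at the end -- consistent
# with ZFS labels surviving Proxmox's "wipe".
FILE=$(mktemp)
truncate -s 16M "$FILE"                                   # stand-in "disk"
printf 'HEADMETA' | dd of="$FILE" bs=1 conv=notrunc 2>/dev/null
printf 'TAILMETA' | dd of="$FILE" bs=1 \
    seek=$((16*1024*1024 - 8)) conv=notrunc 2>/dev/null

dd if=/dev/zero of="$FILE" bs=1M count=1 conv=notrunc 2>/dev/null  # "wipe"

grep -a -q 'HEADMETA' "$FILE" || echo "front copy: gone"
grep -a -q 'TAILMETA' "$FILE" && echo "end copy: survived"
rm -f "$FILE"
```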
At this point, it might be too late. You may in fact have (accidentally) killed your pool by issuing Proxmox's "wipe" tool against the wrong drive before you did the replacement of the actual failed drive. (See my color-coding in an earlier post of the series of events. I'm only going by what you said in regards to "I wiped the wrong drive.")
I would totally run it on bare metal, but it is just for my home server, and I don't want to spend money on 2 servers.
It's been working flawlessly until now.
:( I need this server PC to run other VMs like Home Assistant, Pi-hole, and some game servers that I host.
Definitely, many lessons were learned over this whole ordeal. I will be taking your advice. I greatly appreciate your help and time through this. If you come up with any other ideas, or know of anyone who might have more ideas, please feel free to post them here. I will leave the system as is for the next couple of days in case that happens. Otherwise, I will bite the bullet and start over.

That's fine, but you need to do it correctly.
That's the thing I hear all the time. It works flawlessly until it suddenly doesn't. That can be either a technical server issue of some sort (PCIe passthru flakes out) or an operational issue facilitated and encouraged by some design error (for you, right now, "I erased the wrong drive with Proxmox", to which the question is, "why the HELL did Proxmox have any access to the drives?")
I say it all the time, but the measure of success is NOT "I got it to do this thing that I wanted to do" but rather "I used a strategy that has considered how to mitigate numerous potential failure modes and has been successfully used by thousands of people". This is intended as constructive criticism because it is coming closer and closer to looking like you will not be recovering your data, so we can look forward to a better design for your next TrueNAS.
Even if one of the other ZFS wizards hanging around here manages to incant magic to fix you, the failure here is bad.
Yes, and maybe Proxmox is okay for that, but you could also run the VMs directly under SCALE as well. SCALE would work well for a handful of VMs. If you're going to run SCALE under Proxmox, you really need to follow the virtualization guidance I posted earlier in this thread. It's a pick-yer-poison sort of decision and I don't have a horse in that race. If Proxmox will be stable on your platform, then you're probably good to go that way if you wish. But then you need to use PCIe passthru for a HDD controller of some sort.
I appreciate the advice. I have already tried this
Something you have not tried is TrueNAS Core. While I do not think you will have better luck, there is a remote chance it will work, very remote. Maybe you can mount the pool. But before you accept that there is no data recovery, give it a try. I'd also just do it on bare metal for the heck of it. You can create a bootable USB flash drive for this and disconnect all drives not part of the pool. I would also remove the "formatted" drive and install the new blank drive. The "formatted" drive is useless in my opinion for data recovery, unless you want to pay big money to have a company recover the data for you, if they can, that is. Again, this is what I'd do, but I try to think outside the box.
Some advice for the future: if you are going to do something risky to a drive, remove the good drives first. Keep the good drives safe. Trust me, we all have had our fair share of mistakes and losing data. My first lesson was at age 16 (back in the 1970s). I had several other lessons after that, unfortunately, but I haven't had one in many years (knock on wood), and I hope to never run into another data loss due to not being careful.
zpool import -D
Hey, @HoneyBadger ... do you have any good recovery suggestions here? This feels like there should be something obvious but I really almost never have to recover ZFS pools, so my Zfu is weak in this area.
Oofda. I think I'm caught up on where things stand right now, I'll see if I can add anything helpful.
@Dawson the issue hopefully is described correctly below:
RAIDZ1 (HEALTHY):
Drive A - good
Drive B - good
Drive C - good
(Obviously this is fine.)
RAIDZ1 (DEGRADED):
Drive A - good
Drive B - failed <--- so I thought
Drive C - good
(At this point ZFS is writing to both Drive A and Drive C, and transaction counts are increasing. Drive B is offline and is NOT increasing its transaction count.)
RAIDZ1 (DEAD):
Drive A - wiped <--- the mistake I made when wiping via Proxmox
Drive D - new drive
Drive C - good
(The pool is now unavailable because of lack of replicas.)
RAIDZ1 (DEGRADED): <---- current state
Drive A - wiped
Drive B - good <-- replaced new drive with old "failed drive" (that didn't actually fail)
Drive C - good
(At this point Drive A is unusable, Drive B is "living in the past" several dozen/hundred transaction groups behind, and Drive C is in the present.)
Drive B and C are both "good" in that they are physically functional and have pieces of a working ZFS RAIDZ1 pool, they just disagree (potentially by a large margin) what time it is, so we're now into applied temporal mechanics.
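One way to quantify that disagreement, once you have zdb -l output saved per disk, is to pull out the txg field each label reports. A sketch below; the extract_txg helper and /tmp paths are mine, and the here-docs are made-up stand-ins that only mimic the txg line of a real label dump, not actual zdb output:

```shell
# Sketch: compare what "time" each disk thinks it is by extracting the
# txg field from saved `zdb -l` output. The here-docs are fabricated
# stand-ins mimicking only the txg line of a real label dump.
extract_txg() {
    grep -m1 'txg:' "$1" | awk '{print $2}'
}

cat > /tmp/driveB.label <<'EOF'
    txg: 4815162
EOF
cat > /tmp/driveC.label <<'EOF'
    txg: 4815342
EOF

B=$(extract_txg /tmp/driveB.label)
C=$(extract_txg /tmp/driveC.label)
echo "Drive B is $((C - B)) transaction groups behind Drive C"
# -> Drive B is 180 transaction groups behind Drive C
rm -f /tmp/driveB.label /tmp/driveC.label
```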
The way I see it we have two methods to attempt recovery. Both of these require an additional brand new 4T drive for optimal safety - you'll use this drive and the previous one you bought to replace the not-failed Drive B as blanks to clone.
1. Preferred Method: Clone A + C to new blank drives. Try to rebuild and restore the quick-formatted partition table/gptid labels on Cloned Drive A, attempt to import with CloneA+CloneC "in the present" and then resilver with Drive B. Hopefully no/minimal data lost.
2. Not-preferred Method: Clone B + C to new blank drives. Attempt to import CloneB+CloneC "in the past" to the point where Drive B fake-failed. Data added since then is discarded.
Let's see what we can make of this. It looks like the zpool.cache file was already deleted (unless you can pull a copy from a backup somewhere?), so let's try getting an SSH session, running zdb -l /dev/adaX for each of ada0/1/2, and putting the output into [code][/code] tags. Identify which disk is which from "Drive A" and "adaX" if you can.
If we've got valid labels saved on Drive A (IIRC, ZFS saves four copies) then we can try the first recovery method. If Proxmox somehow torched all four, then we're limited to the second.
We'll do our best here to help you out.
The reason I'm recommending the additional drives is out of an abundance of caution, the fact that it isn't my data at risk, and that your most recent backup is "rather aged" by your own admission. I'm hoping that we can find a way to get you back to at least a point more recent than that.

Yes, you followed that perfectly. Just to confirm: I will need to buy another new 4 TB? If so, I'll get that ordered today. Also, any cloning and partition rebuilding software you'd recommend? I've never used any sort of drive recovery software. I do have a backup I could restore to that has the old zpool.cache file. This sounds like a solid plan, you are the man! Just one thing: I'm not sure how "non-preferred method #2" is any different from what we've been trying (other than the fact that you'd have me clone the drives to different drives). Or am I not following correctly? THANK YOU for your help!
For cloning: since you have Proxmox as a host OS, which is Debian-based, you could use dd from Proxmox, but again, be absolutely sure you are specifying the correct source and destination disks. If you dd an entire empty disk onto a good one, there's no walking that back.

For the partition table, you can gpart list the table from one disk and then manually create it on the other, ditto the label edits. If zdb finds labels on Disk A, that's a good sign though.

As for method #2: the difference is that we've already tried importing with the -FX switch, so we're at the point of going manually spelunking for older transaction groups with zdb and then trying to import the pool at that time using -T txg.
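If you've never driven dd before, you can rehearse the clone step safely on scratch files first; the command shape is identical to cloning one disk onto another (device paths on the real system would be your actual disks, and cmp here stands in for a post-clone verification):

```shell
# Rehearsal of a dd clone on scratch files -- same command shape as
# cloning one disk onto another, with zero risk. On real disks,
# triple-check that if= is the GOOD source and of= is the BLANK target;
# reversed, it's fatal.
SRC=$(mktemp)   # stands in for the good source disk
DST=$(mktemp)   # stands in for the blank destination disk

dd if=/dev/urandom of="$SRC" bs=1M count=4 2>/dev/null   # fake "disk" contents
dd if="$SRC" of="$DST" bs=1M conv=fsync 2>/dev/null      # the clone itself

cmp -s "$SRC" "$DST" && echo "clone verified byte-for-byte"
rm -f "$SRC" "$DST"
```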