HELP ZFS Pool data recovery

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Will do. They're all the same Toshiba MG03ACA400 4tb drives. I will start cloning then!
Fingers crossed. I've only been allowing a short window between the "Drive B offline" and "Drive A wipe" steps in my testing, but the hope here is to be able to import your pool with the cloned/rebuilt A + live C rather than A+B+C, so the wider txg discrepancy shouldn't impact things. Assuming we can import successfully (in R/W mode), the scrub and resilver process will get the disks back in sync - at that point, update those backups. :wink:
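For reference, once the pool is back online in R/W mode, the resync step would presumably just be a normal scrub, watched with zpool status (pool name is a placeholder here):

Code:
zpool scrub poolname       # kick off the verify/repair pass across all members
zpool status -v poolname   # watch progress and any reported errors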
 

Dawson

Explorer
Joined
Jun 17, 2023
Messages
80
Awesome. Just to confirm (I've never used dd before):

sda = wiped drive
sdc = new drive

Is this command right?
Code:
dd if=/dev/sda of=/dev/sdc status=progress
 

Dawson

Explorer
Joined
Jun 17, 2023
Messages
80
S/N just for reference:

35Q7K5JBF - gptid/11b39573-ad95-11ed-8d1c-7df9cea98351 (Wiped drive)
56B8K29YF - gptid/11c0215d-ad95-11ed-8d1c-7df9cea98351 (Good drive)
54HFK1MCF - gptid/11bac542-ad95-11ed-8d1c-7df9cea98351 (Outdated Data Drive)
15H3K9R8F - New Drive #1
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Awesome. Just to confirm (I've never used dd before):

sda = wiped drive
sdc = new drive

Is this command right?
Code:
dd if=/dev/sda of=/dev/sdc status=progress
That should do it. There might be some benefit in adding the bs (block size) parameter, such as:

Code:
dd if=/dev/sda of=/dev/sdc bs=1M status=progress

to encourage dd to copy in 1M increments rather than the default of 512 bytes.
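If you want a sanity check once the clone finishes (entirely optional - it means another full read of both disks), the raw-device checksums should match, assuming you're cloning from a Linux environment given the sda/sdc names and nothing is writing to either disk:

Code:
sha256sum /dev/sda /dev/sdc   # the two hashes should be identical for an exact clone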
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I'm waiting with bated breath and hoping for a recovery. Very interesting. I haven't manually manipulated partition tables in 30+ years; back then I'd use a disk hex editor, as I was writing device drivers at the time. But if this recovery works, it might be something worth linking to - hopefully not often needed, though.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I haven't manually manipulated partition tables in 30+ years.

It's pretty easy these days; I gave a plausible but unverified run at it earlier in the thread on a virtual disk. The trick is to pull the layout from an existing disk and make sure your set of commands reproduces it exactly. Partitioning does not destroy the existing data on a disk.

Code:
root@nas0:/mnt/storage0 # gpart show da3
=>         34  11721045101  da3  GPT  (5.5T)
           34           94       - free -  (47K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  11716850696    2  freebsd-zfs  (5.5T)
  11721045128            7       - free -  (3.5K)


Basically, the important bit here is that the "freebsd-zfs" partition has to end up in the same location as it is on the other disks. You can use "gpart list da3" (for example) for more detail.

You basically do "gpart create -s GPT daX" to lay down a GPT-style partition table, then "gpart add" with appropriate flags to add your partitions one at a time, as sketched below. Much better than the grungy old utilities we used in the old days.
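As a rough sketch of those steps (daX is a placeholder, and the start/size values must be read off a surviving member of the actual pool rather than copied from my lab output above):

Code:
gpart create -s gpt daX                                        # lay down an empty GPT
gpart add -t freebsd-swap -b 128 -s 4194304 -i 1 daX           # swap partition, matching the others
gpart add -t freebsd-zfs -b 4194432 -s 11716850696 -i 2 daX    # ZFS partition at the same offset/size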
 
Joined
Oct 22, 2019
Messages
3,641
Much better than the grungy old utilities we used in the old days.
You're referring to the chisel and slab? :cool:

good-old-days.jpg

nailed-it.gif

I'll see myself out...
 
Joined
Oct 22, 2019
Messages
3,641
@HoneyBadger and @Dawson

Is all this (dd, gpart, zdb) being done via TrueNAS Core on bare metal directly? (I didn't catch if Proxmox was temporarily taken out of the picture for the recovery procedures.)
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@jgreco So I understand the partition table, but how will you know what the starting sectors will be for each partition? That, I feel, is the hard part, assuming the formatting did not destroy any of the data at the beginning of the freebsd-zfs partition. Is there a specific byte/word format you could search for to determine this? I'm not trying to stir up trouble; I genuinely am interested in this.

I'm not sure what kind of file allocation table ZFS uses; I most certainly do not think it's the traditional FAT, but maybe FAT16 or FAT32. If that were overwritten, then piecing the data together could be very time-consuming, but it can be done.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
@jgreco So I understand the partition table, but how will you know what the starting sectors will be for each partition? That, I feel, is the hard part, assuming the formatting did not destroy any of the data at the beginning of the freebsd-zfs partition. Is there a specific byte/word format you could search for to determine this? I'm not trying to stir up trouble; I genuinely am interested in this.

Nobody would ever accuse you of stirring up trouble, Mister Human Tacos Guy.

Anyways ... it's trivial. When the pool was generated, the only variable here is the size of the swap partition. You look at the existing disks for the size of the swap partition. The start of the ZFS partition of each disk will be the same. You only need to duplicate this on the "erased" disk. Will it work? Maybe, maybe not, but if it doesn't work, it won't be because of a wrong location. It'll be because of missing data. As others have pointed out,

Code:
root@nas0:/mnt/storage0 # gpart show da2
=>         34  11721045101  da2  GPT  (5.5T)
           34           94       - free -  (47K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  11716850696    2  freebsd-zfs  (5.5T)
  11721045128            7       - free -  (3.5K)

root@nas0:/mnt/storage0 # gpart show da3
=>         34  11721045101  da3  GPT  (5.5T)
           34           94       - free -  (47K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  11716850696    2  freebsd-zfs  (5.5T)
  11721045128            7       - free -  (3.5K)

root@nas0:/mnt/storage0 # gpart show da4
=>         34  11721045101  da4  GPT  (5.5T)
           34           94       - free -  (47K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  11716850696    2  freebsd-zfs  (5.5T)
  11721045128            7       - free -  (3.5K)

root@nas0:/mnt/storage0 # gpart show da5
=>         34  11721045101  da5  GPT  (5.5T)
           34           94       - free -  (47K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  11716850696    2  freebsd-zfs  (5.5T)
  11721045128            7       - free -  (3.5K)

root@nas0:/mnt/storage0 #


They're all the same --- 4194432. Then you just use zdb -l to see if there's anything useful out there on the disk once you regenerate that partition table. Adding a partition does not change the contents of the disk sectors that make up the partition.
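In practice the quick check looks something like this (device name is whichever disk you just re-partitioned):

Code:
zdb -l /dev/da3p2 | grep -E 'name:|txg:'   # pull the name/txg lines out of the label dump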
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That big, chunky 2G swap space also has the unintended-but-welcome effect of being a very large buffer against a hypervisor or other "disk wipe script" accidentally overwriting anything within the actual ZFS filesystem on the second partition.
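Roughly speaking, the pool data doesn't start until sector 4194432, so a wipe that only touches the first couple of gigabytes of the disk never reaches it:

Code:
# 4194432 sectors x 512 bytes/sector = 2147549184 bytes, just over 2 GiB
echo $((4194432 * 512))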
 

Dawson

Explorer
Joined
Jun 17, 2023
Messages
80
Assumptions: all three of your drives are identical models. If we need to play with the partition sizes, I'll need to do more work.

In the examples below:

ada0 is "Drive A" that got the Proxmox wipe.
ada1 is "Drive B" that we thought failed, but it was just loose cabling and it's now recovered.
ada2 is "Drive C" that's the last disk standing.

You can see from ada1p2 that the last transaction group committed was 2025670 and ada2p2 is at 2083363 - assuming you basically hovered around the 5-second default txg timeout period, that's around 80 hours.

Confirm with serial numbers that the order of these drives hasn't changed.
So, step zero is underway - clone Drive A to Clone A.

Assuming order has been maintained ada0 should have label 11b39573-ad95-11ed-8d1c-7df9cea98351

So, here's what we're going to do.

Finish your clone of Drive A to Clone A. Pull the original Drive A, set it aside, and replace it with Clone A. Get it presented back to the system in the exact same way. Instructions are in the spoiler.

Good. Buckle up.

Again, confirm with serial numbers that the order of these drives hasn't changed. We don't want to target the wrong drives.

Check the partition table on Drive C with
gpart backup ada2
If it looks good and has output like the below:
Code:
GPT 128
1   freebsd-swap      128  4194304
2    freebsd-zfs  4194432 12582744

If it looks similar to the above (but with a much bigger number at the end for the freebsd-zfs partition), then:

Clone the partition table from Drive C to Clone A with
gpart backup ada2 | gpart restore ada0

Check the partition table on Clone A with
gpart backup ada0
It should be identical (same model drives, same partition layout)

See if you get an output from zdb -l ada0p2 now. If you do, then this is a good thing - check the txg number near the top. Hopefully it's closer to ada2p2's 2083363 than the older ada1p2 number.

Rewrite the missing GPTID of 11b39573-ad95-11ed-8d1c-7df9cea98351 to Clone A with
gpart modify -i2 11b39573-ad95-11ed-8d1c-7df9cea98351 ada0

Reboot. Go back to the command line and check the results of zpool import, which will hopefully show the pool as available for import:

Code:
root@freenas-lab[~]# zpool import
   pool: recoverme
     id: 9933807979428463458
  state: DEGRADED
status: One or more devices are missing from the system.
 action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
 config:


You'll probably need to do zpool import -F or -FX as well.

Tried the backup and restore command you gave me. This is the output.
ada1 is the good drive; ada2 is the cloned drive.

Code:
root@truenas[~]# gpart backup ada1 | gpart restore ada2
gpart: geom 'ada2': File exists
root@truenas[~]#


gpart backup ada1 looks like you said it would, though. That's good, at least.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Tried the backup and restore command you gave me. This is the output.
ada1 is the good drive; ada2 is the cloned drive.

Code:
root@truenas[~]# gpart backup ada1 | gpart restore ada2
gpart: geom 'ada2': File exists
root@truenas[~]#


gpart backup ada1 looks like you said it would, though. That's good, at least.
Perhaps Proxmox dropped some manner of filesystem on there when it did the disk wipe.

What does gpart backup ada2 look like? I'd direct it to a file just to be safe, with gpart backup ada2 > ada2.gpart, so you can keep the resulting file. If ada2 doesn't look like the output from ada1, then I'd say we overwrite it with gpart backup ada1 | gpart restore -F ada2 and then try the zdb -l /dev/ada2p2 line to see if it picks up a valid label.
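Collected in order, that's roughly (using your current ada1/ada2 naming):

Code:
gpart backup ada2 > ada2.gpart               # save the existing table to a file, just in case
gpart backup ada1 | gpart restore -F ada2    # force-overwrite ada2's table with ada1's layout
zdb -l /dev/ada2p2                           # see whether a valid ZFS label is now visible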
 

Dawson

Explorer
Joined
Jun 17, 2023
Messages
80
Perhaps Proxmox dropped some manner of filesystem on there when it did the disk wipe.

What does gpart backup ada2 look like? I'd direct it to a file just to be safe, with gpart backup ada2 > ada2.gpart, so you can keep the resulting file. If ada2 doesn't look like the output from ada1, then I'd say we overwrite it with gpart backup ada1 | gpart restore -F ada2 and then try the zdb -l /dev/ada2p2 line to see if it picks up a valid label.
That worked. A quick change in the way the drives are labeled, as I had to move things to a different PC: ada1 = good drive, ada0 = clone.

Code:
root@truenas[~]# gpart backup ada1 | gpart restore -F ada0
root@truenas[~]# gpart backup ada0
GPT 128
1   freebsd-swap        128    4194304
2    freebsd-zfs    4194432 7809842696
root@truenas[~]#

^ looks good

I continued to follow your steps in the instructions and here's our next roadblock:

Code:
root@truenas[~]# zdb -l /dev/ada0p2
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'Tank'
    state: 0
    txg: 2083363
    pool_guid: 2717787786726095806
    errata: 0
    hostid: 1361597103
    hostname: ''
    top_guid: 12486228298157547035
    guid: 2470301540868142256
    vdev_children: 2
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 12486228298157547035
        nparity: 1
        metaslab_array: 74
        metaslab_shift: 34
        ashift: 12
        asize: 11995904212992
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 2470301540868142256
            path: '/dev/gptid/11b39573-ad95-11ed-8d1c-7df9cea98351'
            DTL: 48182
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 3459701388371009720
            path: '/dev/gptid/11bac542-ad95-11ed-8d1c-7df9cea98351'
            not_present: 1
            DTL: 48181
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 16273966696595496550
            path: '/dev/gptid/11c0215d-ad95-11ed-8d1c-7df9cea98351'
            DTL: 48180
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3
root@truenas[~]# gpart modify -i2 11b39573-ad95-11ed-8d1c-7df9cea98351 ada0
gpart: Invalid number of arguments.
root@truenas[~]#


The
Code:
gpart modify -i2 11b39573-ad95-11ed-8d1c-7df9cea98351 ada0
command is incomplete.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That worked. A quick change in the way the drives are labeled, as I had to move things to a different PC: ada1 = good drive, ada0 = clone.

Code:
root@truenas[~]# gpart backup ada1 | gpart restore -F ada0
root@truenas[~]# gpart backup ada0
GPT 128
1   freebsd-swap        128    4194304
2    freebsd-zfs    4194432 7809842696
root@truenas[~]#

^ looks good

I continued to follow your steps in the instructions and here's our next roadblock:

Code:
root@truenas[~]# zdb -l /dev/ada0p2
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'Tank'
    state: 0
    txg: 2083363
    pool_guid: 2717787786726095806
    errata: 0
    hostid: 1361597103
    hostname: ''
    top_guid: 12486228298157547035
    guid: 2470301540868142256
    vdev_children: 2
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 12486228298157547035
        nparity: 1
        metaslab_array: 74
        metaslab_shift: 34
        ashift: 12
        asize: 11995904212992
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 2470301540868142256
            path: '/dev/gptid/11b39573-ad95-11ed-8d1c-7df9cea98351'
            DTL: 48182
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 3459701388371009720
            path: '/dev/gptid/11bac542-ad95-11ed-8d1c-7df9cea98351'
            not_present: 1
            DTL: 48181
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 16273966696595496550
            path: '/dev/gptid/11c0215d-ad95-11ed-8d1c-7df9cea98351'
            DTL: 48180
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3
root@truenas[~]# gpart modify -i2 11b39573-ad95-11ed-8d1c-7df9cea98351 ada0
gpart: Invalid number of arguments.
root@truenas[~]#


The
Code:
gpart modify -i2 11b39573-ad95-11ed-8d1c-7df9cea98351 ada0
command is incomplete.

Whoops. Forgot to add the -l (lowercase L) switch in the code.

Code:
gpart modify -i2 -l 11b39573-ad95-11ed-8d1c-7df9cea98351 ada0


If it's successful, it will just report ada0p2 modified.
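To double-check it took, something like this should now show the label on partition 2:

Code:
gpart show -l ada0    # -l lists partition labels instead of types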

The good news, if we look at your zdb output, is this line:

Code:
    txg: 2083363


That's an exact match to txg 2083363 from the "good" drive and the log.
 

Dawson

Explorer
Joined
Jun 17, 2023
Messages
80
All looks good; I had to do zpool import -FX, otherwise I'd get I/O errors. It's been importing for 20-ish minutes now. Does this process usually take some time? I can see it's using some CPU and definitely reading the disk a lot even now, so it's definitely not just sitting idle.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
All looks good; I had to do zpool import -FX, otherwise I'd get I/O errors. It's been importing for 20-ish minutes now. Does this process usually take some time? I can see it's using some CPU and definitely reading the disk a lot even now, so it's definitely not just sitting idle.

I imagine it will take some time to go through and determine what needs to be done here to get a valid pool state. Open a second SSH session and run gstat -p to see what the disks are doing, as well as top or htop to see which process is using the most CPU.
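Something like this in the second session (assuming you're at a Core/FreeBSD shell):

Code:
gstat -p    # per-disk I/O: %busy, reads and writes per second
top -S      # CPU usage, including kernel/system threads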
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
With it doing 100% reads from ada0+ada1 (ada2 is cache, ada3 is log?), I'd suggest that the best course of action is to let it continue. I can't really tell you how long it will take, but at 100 MB/s it would take around 11-12 hours to walk a full 4 TB disk. If the disks aren't full, it could take less time.
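Back-of-the-envelope math on that estimate:

Code:
# 4 TB / 100 MB/s = 4,000,000 MB / 100 MB/s = 40,000 s, roughly 11 hours
echo $((4000000 / 100 / 3600))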
 