Remove Drive from Mirror - I've Made Some Mistakes

ASCII

Cadet
Joined
Jan 10, 2016
Messages
4
Long story short, I have a partially corrupt pool caused by some gross negligence on my part: no alerting set up and a busy few months. Once I identified the failure, I picked up some new (larger) drives and planned to upgrade the failing mirror, then simply delete the corrupted files once that was done. This is where I ran into some weird issues.

First, the output of zpool status and glabel status:

Code:
root@freenas:~ # zpool status
  pool: StoragePool001
state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 6.07T in 0 days 13:24:10 with 4594493 errors on Wed Jan 15 23:44:26 2020
config:

        NAME                                              STATE     READ WRITE CKSUM
        StoragePool001                                    DEGRADED 8.76M     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/ba8e4f65-f100-11e8-830c-000c29f0189f    ONLINE       0     0     0
            gptid/bb325aad-f100-11e8-830c-000c29f0189f    ONLINE       0     0     0
          mirror-1                                        ONLINE       0     0     0
            gptid/bc00738c-f100-11e8-830c-000c29f0189f    ONLINE       0     0     0
            gptid/bcafc4af-f100-11e8-830c-000c29f0189f    ONLINE       0     0     0
          mirror-2                                        DEGRADED 17.5M     0     0
            replacing-0                                   UNAVAIL      0     0     0
              12595850469584586315                        UNAVAIL      0     0     0  was /dev/gptid/bd8e1960-f100-11e8-830c-000c29f0189f
              gptid/4876745b-3736-11ea-8eba-000c29f0189f  ONLINE       0     0     0
              gptid/e86f027a-37b2-11ea-8eba-000c29f0189f  ONLINE       0     0     0
            gptid/be5fc6de-f100-11e8-830c-000c29f0189f    ONLINE       0     0 17.5M

errors: 4594488 data errors, use '-v' for a list

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:56 with 0 errors on Mon Jan 13 03:45:57 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

root@freenas:~ # glabel status
                                      Name  Status  Components
                           iso9660/FreeNAS     N/A  cd0
gptid/2b93f4e2-f0f9-11e8-aa6c-000c29f0189f     N/A  da0p1
gptid/bc00738c-f100-11e8-830c-000c29f0189f     N/A  da1p2
gptid/bcafc4af-f100-11e8-830c-000c29f0189f     N/A  da2p2
gptid/ba8e4f65-f100-11e8-830c-000c29f0189f     N/A  da3p2
gptid/be5fc6de-f100-11e8-830c-000c29f0189f     N/A  da4p2
gptid/bb325aad-f100-11e8-830c-000c29f0189f     N/A  da5p2
gptid/4876745b-3736-11ea-8eba-000c29f0189f     N/A  da6p2
gptid/e86f027a-37b2-11ea-8eba-000c29f0189f     N/A  da7p2
gptid/bbf7966a-f100-11e8-830c-000c29f0189f     N/A  da1p1
gptid/485c4b69-3736-11ea-8eba-000c29f0189f     N/A  da6p1
gptid/e85b1afd-37b2-11ea-8eba-000c29f0189f     N/A  da7p1



Mirror-2 is in a sorry state and contained the failing hardware. I replaced the 4TB drive 12595850469584586315 with the 10TB gptid/4876745b-3736-11ea-8eba-000c29f0189f, which caused a resilver, and I figured I could then simply offline the old drive and replace its partner, gptid/be5fc6de-f100-11e8-830c-000c29f0189f. However, when trying to offline 12595850469584586315 I receive the following error:

Code:
Error: Traceback (most recent call last):

  File "/usr/local/lib/python3.6/site-packages/tastypie/resources.py", line 219, in wrapper
    response = callback(request, *args, **kwargs)

  File "./freenasUI/api/resources.py", line 877, in offline_disk
    notifier().zfs_offline_disk(obj, deserialized.get('label'))

  File "./freenasUI/middleware/notifier.py", line 1056, in zfs_offline_disk
    raise MiddlewareError('Disk offline failed: "%s"' % error)

freenasUI.middleware.exceptions.MiddlewareError: [MiddlewareError: Disk offline failed: "cannot offline /dev/gptid/bd8e1960-f100-11e8-830c-000c29f0189f: no valid replicas, "]
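
For context, my understanding is that the GUI replace/offline actions boil down to something like the following zpool commands (device names taken from the status output above; I did everything through the web UI, so this is only my guess at what it ran underneath):

Code:
# replace the failed 4TB member of mirror-2 with the first new 10TB drive
zpool replace StoragePool001 12595850469584586315 gptid/4876745b-3736-11ea-8eba-000c29f0189f

# then take the old, now-missing member out of service (the step that fails above)
zpool offline StoragePool001 gptid/bd8e1960-f100-11e8-830c-000c29f0189f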


Where I really made a stupid mistake was trying to add the second new 10TB drive as yet another replacement for the original failing drive (don't ask, it was a stupid thought) instead of using it to replace gptid/be5fc6de-f100-11e8-830c-000c29f0189f, which is still in the mirror.

So the crux of my problem: when I try to remove either of the two new drives I added, I get the same "no valid replicas" error. How do I go about removing gptid/4876745b-3736-11ea-8eba-000c29f0189f or gptid/e86f027a-37b2-11ea-8eba-000c29f0189f and replacing gptid/be5fc6de-f100-11e8-830c-000c29f0189f with one of those two drives?

And lastly, will that let me bring the pool back into a "stable" state so I can simply clear the corrupted files and move on? I'm not really worried about losing the data so much as I don't want to rebuild the pool and the integrations I've built around it.
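
In case it helps to see it spelled out, the end state I'm after would be reached with something like this at the shell (command forms from the zpool man page; I know the GUI normally handles partitioning and swap, so treat this as a sketch):

Code:
# drop the redundant second 10TB drive out of the replacing vdev
zpool detach StoragePool001 gptid/e86f027a-37b2-11ea-8eba-000c29f0189f

# then use it to replace the remaining original drive in mirror-2
zpool replace StoragePool001 gptid/be5fc6de-f100-11e8-830c-000c29f0189f gptid/e86f027a-37b2-11ea-8eba-000c29f0189f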
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I don't think you can offline a device that doesn't exist. Unfortunately, I don't have my VM host online, so I can't try to replicate what you did. I'm pretty sure you need to wait for your replacement(s) to finish, though, before trying anything else. You might have to descend to the command line (you should be able to detach a resilvering device at the command line, but I wouldn't recommend mucking about with it until at least one of those drives gets resilvered and is fully online).
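A quick way to confirm whether anything is still resilvering before touching it is plain zpool status; the "scan:" line shows any resilver or scrub in progress:

Code:
zpool status StoragePool001 | grep -A 2 "scan:"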
 

ASCII

Cadet
Joined
Jan 10, 2016
Messages
4
The issue is that the drives under the "replacing" section have been like this for a number of days now. Resilvering completed after I initiated the drive replacement, and a scrub completed after that. My understanding is that I should be able to simply offline the failed/removed drive once a replacement drive has been resilvered.

Any thoughts on diagnostic commands I can run to see why this replacement process is "stuck"?
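
For what it's worth, these are the sorts of things I was planning to look at, just guessing at what might be useful:

Code:
# full list of files with data errors
zpool status -v StoragePool001

# what replace/offline operations the pool has actually recorded
zpool history StoragePool001

# SMART health of the remaining original mirror-2 member (da4 per the glabel status above)
smartctl -a /dev/da4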
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I would touch nothing until you have a backup; once you have one, I would nuke the whole thing and just remake the pool correctly.
If you still want to try to fix the existing pool after that, it should be possible to just detach drives at the command line until you get it where you want it.
You want to detach, not offline; these are not the same thing.
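To be clear about the difference, the two command forms (straight from the zpool man page, nothing FreeNAS-specific) are:

Code:
# offline: takes a device out of service but leaves it a member of the pool
zpool offline <pool> <device>

# detach: removes a device from a mirror (or from an in-progress replace) entirely
zpool detach <pool> <device>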
 

ASCII

Cadet
Joined
Jan 10, 2016
Messages
4
Not worried about the data, to be honest; I can stand to lose most of it without issue, and the important stuff is already backed up.

I'm trying to learn a little from my (dumb) mistake and fix this, even if it means living with some corrupted data. Any troubleshooting recommendations are appreciated.
 

ASCII

Cadet
Joined
Jan 10, 2016
Messages
4
Output from issuing detach commands for the drives in question:

Code:
root@freenas:~ # zpool detach StoragePool001 gptid/e86f027a-37b2-11ea-8eba-000c29f0189f
cannot detach gptid/e86f027a-37b2-11ea-8eba-000c29f0189f: no valid replicas

root@freenas:~ # zpool detach StoragePool001 gptid/4876745b-3736-11ea-8eba-000c29f0189f
cannot detach gptid/4876745b-3736-11ea-8eba-000c29f0189f: no valid replicas

root@freenas:~ # zpool detach StoragePool001 12595850469584586315
cannot detach 12595850469584586315: no valid replicas
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Hmm. I would guess you have no intact copies of some of the corrupted files, and thus no valid replicas. Things like this are one of the reasons I was suggesting scrapping the pool and starting anew.
You might be able to delete the files with data errors and try again? (See the sketch below.)
Alternatively, you might be able to "-f" it to have it detach anyway, though I'm not sure exactly how that will behave.
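If you go the delete-the-damaged-files route, the rough sequence (assuming zpool status -v gives you usable file paths and the corruption isn't in metadata) would be something like:

Code:
# list the affected files, then delete or restore each one it names
zpool status -v StoragePool001

# clear the error counters once the bad files are gone
zpool clear StoragePool001

# rescrub to confirm nothing new turns up, then retry the detach
zpool scrub StoragePool001
zpool detach StoragePool001 gptid/e86f027a-37b2-11ea-8eba-000c29f0189f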
 