Faulted gptid can't be removed or replaced

Status
Not open for further replies.
Joined
Jan 20, 2015
Messages
6
I'm very green with ZFS as this is my first experience with it, so please bear with me.

Some background. I've inherited a NAS which I'm already fairly positive has some massive hardware issues. It cannot be replaced yet. I did not build this NAS, but it appears to be a series of RAID-1s created using the LSI RAID card, which are then added to zpools.

This weekend, there was an unexpected failure where several drives dropped offline. Entire RAID-1s were lost. The LSI controller thinks that drives have failed, though I suspect either the backplane or RAID card. Regardless, I have one impacted pool:

Code:
[root@ redacted-hostname] ~# zpool status nfs-vol2
  pool: nfs-vol2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 5h57m with 0 errors on Sun Dec 28 05:59:17 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        nfs-vol2                                        DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/c2369674-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c25f0288-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c2870e79-6ec3-11e4-a59f-0025901d2102  FAULTED      0    84     0  too many errors
            gptid/c2ae317d-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
          raidz1-1                                      DEGRADED     0     0     0
            gptid/c2d7ce90-6ec3-11e4-a59f-0025901d2102  FAULTED      0    92     0  too many errors
            gptid/c2ff73b2-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c3268d97-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c34a69ae-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0

errors: No known data errors
[root@ redacted-hostname] ~#


Neither of the faulted drives exists in /dev:
Code:
[root@ redacted-hostname] ~# ls -l /dev/gptid/c2870e79-6ec3-11e4-a59f-0025901d2102
ls: /dev/gptid/c2870e79-6ec3-11e4-a59f-0025901d2102: No such file or directory
[root@ redacted-hostname] ~# ls -l /dev/gptid/c2d7ce90-6ec3-11e4-a59f-0025901d2102
ls: /dev/gptid/c2d7ce90-6ec3-11e4-a59f-0025901d2102: No such file or directory
[root@ redacted-hostname] ~#


I tried to replace the faulted disks through the GUI, but I get this:
Code:
Error: Disk replacement failed: "cannot replace gptid/c2d7ce90-6ec3-11e4-a59f-0025901d2102 with gptid/e10c1ba4-a107-11e4-b02f-0025901d2102: no such device in pool, "


Note that while other disks are displayed as mfid[0-16], the two faulted disks only show the GPTID.

If I try to remove one of the faulted devices from a shell, I get this:
Code:
[root@ redacted-hostname] ~# zpool remove nfs-vol2 gptid/c2870e79-6ec3-11e4-a59f-0025901d2102
cannot remove gptid/c2870e79-6ec3-11e4-a59f-0025901d2102: no such device in pool


Likewise, gpart list does not find the device:
Code:
[root@ redacted-hostname] ~# gpart list | grep "c2870e79-6ec3-11e4-a59f-0025901d2102"
[root@ redacted-hostname] ~#


I'm sure this is something stupidly simple, but I cannot figure out how to replace the device, since I can't remove a device that doesn't exist.
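
For reference, a minimal sketch of the route that would normally apply here. On this version of ZFS, 'zpool remove' only detaches hot spares, cache, and log devices, so it can't take a data disk out of a raidz vdev; a FAULTED raidz member is normally handled with 'zpool offline' plus 'zpool replace'. The replacement partition name below is purely a placeholder, and because the faulted gptid no longer resolves to a device node, these commands may fail with the same "no such device in pool" error, which is where the GUID-based approach discussed later in the thread comes in.
Code:
# hypothetical replacement partition; not a device from this system
zpool offline nfs-vol2 gptid/c2870e79-6ec3-11e4-a59f-0025901d2102
zpool replace nfs-vol2 gptid/c2870e79-6ec3-11e4-a59f-0025901d2102 gptid/<new-partition-gptid>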

I gather that RAIDZ-1 is not recommended, yet here we are. Blowing away the storage pool is not an option at this time.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If you are using mfid devices, then you are definitely using hardware RAID on ZFS (of course you said that up front too). I'm not surprised that you are having problems. Now you're going to have a tough time getting out of that rut (if you even can). At this point you're probably better off backing up your data, destroying the pool and getting proper hardware. You're lucky.. usually mfid users are putting in tickets with iXsystems where I have to tell them to kiss the data goodbye.

I've got no advice because this will quickly get really complicated and will go far beyond what I do in the forum. Sorry.

Edit: Yes, I realize you're probably about to tell me that it's not an option and you need to make do for now.. but this is a technical problem and just because your bosses aren't wanting to deal with it now doesn't mean they can ignore it. This nasty hairball was made by your previous admin, and your bosses are going to be stuck with dealing with it now.. like it or not. :( We don't always have the choice on when we want to deal with something, and this is pretty much where things are going.

As my local Jimmy John's says on the wall "You should do the things you need to do when you need to do them so you can do the things you want to do when you want to do them." I'm guessing they wanted what they got, and now they need to do what they need to do because they already tried to do what they wanted. Time for them to man up, own the mistake, and take action to fix it.

I will say that you are far closer to a failed pool than you probably realize, and it's also possible you will be unable to resilver the pool anyway (or it will start and crash and then your pool will be gone). So backing up your data right now should be the absolute #1 priority over everything else. If your pool goes away you'll find the recovery costs are quite expensive. The last time someone tried it was $20k for 500GB of data.

Good luck.
 
Last edited:

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Cyberjock put in a nice edit as I was typing. Couldn't agree more on the backups. But I wouldn't give up hope at this point.

I don't have much more hope than Cyberjock, but so far it looks like you got pretty lucky, and only one pair, acting as a drive, fell out of each vdev. Assuming you can get backups done, how about posting actual hardware information and the FreeNAS version so we have the lay of the land?

So start with the basics.

Show us the output of 'camcontrol devlist', 'glabel status', and 'dmesg'; if you want to attach a debug file, that might be useful.

What does the LSI controller BIOS show you? Are there missing drives or faulted mirrors? Bottom line: you have to fix them or provide some other means for ZFS to replace the faulted devices. We likely need to fix the storage subsystem before we can tackle the pool. If the RAID controller or backplane failed, it has to be replaced; since hardware RAID is in the mix, we can't just throw the pool in a different box.

Can't tell what you were doing with your replace command. Where did the random gptid device you were trying to use come from?

Even if we can't offline or replace the device via gptid, we can remove it via GUID. See https://bugs.pcbsd.org/issues/5035 for an example by William.
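
A minimal sketch of that GUID-based approach, assuming FreeNAS 9.x keeps its pool cache at /data/zfs/zpool.cache (that path, and the placeholder GUID and gptid below, are assumptions rather than values from this system):
Code:
# dump the cached pool configuration; each vdev child is listed with a numeric "guid"
zdb -C -U /data/zfs/zpool.cache nfs-vol2
# then address the faulted member by that numeric GUID instead of the unresolvable gptid
zpool offline nfs-vol2 <guid-of-faulted-member>
zpool replace nfs-vol2 <guid-of-faulted-member> gptid/<new-partition-gptid>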

So hopefully with a little work you can get the pool stable. I'm hoping you get lucky on the resilver; it's not like you can leave it as is. Ultimately you're still screwed until you can back up, rebuild a proper pool, and restore.

Good luck.
 
Joined
Jan 20, 2015
Messages
6
cyberjock said:
If you are using mfid devices, then you are definitely using hardware RAID on ZFS (of course you said that up front too). I'm not surprised that you are having problems. Now you're going to have a tough time getting out of that rut (if you even can). At this point you're probably better off backing up your data, destroying the pool and getting proper hardware. You're lucky.. usually mfid users are putting in tickets with iXsystems where I have to tell them to kiss the data goodbye.

First, thanks for responding. I believe the primary issue, aside from the bad initial setup, is failing hardware. I don't trust it at all. The hardware was obviously put together from parts lying around the office. The chassis cover doesn't fit without displacing the RAID card. I found dust bunnies jammed into SATA slots (wondered why a drive was flaky; found out why). I've come to terms with the fact that I can only do so much, and may lose the data set.

cyberjock said:
Good luck.

Thanks!
 
Joined
Jan 20, 2015
Messages
6
mjws00 said:
Cyberjock put in a nice edit as I was typing. Couldn't agree more on the backups. But I wouldn't give up hope at this point.

I don't have much more hope than Cyberjock, but so far it looks like you got pretty lucky, and only one pair, acting as a drive, fell out of each vdev. Assuming you can get backups done, how about posting actual hardware information and the FreeNAS version so we have the lay of the land?

So start with the basics.

Show us the output of 'camcontrol devlist', 'glabel status', and 'dmesg'; if you want to attach a debug file, that might be useful.

Chassis: 4U Supermicro, 3.5" drive bays
Backplane: Unknown
Motherboard: X8DT3 series
RAID Card: LSI 9260-8i
FreeNAS Version: FreeNAS-9.2.1.5-RELEASE-x64 (80c1d35)
uname:
Code:
FreeBSD redacted-hostname 9.2-RELEASE-p4 FreeBSD 9.2-RELEASE-p4 #0 r262572+17a4d3d: Wed Apr 23 10:09:38 PDT 2014     root@build3.ixsystems.com:/tank/home/jkh/build/9.2.1/freenas/os-base/amd64/fusion/jkh/9.2.1/freenas/FreeBSD/src/sys/FREENAS.amd64  amd64


camcontrol devlist: The OS is running off a Samsung 840 Pro on SATA0. The storage lives in the 4U Supermicro chassis on a series of 2TB SATA disks in RAID-1s (and one RAID-0, no idea why) behind an LSI 9260-8i RAID controller. Presumably only the SSD shows up below because the mfi(4) driver exposes the RAID volumes as mfid devices rather than through CAM:
Code:
[root@ redacted-hostname] ~# camcontrol devlist
<Samsung SSD 840 PRO Series DXM05B0Q>  at scbus0 target 0 lun 0 (ada0,pass0)


glabel status:
Code:
[root@ redacted-hostname] ~# glabel status
                                      Name  Status  Components
gptid/c2369674-6ec3-11e4-a59f-0025901d2102     N/A  mfid0p2
gptid/c25f0288-6ec3-11e4-a59f-0025901d2102     N/A  mfid1p2
gptid/c2ae317d-6ec3-11e4-a59f-0025901d2102     N/A  mfid3p2
gptid/c2ff73b2-6ec3-11e4-a59f-0025901d2102     N/A  mfid5p2
gptid/c3268d97-6ec3-11e4-a59f-0025901d2102     N/A  mfid6p2
gptid/c34a69ae-6ec3-11e4-a59f-0025901d2102     N/A  mfid7p2
gptid/3e5edd97-832e-11e4-b5d9-0025901d2102     N/A  mfid8p2
gptid/af461e06-80a9-11e4-a3a4-0025901d2102     N/A  mfid9p2
gptid/64f02a1f-80a9-11e4-a3a4-0025901d2102     N/A  mfid10p2
gptid/655a5594-80a9-11e4-a3a4-0025901d2102     N/A  mfid11p2
gptid/65c104b5-80a9-11e4-a3a4-0025901d2102     N/A  mfid12p2
gptid/afab3c56-80a9-11e4-a3a4-0025901d2102     N/A  mfid13p2
gptid/cca50f21-8163-11e4-8164-0025901d2102     N/A  mfid15p2
                             ufs/FreeNASs3     N/A  ada0s3
                             ufs/FreeNASs4     N/A  ada0s4
                            ufs/FreeNASs1a     N/A  ada0s1a
gptid/e305ba64-9df9-11e4-b02f-0025901d2102     N/A  mfid16p1
gptid/e3155ab5-9df9-11e4-b02f-0025901d2102     N/A  mfid16p2
gptid/e10527dd-a107-11e4-b02f-0025901d2102     N/A  mfid14p1
gptid/e10c1ba4-a107-11e4-b02f-0025901d2102     N/A  mfid14p2
 
Joined
Jan 20, 2015
Messages
6
dmesg is well over the 3000-character limit. It's full of power state changes and unexpected sense messages. If you'd like, I'll post it in parts.

Which switches do you want to see from freenas-debug?
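
One way to trim that down to something postable, using only standard tools (no FreeNAS-specific debug switches assumed; the output path is arbitrary):
Code:
camcontrol devlist > /tmp/nas-diag.txt
glabel status >> /tmp/nas-diag.txt
# keep only the mfi and sense-error lines from dmesg
dmesg | egrep -i 'mfi|sense' | tail -n 200 >> /tmp/nas-diag.txt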

mjws00 said:
What does the LSI controller BIOS show you? Are there missing drives or faulted mirrors? Bottom line: you have to fix them or provide some other means for ZFS to replace the faulted devices. We likely need to fix the storage subsystem before we can tackle the pool. If the RAID controller or backplane failed, it has to be replaced; since hardware RAID is in the mix, we can't just throw the pool in a different box.

Code:
[root@ redacted-hostname] ~# mfiutil show adapter
mfi0 Adapter:
    Product Name: LSI MegaRAID SAS 9260-8i
   Serial Number: <redacted>
        Firmware: 12.12.0-0036
     RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
  Battery Backup: present
           NVRAM: 32K
  Onboard Memory: 512M
  Minimum Stripe: 8k
  Maximum Stripe: 1M


There are definitely issues with the underlying storage. It's less bad than when I got to it this weekend; however, the drive location feature has stopped working. Given that the storage is already compromised, I don't feel comfortable pulling and replacing disks by guesswork.
Code:
[root@ redacted-hostname] ~# mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe  State   Cache   Name
mfid0 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid1 ( 1862G) RAID-1       1M OPTIMAL Enabled
     3 ( 1862G) RAID-1       1M OFFLINE Enabled
mfid3 ( 1862G) RAID-1       1M OPTIMAL Enabled
     5 ( 1862G) RAID-1       1M OFFLINE Enabled
mfid5 ( 1862G) RAID-1       1M OPTIMAL Enabled
mfid6 ( 1862G) RAID-1       1M OPTIMAL Enabled
mfid7 ( 1862G) RAID-0      64k OPTIMAL Enabled
mfid8 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid9 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid10 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid11 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid12 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid13 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid14 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid15 ( 1862G) RAID-1      64k OPTIMAL Enabled
mfid16 ( 1862G) RAID-1      64k OPTIMAL Enabled
[root@ redacted-hostname] ~#
[root@ redacted-hostname] ~#
[root@ redacted-hostname] ~# mfiutil show drives
mfi0 Physical Drives:
6 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S0
7 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S1
8 (   0.0) FAILED    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S2
9 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S3
10 (   0.0) FAILED    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S6
12 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S1
13 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S2
14 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S3
15 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S4
16 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S9
18 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S4
19 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S5
20 (   0.0) FAILED    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S7
21 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S5
22 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S8
23 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S10
24 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S10
25 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S8
26 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S9
27 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E1:S11
28 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S11
29 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S14
30 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S15
31 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S16
32 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S17
33 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S21
34 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S23
35 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S22
36 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S0
37 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S7
38 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S6
39 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S13
40 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S20
41 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S19
42 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S18
43 (   0.0) ONLINE    <Hitachi HDS72202 A3MA serial=redacted-serial-number> SATA E2:S12


mjws00 said:
Can't tell what you were doing with your replace command. Where did the random gptid device you were trying to use come from?
It's visible in the web interface, where one would normally replace failed disks, and here:

Code:
[root@ redacted-hostname] ~# zpool status nfs-vol2
  pool: nfs-vol2
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 5h57m with 0 errors on Sun Dec 28 05:59:17 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        nfs-vol2                                        DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/c2369674-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c25f0288-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c2870e79-6ec3-11e4-a59f-0025901d2102  FAULTED      0    84     0  too many errors
            gptid/c2ae317d-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
          raidz1-1                                      DEGRADED     0     0     0
            gptid/c2d7ce90-6ec3-11e4-a59f-0025901d2102  FAULTED      0    92     0  too many errors
            gptid/c2ff73b2-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c3268d97-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0
            gptid/c34a69ae-6ec3-11e4-a59f-0025901d2102  ONLINE       0     0     0

errors: No known data errors


mjws00 said:
Even if we can't offline or replace the device via gptid, we can remove it via GUID. See https://bugs.pcbsd.org/issues/5035 for an example by William.

So hopefully with a little work you can get the pool stable. I'm hoping you get lucky on the resilver; it's not like you can leave it as is. Ultimately you're still screwed until you can back up, rebuild a proper pool, and restore.

Good luck.

Thanks for taking the time to respond, and especially for the pcbsd link. I'll try to apply this tomorrow when I'm back in the office, and in a position to hit the data center if needed.

Edit: I have two unused RAID-1 disk sets which are available to replace the failed sets. Forgot to mention that.
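
For what it's worth, a minimal sketch of pressing one of those spare RAID-1 volumes into service as the replacement. 'mfidX' is a placeholder, the 2 GB swap partition simply mirrors what the FreeNAS GUI typically creates, and judging by the glabel output above the spare volumes may already be partitioned this way:
Code:
# skip the gpart steps if the spare volume already carries the p1/p2 layout
gpart create -s gpt mfidX
gpart add -i 1 -b 128 -t freebsd-swap -s 2g mfidX
gpart add -i 2 -t freebsd-zfs mfidX
# the rawuuid of partition 2 is the gptid to hand to 'zpool replace'
gpart list mfidX | grep rawuuid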
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm gonna say this again, then I'm out of this conversation.. if you try to resilver there's a very high likelihood that you will have some kind of corruption on one of the other remaining disks in the vdevs, and there's a high chance it will crash ZFS. If this happens there's also a high chance that on reboot the pool will be labeled as corrupted and be unmountable. I'm not referring to the standard "RAIDZ1 is dead" argument either. This likelihood is exacerbated by the fact that you are doing hardware RAID. It caches writes and that can damage ZFS in unimaginable ways that will become evident as soon as you start touching the ZFS metadata.

So proceed at your own risk.. but for your own sake and that of your job, MAKE A BACKUP FIRST. I've watched people with hardware RAIDs go this route, and if you get to the resilvering you are NOT out of the woods.. the resilver is often the part that blows up in your face, and since you can't query all the necessary info from the RAID controller (because we don't include the necessary diagnostic tools), you'll just have to face the fact that ZFS calls the pool corrupt and it's gone.

Several exact scenarios like this one come to mind, but I'm not gonna go trying to look for them at 1AM. Sorry. :/
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
It always makes me sad when mr. jock backs away. I keep looking for your magic, man. :)

Couldn't agree more on the backup. Bottom line is you currently have no redundancy in either vdev, and a single device error from here is game over. After you've saved everything possible, personally, I'd still go for the fix even if it's unsuccessful. You'll learn something. But even if it works there isn't much upside... you are stuck with a poorly configured pool that is unreliable. It may buy you a little time but the pain will still come. Probably much faster to reconfigure, now that your hand has been forced.

Seems like we are talking about half the system? This is ~38 drives and we are looking at 16 (with the capacity of 6). Two 6TB drives would back this pool up.
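
For sizing that backup, a quick way to see how much actually has to come off the pool (standard commands, nothing system-specific):
Code:
zpool list nfs-vol2
zfs list -r nfs-vol2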

A couple of other things I'd be interested in: 'mfiutil show config' and the full output of 'zpool status' (are there other pools, like nfs-vol1?).

Sorry man, it isn't pretty. I'm not quick to back down from a fight and happy to help. But we can't control a failed resilver and there is a pretty high probability that the damage is already irrecoverable. So plan on that while you can access the data. I'd already be selling complete failure and the necessary disaster recovery scenario.

Tough gig to inherit. Excellent learning opportunity.
 
Joined
Jan 20, 2015
Messages
6
mjws00 said:
It always makes me sad when mr. jock backs away. I keep looking for your magic, man. :)

Couldn't agree more on the backup. Bottom line is you currently have no redundancy in either vdev, and a single device error from here is game over. After you've saved everything possible, personally, I'd still go for the fix even if it's unsuccessful. You'll learn something. But even if it works there isn't much upside... you are stuck with a poorly configured pool that is unreliable. It may buy you a little time but the pain will still come. Probably much faster to reconfigure, now that your hand has been forced.

Seems like we are talking about half the system? This is ~38 drives and we are looking at 16 (with the capacity of 6). Two 6TB drives would back this pool up.

A couple of other things I'd be interested in: 'mfiutil show config' and the full output of 'zpool status' (are there other pools, like nfs-vol1?).

Sorry man, it isn't pretty. I'm not quick to back down from a fight and happy to help. But we can't control a failed resilver and there is a pretty high probability that the damage is already irrecoverable. So plan on that while you can access the data. I'd already be selling complete failure and the necessary disaster recovery scenario.

Tough gig to inherit. Excellent learning opportunity.

We've got another NetApp shelf coming online later, just not yet. Management is aware and has accepted the state of the existing FreeNAS box. I understand the importance of backing up before trying to resilver, but that is unfortunately not an option. The choices are to let the pool limp along until we get more storage, or to try to repair it.

Code:
mfi0 Configuration: 17 arrays, 17 volumes, 4 spares
    array 0 of 2 drives:
        drive 12 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 13 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 1 of 2 drives:
        drive  6 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive  7 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 2 of 2 drives:
        drive  8 (   0.0) FAILED <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive MISSING 
    array 3 of 2 drives:
        drive 36 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 19 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 4 of 2 drives:
        drive 10 (   0.0) FAILED <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 20 (   0.0) FAILED <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 5 of 2 drives:
        drive 25 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 18 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 6 of 2 drives:
        drive 24 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 34 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 7 of 1 drives:
        drive 43 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 8 of 2 drives:
        drive 38 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 37 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 9 of 2 drives:
        drive 15 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 21 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 10 of 2 drives:
        drive 22 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 16 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 11 of 2 drives:
        drive 23 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 28 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 12 of 2 drives:
        drive 39 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 29 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 13 of 2 drives:
        drive 30 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 31 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 14 of 2 drives:
        drive 32 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 42 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 15 of 2 drives:
        drive 33 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 35 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    array 16 of 2 drives:
        drive 40 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
        drive 41 (   0.0) ONLINE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    volume mfid0 (1862G) RAID-1 64k OPTIMAL spans:
        array 0
    volume mfid1 (1862G) RAID-1 1M OPTIMAL spans:
        array 1
    volume 3 (1862G) RAID-1 1M OFFLINE spans:
        array 2
    volume mfid3 (1862G) RAID-1 1M OPTIMAL spans:
        array 3
    volume 5 (1862G) RAID-1 1M OFFLINE spans:
        array 4
    volume mfid5 (1862G) RAID-1 1M OPTIMAL spans:
        array 5
    volume mfid6 (1862G) RAID-1 1M OPTIMAL spans:
        array 6
    volume mfid7 (1862G) RAID-0 64k OPTIMAL spans:
        array 7
    volume mfid8 (1862G) RAID-1 64k OPTIMAL spans:
        array 8
    volume mfid9 (1862G) RAID-1 64k OPTIMAL spans:
        array 9
    volume mfid10 (1862G) RAID-1 64k OPTIMAL spans:
        array 10
    volume mfid11 (1862G) RAID-1 64k OPTIMAL spans:
        array 11
    volume mfid12 (1862G) RAID-1 64k OPTIMAL spans:
        array 12
    volume mfid13 (1862G) RAID-1 64k OPTIMAL spans:
        array 13
    volume mfid14 (1862G) RAID-1 64k OPTIMAL spans:
        array 14
    volume mfid15 (1862G) RAID-1 64k OPTIMAL spans:
        array 15
    volume mfid16 (1862G) RAID-1 64k OPTIMAL spans:
        array 16
    global spare  9 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    global spare 14 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    global spare 26 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA redacted-serial-number> SATA
    global spare 27 (   0.0) HOT SPARE <Hitachi HDS72202 A3MA redacted-serial-number> SATA


Frankly, we're at the point where if it's that much of a risk to resilver, we'll wait for our additional storage to come online, kill nfs-vol2, and create a backup from scratch on NetApp.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
There's no real way to evaluate the risk, imho. It's a different deal if you can move the data, or even if you have the luxury of messing with the storage subsystem and knowing first-hand how things respond. Assuming your new gear will come in a timely fashion, and management has already acknowledged that nfs-vol2 could fail at any time... better off to leave it than risk killing it early. At a client's there is no way I'd touch it without a solid recovery plan, and complete anticipation of failure. I'd literally grab a few drives and move the data, or drop in a temp box to host nfs-vol2.

I think you actually hit that bug I linked for you with the 'c'-named gptids. So we can definitely make it to a resilvering. It just isn't worth the risk if the replacement plan is already in motion.

If it were mine and the new gear and migration were already done, I'd sure be interested to see whether the recovery succeeds. But that is more curiosity than necessity.

Good luck, let us know how it shakes out.
 
Joined
Jan 20, 2015
Messages
6
mjws00 said:
There's no real way to evaluate the risk, imho. It's a different deal if you can move the data, or even if you have the luxury of messing with the storage subsystem and knowing first-hand how things respond. Assuming your new gear will come in a timely fashion, and management has already acknowledged that nfs-vol2 could fail at any time... better off to leave it than risk killing it early. At a client's there is no way I'd touch it without a solid recovery plan, and complete anticipation of failure. I'd literally grab a few drives and move the data, or drop in a temp box to host nfs-vol2.

I think you actually hit that bug I linked for you with the 'c'-named gptids. So we can definitely make it to a resilvering. It just isn't worth the risk if the replacement plan is already in motion.

If it were mine and the new gear and migration were already done, I'd sure be interested to see whether the recovery succeeds. But that is more curiosity than necessity.

Good luck, let us know how it shakes out.

Once the alternate storage is online, and the data is moved, I'll see how recoverable nfs-vol2 is. I'll try to remember to update this thread with details. Thanks again.
 

jkh

Guest
You could also buy a JBOD full of drives and create a new pool attached to the existing box, then migrate from evil-pool-A to happy-pool-B. This would be an order of magnitude cheaper than waiting for an entire NetApp system, which you're also going to have to migrate to over some sort of network connection, which is also going to be a constraint.
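
A minimal sketch of that kind of local pool-to-pool migration, assuming a new pool has already been created on the JBOD and borrowing the 'happy-pool-B' name above purely for illustration:
Code:
# full recursive copy of nfs-vol2 into the new pool
zfs snapshot -r nfs-vol2@migrate1
zfs send -R nfs-vol2@migrate1 | zfs receive -duF happy-pool-B
# later, a short incremental pass to pick up anything written during the first copy
zfs snapshot -r nfs-vol2@migrate2
zfs send -R -i nfs-vol2@migrate1 nfs-vol2@migrate2 | zfs receive -duF happy-pool-B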
 