Pool won't import, disks are now missing

Status
Not open for further replies.

nwest1

Cadet
Joined
Dec 16, 2013
Messages
9
Hello-
I'm having some issues with the storage on my FreeNAS node:
- Pool won't mount
- Looks like this is because gpart can't see two of the disks
- GPT partition corruption messages in dmesg for the two disks that are 'missing'

System details:
- I'm running the OS off of a thumb drive
- amd64, 6gb ram

Last few changes I've made:
- Backed up 8.3 settings
- Image upgrade from 8.3 to 9.1
- Imported old settings
- created plugin jail
- didn't like it, reimaged, reimported old settings
- zfs wouldn't import

I've read through quite a few of the potential fixes, but the few commands I have tried (gpart recover in particular) fail because gpart is unable to find the missing disks' geometry.

Here is the command output I've gathered to help you diagnose:

Code:
# uname -a
FreeBSD nas.home.local 9.1-STABLE FreeBSD 9.1-STABLE #0 r+16f6355: Tue Aug 27 00:38:40 PDT 2013    root@build.ixsystems.com:/tank/home/jkh/src/freenas/os-base/amd64/tank/home/jkh/src/freenas/FreeBSD/src/sys/FREENAS.amd64  amd64


Code:
# zpool import
   pool: tank
     id: 1000707064672
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:
 
tank                                            UNAVAIL  insufficient replicas
  raidz1-0                                      UNAVAIL  insufficient replicas
    gptid/3d72a75e-34cb-11e1-8b03-001d7da27a90  ONLINE
    gptid/3e125b84-34cb-11e1-8b03-001d7da27a90  ONLINE
    12189763471895637697                        UNAVAIL  cannot open
    5492254585232235844                         UNAVAIL  cannot open


Code:
# camcontrol devlist
<WDC WD20EARX-42R6B0 02.00A02>     at scbus1 target 0 lun 0 (ada0,pass0)
<WDC WD20EARX-42R6B0 02.00A02>     at scbus2 target 0 lun 0 (ada1,pass1)
<WDC WD20EARX-42R6B0 02.00A02>     at scbus3 target 0 lun 0 (ada2,pass2)
<WDC WD20EARX-42R6B0 02.00A02>     at scbus4 target 0 lun 0 (ada3,pass3)
<  PMAP>


Code:
# gpart show
=>        34  3907026988  ada0  GPT  (1.8T)
          34          94        - free -  (47k)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902832590     2  freebsd-zfs  (1.8T)
 
=>        34  3907029101  ada1  GPT  (1.8T)
          34          94        - free -  (47k)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834703     2  freebsd-zfs  (1.8T)
 
=>      63  15133185  da0  MBR  (7.2G)
        63   1930257    1  freebsd  [active]  (942M)
   1930320        63       - free -  (31k)
   1930383   1930257    2  freebsd  (942M)
   3860640      3024    3  freebsd  (1.5M)
   3863664     41328    4  freebsd  (20M)
   3904992  11228256       - free -  (5.4G)
 
=>      0  1930257  da0s1  BSD  (942M)
        0       16         - free -  (8.0k)
       16  1930241      1  !0  (942M)


Code:
# glabel status
                                      Name  Status  Components
gptid/3d72a75e-34cb-11e1-8b03-001d7da27a90     N/A  ada0p2
gptid/3e125b84-34cb-11e1-8b03-001d7da27a90     N/A  ada1p2
                             ufs/FreeNASs3     N/A  da0s3
                             ufs/FreeNASs4     N/A  da0s4
                            ufs/FreeNASs1a     N/A  da0s1a



Code:
# dmesg | grep ada2
ada2 at ata3 bus 0 scbus3 target 0 lun 0
ada2: <WDC WD20EARX-42R6B0 02.00A02> ATA-8 SATA 1.x device
ada2: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2: quirks=0x1<4K>
ada2: Previously was known as ad6
GEOM: ada2: the secondary GPT table is corrupt or invalid.
GEOM: ada2: using the primary only -- recovery suggested.
GEOM_MULTIPATH: ada2 added to disk2
GEOM_MULTIPATH: ada2 is now active path in disk2


Code:
# dmesg | grep ada3
ada3 at ata4 bus 0 scbus4 target 0 lun 0
ada3: <WDC WD20EARX-42R6B0 02.00A02> ATA-8 SATA 1.x device
ada3: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada3: quirks=0x1<4K>
ada3: Previously was known as ad8
GEOM: ada3: the secondary GPT table is corrupt or invalid.
GEOM: ada3: using the primary only -- recovery suggested.
GEOM_MULTIPATH: ada3 added to disk1
GEOM_MULTIPATH: ada3 is now active path in disk1


Full dmesg log: http://pastebin.com/cDjbLQcq

Thanks for getting this far.

Any help would be greatly appreciated.

Regards,
Nathan
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So you need to do a gpart recovery on disks ada2 and ada3. This is particularly disheartening because that error means data at the end of the disk was overwritten, which also means you may have corruption (or a permanently unmountable pool). Since you have a single disk's redundancy but 2 "bad disks", you might never see your data again.

Read this thread on how to do the gpart recovery on your disks. I gave a good step by step guide for doing this. ;)

I will warn you that since the primary is valid you are only restoring the backup partition table. Fixing this shouldn't make your pool mountable. But since you have nothing to lose right now you might as well do it.

Before doing the gpart recovery, though, I'd make sure you are on FreeNAS 9.1.1 64-bit.

Can you post your actual hardware specs? amd64 is vague and you didn't mention your motherboard or RAM.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
If all the disks are identical, then it may make sense to just clone the gpart config from one of the good disks to the corrupted ones. Basically something like: "gpart backup ada0 | gpart restore -F ada2 ada3".
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah.. not a good idea as your gptids will go bye-bye and you'll REALLY be in trouble.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Why? ZFS doesn't care about GPTIDs and if you then Auto Import the pool the GUI will be updated. I just tried it in a VM and it worked (I created a 4 drive RAIDZ, then cloned the partition tables and the pool imported fine with the new GPTIDs).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
ZFS does care about the gptids.. it does gptid sums to help it determine what disks are part of what vdev. let me find the command for it...
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Waiting... ;)
(Please also notice that the OP already tried gpart recover and it did not work/failed.)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Doh.. you're right. I'm confusing guid with gptid. Should drink some coffee before I post more. guids are summed and must match. ;)

I'd still use the disk's actual gptids. All you need is a disk with a different firmware that has some extra sectors and you just shot yourself in the foot. We've had people long ago that lost data from duplicating tables from one disk to another.

I'm just saying my way is the "safer" way to do it. And since I'm sure he's about to say "I have no backups" he should probably stick with the safer way.

I did see the gpart recover comment. Unfortunately info on using gpart recover is hard to come by. That's why I started that thread when I had problems. :)
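The sector-count worry above can be checked with plain arithmetic before any table is cloned. The sketch below is an illustration only: `table_fits` is a hypothetical helper, and it assumes a standard GPT layout in which the last 33 sectors of the disk hold the backup table and header (so the last usable sector is the sector count minus 34). The sample numbers in the usage note come from the gpart show and dmesg output in the first post.

```shell
#!/bin/sh
# Hypothetical helper: would the last partition of a source table still fit
# on a target disk?
#   $1 = start sector of the last partition (from gpart show)
#   $2 = size of that partition in sectors  (from gpart show)
#   $3 = total sector count of the target disk (from dmesg)
table_fits() {
    part_end=$(( $1 + $2 - 1 ))
    # Assumption: standard GPT, 32 sectors of entries plus 1 header sector
    # reserved at the end of the disk.
    last_usable=$(( $3 - 34 ))
    [ "$part_end" -le "$last_usable" ]
}
```

For example, `table_fits 4194432 3902832590 3907029168 && echo fits` checks ada0's ZFS partition against the 3907029168 sectors that dmesg reports for ada2. A target with fewer sectors than the source table expects makes the function return non-zero.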
 

nwest1

Cadet
Joined
Dec 16, 2013
Messages
9
Whoa Guys-
Thanks for jumping on this one. I really appreciate it.

A partial answer to the hardware (when I get home I can update with exact specs):
6GB Ram (no ECC, but I guess I'll be changing that!)

I did try and follow your post earlier. "gpart recover ada2" and "gpart recover ada3". I can get exact output of the commands when I get home, but from memory it was one of two: "unknown geometry" or "'ada2' invalid argument"

A few questions:
- "gpart show" doesn't return ada2 or ada3. But those show in dmesg as corrupted. What is gpart looking at that is different?
- Is the reason I'm getting errors from "gpart recover adaX" the same reason "gpart show" doesn't detect those disks?
- What is the disconnect between the "camcontrol devlist" and the gpart show command?
- Could you explain this comment? What tells you only the backup partition table is invalid? Why can't I just override the backup partition table? Why wouldn't fixing this make it mountable?
"I will warn you that since the primary is valid you are only restoring the backup partition table. Fixing this shouldn't make your pool mountable. But since you have nothing to lose right now you might as well do it."

I do have backups of the important things, but would prefer to recover it without a nuke and pave.

Would the consensus be:
- get exact details on hardware
- try the recover again
- check back here
- last resort: restore using gpart from one of the good disks


Thanks again-
Nathan
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
nwest1 said:
A few questions:
- "gpart show" doesn't return ada2 or ada3. But those show in dmesg as corrupted. What is gpart looking at that is different?
- Is the reason I'm getting errors from "gpart recover adaX" the same reason "gpart show" doesn't detect those disks?
- What is the disconnect between the "camcontrol devlist" and the gpart show command?

gpart show only lists a drive if it can find a partition table on that drive.
camcontrol devlist lists all devices known to the FreeBSD CAM (Common Access Method) subsystem (which handles disk drives).
So, in your case the OS knows about the drives (camcontrol), but gpart is not able to find any partitions.
Looking at your dmesg output I'm a bit puzzled by the multipath lines. Could you please provide output of gmultipath list?
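
The comparison between the two commands can be sketched as a small shell helper. This is only an illustration, not something that was run in this thread: the function name `disks_without_table` is hypothetical, and it assumes the output of each command has first been saved to a text file.

```shell
#!/bin/sh
# Hypothetical helper: list disks that CAM knows about but that have no
# partition table visible to gpart.
#   $1 = file holding saved `camcontrol devlist` output
#   $2 = file holding saved `gpart show` output
disks_without_table() {
    cam_file=$1
    gpart_file=$2
    # camcontrol prints peripheral names in parentheses, e.g. "(ada0,pass0)".
    # Keep the disk name and drop the passN passthrough device.
    sed -n 's/.*(\([^)]*\)).*/\1/p' "$cam_file" | tr ',' '\n' |
        grep -v '^pass' | sort -u |
    while read -r disk; do
        # gpart show opens each table with "=> <start> <size> <geom> <scheme>".
        if ! awk -v d="$disk" '$1 == "=>" && $4 == d { found = 1 }
                               END { exit !found }' "$gpart_file"; then
            echo "$disk"
        fi
    done
}
```

A possible way to use it: `camcontrol devlist > /tmp/cam.txt; gpart show > /tmp/gpart.txt; disks_without_table /tmp/cam.txt /tmp/gpart.txt`. Any disk it prints is visible to the OS but not to gpart.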
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
nwest1 said:
I did try and follow your post earlier. "gpart recover ada2" and "gpart recover ada3". I can get exact output of the commands when I get home, but from memory it was one of two: "unknown geometry" or "'ada2' invalid argument".

Yeah, if you could post the exact commands and their output so we can verify your commands are correct, that would be helpful. Just read that thread I linked above and follow that.

nwest1 said:
A few questions:
- "gpart show" doesn't return ada2 or ada3. But those show in dmesg as corrupted. What is gpart looking at that is different?

Not sure. Not going to worry about that at the moment. Keywords: at the moment.

nwest1 said:
- Is the reason I'm getting errors from "gpart recover adaX" the same reason "gpart show" doesn't detect those disks?

Not sure. :)

nwest1 said:
- What is the disconnect between the "camcontrol devlist" and the gpart show command?

camcontrol devlist lists the hard drives the system has available. gpart show lists disks that are available AND have a valid partition structure. The fact that it doesn't list 2 of your disks may be an indicator that you won't be able to recover from the "backup copy of the gpt". We shall see, depending on your commands and what they return.

nwest1 said:
- Could you explain this comment? What tells you only the backup partition table is invalid? Why can't I just override the backup partition table? Why wouldn't fixing this make it mountable?

GPT keeps 2 copies of the partition table for extra protection: one at the beginning of the disk and one at the end. If either one is damaged the OS should use the good one and give you the warning you are getting via dmesg. So the OS should be properly using the disks, and the warning is there to tell you that one copy is bad. Your gpart recover should fix that problem. But this won't do anything to make the pool mountable, since your OS should be using the good copy anyway.

Now, if your disk has no good gpt table (it's questionable whether it has one, and I'm a little sketchy on the specifics since gpart show doesn't list all 4 drives), you may have to try Dusan's recommendation: copy your gpt from another disk. But we should do that last, as it's hard to undo, and there's no reason we can't try my way first and then his if mine doesn't work. I'm all about being conservative before getting more liberal with other people's data.

The bad thing is that the partition table isn't normally written to. So whatever trashed one copy of your gpt likely damaged some of the data on the pool. Since you are on RAIDZ1 you may have some corrupt files. How many and which files won't be known until we get your pool to mount.


nwest1 said:
I do have backups of the important things, but would prefer to recover it without a nuke and pave.

Would the consensus be:
- get exact details on hardware
- try the recover again
- check back here
- last resort: restore using gpart from one of the good disks

Yes, that's kind of the plan. I don't really need your exact hardware to try to gpart recover.

But a RAM test might be a good idea at some point (www.memtest.org). If you do end up in a position where you could potentially mount the pool, I'd run a RAM test before actually mounting it, as a precaution. If you will be going to bed at some point, you can make the CD/USB, boot it up and let it run. The test takes some hours to complete, so running it overnight is easiest. Any errors mean we stop until those are resolved (and we probably kiss your pool goodbye).

If you'd like some one-on-one with this I can probably accommodate that. You'll need to install Teamviewer so you can share your screen(it bypasses firewalls so don't worry about that) and I'll need a phone number if in the USA or a skype contact. You can PM me if you want to do it this way.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Please try the gmultipath list first. MULTIPATH should not be active unless you are using fancy hardware. From the dmesg output you can see that it is grabbing ada2 & ada3 for some reason, possibly preventing gpart from seeing the drives.
Also, does the FreeNAS GUI show anything in Storage → Volumes → View Multipaths?
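
To see at a glance which disks the multipath class has claimed, the GEOM_MULTIPATH lines can be pulled out of the kernel log. The helper below is a minimal sketch: `multipath_members` is a hypothetical name, and it assumes the dmesg text has been saved to a file and that the messages look like the ones quoted in the first post.

```shell
#!/bin/sh
# Hypothetical helper: print "disk multipath-name" pairs found in dmesg text.
#   $1 = file holding saved `dmesg` output
multipath_members() {
    # Matches kernel lines such as "GEOM_MULTIPATH: ada2 added to disk2".
    sed -n 's/^GEOM_MULTIPATH: \([^ ]*\) added to \([^ ]*\)$/\1 \2/p' "$1"
}
```

For example, `dmesg > /tmp/dmesg.txt; multipath_members /tmp/dmesg.txt` prints one line per claimed disk, with the multipath name that gmultipath list would also show.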
 

nwest1

Cadet
Joined
Dec 16, 2013
Messages
9
First things first - thanks for the followup.

System Details:
Code:
hw.machine: amd64
hw.model: Intel(R) Pentium(R) Dual  CPU  E2180  @ 2.00GHz
hw.ncpu: 2
hw.machine_arch: amd64
 
6 gb ram
 
CPU: Intel(R) Pentium(R) Dual  CPU  E2180  @ 2.00GHz (2000.04-MHz K8-class CPU)
 
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs


Need anything else?

Recovery output:
Code:
# gpart show ada0
=>        34  3907026988  ada0  GPT  (1.8T)
          34          94        - free -  (47k)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  3902832590    2  freebsd-zfs  (1.8T)
 
# gpart show ada1
=>        34  3907029101  ada1  GPT  (1.8T)
          34          94        - free -  (47k)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  3902834703    2  freebsd-zfs  (1.8T)
 
# gpart show ada2
gpart: No such geom: ada2.
# gpart recover ada2
gpart: arg0 'ada2': Invalid argument


The path in the GUI: "Storage -> Volumes -> View Multipaths" doesn't exist.

As far as multipaths go, I'm not aware that I configured any.

Code:
# gmultipath list
Geom name: disk1
Type: AUTOMATIC
Mode: Active/Passive
UUID: f7350759-6605-11e3-a3c3-001d7da27a90
State: DEGRADED
Providers:
1. Name: multipath/disk1
  Mediasize: 2000398933504 (1.8T)
  Sectorsize: 512
  Stripesize: 4096
  Stripeoffset: 0
  Mode: r0w0e0
  State: DEGRADED
Consumers:
1. Name: ada3
  Mediasize: 2000398934016 (1.8T)
  Sectorsize: 512
  Stripesize: 4096
  Stripeoffset: 0
  Mode: r1w1e1
  State: ACTIVE
 
Geom name: disk2
Type: AUTOMATIC
Mode: Active/Passive
UUID: f4144963-66c6-11e3-ac3b-001d7da27a90
State: DEGRADED
Providers:
1. Name: multipath/disk2
  Mediasize: 2000398933504 (1.8T)
  Sectorsize: 512
  Stripesize: 4096
  Stripeoffset: 0
  Mode: r0w0e0
  State: DEGRADED
Consumers:
1. Name: ada2
  Mediasize: 2000398934016 (1.8T)
  Sectorsize: 512
  Stripesize: 4096
  Stripeoffset: 0
  Mode: r1w1e1
  State: ACTIVE


Hope this narrows down the issue. I'll set up that memtest and start it now.

Thanks again!
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
So far I have absolutely no idea how gmultipath got activated/created in your setup. However, I managed to reproduce your current state in a VM and then fix it.

I hope this helps:
Run these commands:
Code:
gmultipath destroy disk1
gmultipath destroy disk2
gpart recover ada2
gpart recover ada3

And then Auto Import the pool in the GUI.
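
For anyone nervous about typing these against real disks, the same steps can be wrapped so they are printed for review first. This is a sketch only: `run` and `release_and_recover` are hypothetical names, and the DRY_RUN switch is an assumption added for illustration. The disk/multipath pairing (disk1 holds ada3, disk2 holds ada2) is taken from the gmultipath list output above.

```shell
#!/bin/sh
# Hypothetical wrapper around the destroy/recover steps. With DRY_RUN=1
# (the default here) it only prints each command so it can be reviewed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# release_and_recover MULTIPATH_NAME DISK
release_and_recover() {
    # Stop on the first failure so gpart is not run against a disk that
    # the multipath geom is still holding.
    run gmultipath destroy "$1" || return 1
    run gpart recover "$2" || return 1
}
```

Calling `release_and_recover disk2 ada2` and `release_and_recover disk1 ada3` prints the four commands. Setting `DRY_RUN=0` in the environment makes the wrapper execute them instead.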

My test in a VM:

Code:
[root@freenas] ~# zpool import
   pool: tank
     id: 6029064710914875581
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:

tank                                            UNAVAIL  insufficient replicas
  raidz1-0                                      UNAVAIL  insufficient replicas
    gptid/1ea42d83-677a-11e3-bb11-0800278c9e58  ONLINE
    gptid/1eceda3a-677a-11e3-bb11-0800278c9e58  ONLINE
    7589345734897075041                         UNAVAIL  cannot open
    15983987099038280236                        UNAVAIL  cannot open
[root@freenas] ~# gmultipath status
           Name    Status  Components
multipath/disk2  DEGRADED  ada4 (ACTIVE)
multipath/disk1  DEGRADED  ada3 (ACTIVE)
[root@freenas] ~# gmultipath destroy disk1
[root@freenas] ~# gmultipath destroy disk2
[root@freenas] ~# gpart recover ada3
ada3 recovered
[root@freenas] ~# gpart recover ada4
ada4 recovered
[root@freenas] ~# zpool import
   pool: tank
     id: 6029064710914875581
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

tank                                            ONLINE
  raidz1-0                                      ONLINE
    gptid/1ea42d83-677a-11e3-bb11-0800278c9e58  ONLINE
    gptid/1eceda3a-677a-11e3-bb11-0800278c9e58  ONLINE
    gptid/1ef919a1-677a-11e3-bb11-0800278c9e58  ONLINE
    gptid/1f234a26-677a-11e3-bb11-0800278c9e58  ONLINE
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Wow. There is some dedication there Dusan! Good work!
 

nwest1

Cadet
Joined
Dec 16, 2013
Messages
9
Dusan said:
So far I have absolutely no idea how gmultipath got activated/created in your setup. However, I managed to reproduce your current state in a VM and then fix it.


I'm amazed. Also, confused as to how this multipath stuff happened.

When I get home tonight, I'll walk through these steps and report back.

If I may ask a few other questions about this:
- Did you reproduce it by creating volumes normally (web gui?), and then adding multipath on those two drives?
- Does the multipath config sit on top of the partition table? Or how do these things relate?

Thanks for working through this. I really appreciate it.
Nathan
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
nwest1 said:
- Did you reproduce it by creating volumes normally (web gui?), and then adding multipath on those two drives?

Yes, I created the pool in the GUI, exported it and then stomped over it by activating multipath.

nwest1 said:
- Does the multipath config sit on top of the partition table? Or how do these things relate?

This answer will be a bit longer :) :
It depends on how you configure it. gpart as well as gmultipath are GEOM classes (http://www.freebsd.org/doc/handbook/geom.html, http://www.freebsd.org/cgi/man.cgi?query=geom&sektion=4). GEOM classes are used to transform disk devices. Basically a geom consumes one or more "devices" and transforms them into one or more other "devices". For example, you can use gpart to "consume" a physical disk and "provide" several partitions, then use geli (consumes one device and provides one device) to encrypt one of the partitions and then use gmirror (consumes several devices to provide one device) to create a mirror out of two such encrypted devices. So, you normally "chain" various GEOM classes.

Normally you would first use gmultipath to construct a multipath and then gpart to partition it (it does not make much sense the other way around). What, it seems, somehow happened in your case is that the geoms were not chained but both "consumed" the same device. gpart stores the second copy of the partition table at the end of the drive. gmultipath uses the last sector to store its metadata and so it corrupted the second gpart table. Also, an active multipath protects its components and prevents you from accessing them directly.
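
The overlap described above can be cross-checked against the numbers already posted in this thread. The snippet below is just arithmetic on those figures, not a tool that touches the disks. The variable names are made up for illustration, and it relies on the GPT convention that the backup header sits in the very last sector of the disk.

```shell
#!/bin/sh
# Figures taken from the dmesg and gmultipath list output earlier in the thread.
sector_size=512
disk_sectors=3907029168          # ada2/ada3 sector count from dmesg
consumer_bytes=2000398934016     # ada2 Mediasize from gmultipath list
provider_bytes=2000398933504     # multipath/disk2 Mediasize

# Sectors are numbered from 0, so the last one is count - 1.
last_lba=$((disk_sectors - 1))

# Space the multipath geom keeps for itself at the end of the disk.
reserved_bytes=$((consumer_bytes - provider_bytes))

echo "last sector of the disk (backup GPT header lives here): LBA $last_lba"
echo "bytes reserved by gmultipath at the end of the disk: $reserved_bytes"
echo "sectors reserved: $((reserved_bytes / sector_size))"
```

The multipath provider being smaller than the raw disk by exactly one sector is consistent with the metadata landing on the same sector the backup GPT header uses.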
 

nwest1

Cadet
Joined
Dec 16, 2013
Messages
9
So far so good.

All the multipath destroy and gpart recover commands worked great.

Ran the auto-import, but it said the volume already exists.

Still had a yellow alert on the GUI, checked in more detail and it's resilvering one of the recovered disks.

Good thing it was just one of them.

I'll give a final (hopefully) update later with the results :)

Thanks for your time & detailed explanations.
 

nwest1

Cadet
Joined
Dec 16, 2013
Messages
9
Resilvering completed, however there are definitely errors on the two recovered disks.

Running a scrub on the filesystem now. At this rate, I hope it's done in a few days :p

I have a feeling I should be looking into RMAing these two and getting some fresh ones.

Code:
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Dec 19 17:59:40 2013
        52.3M scanned out of 7.10T at 2.28K/s, (scan is slow, no estimated time)
        0 repaired, 0.00% done
config:
 
NAME                                            STATE     READ WRITE CKSUM
tank                                            ONLINE       0     0     5
  raidz1-0                                      ONLINE       0     0    14
    gptid/3d72a75e-34cb-11e1-8b03-001d7da27a90  ONLINE       0     0     0
    gptid/3e125b84-34cb-11e1-8b03-001d7da27a90  ONLINE       0     0     0
    gptid/3eb2b461-34cb-11e1-8b03-001d7da27a90  ONLINE       5     0     0
    gptid/3f58ad31-34cb-11e1-8b03-001d7da27a90  ONLINE       0     0     7
 
errors: Permanent errors have been detected in the following files:
 
        <0x9c>:<0x1d22>
        <0x9c>:<0x1d24>
        <0x9c>:<0x1d29>
        <0x9c>:<0x1dc9>
        <0x9c>:<0x1df6>
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
nwest1 said:
Still had a yellow alert on the GUI, checked in more detail and it's resilvering one of the recovered disks.

The damage probably wasn't limited to the partition tables :(.

nwest1 said:
Running a scrub on the filesystem now. At this rate, I hope it's done in a few days :p

I have a feeling I should be looking into RMAing these two and getting some fresh ones.

At that speed it would take about 106 years to finish the scrub. Is anything else accessing the pool? Any CAM messages in /var/log/messages? Did you check the SMART status of the drives (smartctl -x <device>)?
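
For reference, the estimate follows from the two figures in the zpool status output (7.10T to scan at 2.28K/s). The sketch below assumes zpool's T and K are binary units (TiB and KiB) and a 365-day year. The variable names are only for illustration.

```shell
#!/bin/sh
# Rough scrub time estimate from the zpool status figures above.
# Values are scaled by 100 to stay in integers; the scaling cancels
# out in the division.
pool_centi_tib=710               # 7.10 TiB * 100
rate_centi_kib=228               # 2.28 KiB/s * 100
tib=1099511627776                # bytes per TiB (2^40)
kib=1024                         # bytes per KiB

seconds=$(( (pool_centi_tib * tib) / (rate_centi_kib * kib) ))
years=$(( seconds / (365 * 24 * 3600) ))

echo "estimated scrub time: $seconds seconds, about $years years"
```

The result comes out at roughly a century, in line with the figure quoted above.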
 