Simultaneous UNAVAIL and FAULTED disks

Status
Not open for further replies.

IanWorthington

Contributor
Joined
Sep 13, 2013
Messages
144
Whilst I was trying to hot mount a disk[1] on my FreeNAS box today one drive in my raid-z3 went UNAVAIL and another FAULTED with 77 write errors.

I'm guessing I must have fat-fingered a cable (or two?). I'm currently running smartctl -t long on both drives, -t short having found no errors, but, assuming the disks are ok, I'm unsure how to proceed after that to:

1. reintroduce the UNAVAIL disk back into the array: Do I simply REPLACE the volume with itself and allow it to resilver? (https://bugs.freenas.org/issues/1952 seems to suggest there's a potential problem here?)

2. deal with the FAULTED drive. Do I have to get it into the UNAVAIL state and then REPLACE it as in (1)?

Ian

[1] disk was NOT a zfs disk.
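
For anyone following along, here is a minimal sketch of how the result of the long test could be checked from the shell. It assumes an ATA-style self-test log (SAS drives report in a different format), and both the device name `/dev/da7` and the helper name `latest_selftest_ok` are made up for illustration.

```shell
# Hypothetical device name; substitute the real one (e.g. from `camcontrol devlist`).
DEV=/dev/da7

# Returns 0 if the newest entry ("# 1") of an ATA self-test log read from
# stdin says "Completed without error", non-zero otherwise.
latest_selftest_ok() {
    grep '^# *1 ' | grep -q 'Completed without error'
}

# Usage on the live box (not run here):
#   smartctl -t long "$DEV"        # start the extended self-test
#   smartctl -l selftest "$DEV" | latest_selftest_ok && echo "$DEV passed"
```

A passing self-test only says the drive itself looks healthy; it says nothing about cabling or the state of the pool.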
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So your pool shouldn't have gone offline. If it did then stop and post back because RAIDZ3 should protect you from up to 3 disk failures.

Assuming they pass their tests, you should shut down the system, plug the disks back in if they aren't already, boot the system up and do a scrub. This is why you don't replace disks online unless you have no choice. ;)

That bug ticket has nothing to do with your situation.
 

IanWorthington

Contributor
Joined
Sep 13, 2013
Messages
144
Hi CJ.

No, the pool's still working, just in degraded state:

Code:
ian@freenas:~ % sudo zpool status
  pool: VOLUME1
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 208K in 0h0m with 0 errors on Fri Aug  8 13:31:45 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        VOLUME1                                         DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/444302b9-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/44acaf47-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/458b61fe-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/45f04d30-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/46dd2963-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/47cdf0aa-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/48565317-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/48cd4928-da2a-11e3-90c3-002590878c66  FAULTED      0    77     0  too many errors
            gptid/4a58c8ac-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            8178567110075351471                         UNAVAIL      0     0     0  was /dev/gptid/4c508137-da2a-11e3-90c3-002590878c66
            gptid/4d50f8ba-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0

errors: No known data errors
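
With eleven devices in the listing it is easy to miss one, so here is a rough sketch of a filter that prints only the entries whose STATE column is not ONLINE. The function name `unhealthy_vdevs` is an invention for this example, and it assumes the column layout shown above.

```shell
# Read `zpool status` text on stdin and print "name state" for every entry
# in the config table whose STATE column is anything other than ONLINE.
unhealthy_vdevs() {
    awk '$1 == "NAME" && $2 == "STATE"      { in_cfg = 1; next }
         /^errors:/                          { in_cfg = 0 }
         in_cfg && NF >= 2 && $2 != "ONLINE" { print $1, $2 }'
}

# Usage on the live box (not run here):
#   zpool status VOLUME1 | unhealthy_vdevs
```

Against the output above this should list the pool, the raidz3-0 vdev, the FAULTED disk and the UNAVAIL one.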


I've rebooted but those volumes are still marked FAULTED and UNAVAIL. Do I need to do the ZPOOL CLEAR to get them back ONLINE before running the scrub?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, zpool clear will zero out the errors listed, but that's it.

So have the short and long tests completed?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Then I'd say you need to pull out the two disks, zero them on another computer, then put them back in FreeNAS and do disk replacements one at a time, waiting for each to resilver before doing the next.
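
As a rough sketch of the "one at a time" part: the helpers below poll `zpool status` until it stops reporting an active resilver. FreeNAS normally drives replacements from its web UI, so treat the command-line form as an assumption about how you might do it by hand; the function names and the device placeholders are hypothetical.

```shell
POOL=VOLUME1   # pool name taken from the thread

# Returns 0 while the status text on stdin reports an active resilver.
resilver_running() {
    grep -q 'resilver in progress'
}

# Block until the pool no longer reports a resilver, polling once a minute.
wait_for_resilver() {
    while zpool status "$POOL" | resilver_running; do
        sleep 60
    done
}

# Intended order (not run here; the device arguments are placeholders):
#   zpool replace "$POOL" <old-device-or-guid> <new-device>
#   wait_for_resilver
#   ...and only then repeat for the second disk.
```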

Just like the thread I saw before this one, you shouldn't be doing stuff hot unless you've spent the money for enterprise-class everything. This kind of thing can *kill* a pool if enough disks are dropped.
 

IanWorthington

Contributor
Joined
Sep 13, 2013
Messages
144
Well, this is a bit odd... I powered down the box again to relocate the UNAVAIL disk back to its original physical SAS slot before attempting the above. After I started it up again everything suddenly seems to have fixed itself:

Code:
ian@freenas:~ % zpool status -x
  pool: VOLUME1
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 11.0M in 0h0m with 0 errors on Sun Aug 10 13:07:16 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        VOLUME1                                         ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/444302b9-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/44acaf47-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/458b61fe-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/45f04d30-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/46dd2963-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/47cdf0aa-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/48565317-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/48cd4928-da2a-11e3-90c3-002590878c66  ONLINE       0     0     2
            gptid/4a58c8ac-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0
            gptid/4c508137-da2a-11e3-90c3-002590878c66  ONLINE       0     0     1
            gptid/4d50f8ba-da2a-11e3-90c3-002590878c66  ONLINE       0     0     0

errors: No known data errors


I've started a scrub but that usually takes a full day or so to complete.

In the meantime I'm a bit confused:

1. Why should relocating the UNAVAIL device make FreeNAS able to sort itself out? I thought it didn't care about the logical address?

2. Is it likely that the 11.0M resilver referred to above is the lost writes to the UNAVAIL and/or FAULTED drives? Any link to how this works would be appreciated.

3. If the SCRUB detects any additional data errors, it will RESILVER those, so if my pool is flagged as healthy afterwards then I'm ok, correct?
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
If the drives dropped off the system live, simply rebooting will usually bring them back into the pool. And since the system knew what 'time' (txg, more specifically) the drive(s) dropped, once they 'come back' it simply has to resilver what has changed since then. Basically it 'catches up' the drive(s). That's how I understand it anyway. If the scrub does end up fixing things, you'll see it in zpool status.

I previously had a FreeNAS box that ended up having flaky SATA ports / controllers. In the middle of a backup I'd lose two drives. I'd get 'device disconnected' kernel messages. I'd wait for the backup to finish, reboot the box, and it would see the two dropped drives again and 'catch them up'. The amount it reported as resilvered was approximately the difference in data written to the pool between when the drives dropped and when I rebooted. I'd run a scrub just to be sure, but it never found anything.
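
If you want to watch that 'catch up' figure yourself, a small sketch like this pulls the resilvered amount out of the scan line. The function name `resilvered_amount` is made up, and it assumes the "scan: resilvered ... in ..." wording seen in the outputs earlier in this thread.

```shell
# Read `zpool status` text on stdin and print the amount from a completed
# resilver's scan line, e.g. "11.0M". Prints nothing if there is no such line.
resilvered_amount() {
    sed -n 's/^ *scan: resilvered \([^ ]*\) in .*/\1/p'
}

# Usage on the live box (not run here):
#   zpool status VOLUME1 | resilvered_amount
```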

On one of my recent NASes I tried to switch to what I thought were better SFF-8087 breakout cables. I hooked them up, started the box, and it had an offline drive. I don't remember what else I was doing, but I ended up losing (device disconnected) more drives than I had redundancy. The pool simply went to 'faulted' or something. I rebooted, it found all the drives again, resilvered a little bit, and a subsequent scrub didn't even show anything wrong. I have my doubts as to whether hardware RAID would have handled that as well as ZFS did. I've been extremely impressed at how resilient it is. Needless to say, I pulled those new cables out and went back to what works.

Yes, once the scrub finishes, I'd do a 'zpool clear' to zero out the checksum errors, and go from there.
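
A minimal sketch of that last step, assuming the same column layout as the status output earlier in the thread: note which devices carry a non-zero CKSUM count before clearing, so you have a record if the same disk acts up again. The function name `cksum_errors` is hypothetical.

```shell
# Read `zpool status` text on stdin and print "name count" for every entry
# in the config table whose CKSUM column (fifth field) is non-zero.
cksum_errors() {
    awk '$1 == "NAME" && $2 == "STATE"   { in_cfg = 1; next }
         /^errors:/                       { in_cfg = 0 }
         in_cfg && NF >= 5 && $5 + 0 > 0  { print $1, $5 }'
}

# Usage on the live box once the scrub has finished (not run here):
#   zpool status VOLUME1 | cksum_errors   # record what is about to be zeroed
#   zpool clear VOLUME1                   # reset the error counters
```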
 