Failed drive within Zpool

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
Wow -- something cool happened to me today. A drive failed within my RAIDZ2 zpool, and the hot spare seems to have been activated:
Code:
freenas% zpool status
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:54 with 0 errors on Tue Feb 25 03:47:54 2020
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        ada0p2  ONLINE       0     0     0
        ada1p2  ONLINE       0     0     0
errors: No known data errors

  pool: tank
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: resilvered 981G in 0 days 05:43:10 with 0 errors on Mon Mar  2 10:12:25 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    tank                                              DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        spare-0                                       DEGRADED     0     0     0
          gptid/2e48e04a-d2f0-11e6-8e60-0cc47a84a594  FAULTED      6     5     0  too many errors
          gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594  ONLINE       0     0     0
        gptid/2eff3431-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/2fad6079-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/305f9785-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/310fd248-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/31c62952-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/32845d1c-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
        gptid/3338ea10-d2f0-11e6-8e60-0cc47a84a594    ONLINE       0     0     0
    cache
      gptid/3385f967-d2f0-11e6-8e60-0cc47a84a594      ONLINE       0     0     0
    spares
      3314618157351433518                             INUSE     was /dev/gptid/a203d5ad-49e3-11ea-9739-0cc47a84a594

errors: No known data errors


Hmm - so that's now two WD Red Drives down in the last month. WTF??!!

Anyway, the main zpool is physically eight 6TB drives. How do I physically identify the drive that is faulted so I can replace it?



Screen Shot 2020-03-02 at 11.24.25 PM.png


So it looks like it's the da0p2 disk, which would correspond to:
Code:
#geom disk list
...
...
Geom name: da0
Providers:
1. Name: da0
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e5
   descr: ATA WDC WD60EFRX-68L
   lunid: 50014ee2b894b316
   ident: WD-WX11D86KC68Y
   rotationrate: 5700
   fwsectors: 63
   fwheads: 255
...
...


So I'm guessing find whatever drive has serial number WD-WX11D86KC68Y?
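For anyone doing this from the CLI rather than from the GUI screenshot above, I think you can confirm which daX device the faulted gptid maps to with glabel before chasing serial numbers - a rough sketch, grepping for the start of the faulted gptid from the zpool status output:
Code:
# Show which partition carries the faulted gptid label (should point at da0p2 here)
glabel status | grep 2e48e04a

From there, geom disk list on that device gives the serial as above.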
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
So I'm guessing find whatever drive has serial number WD-WX11D86KC68Y?
Yes, that's the right approach.

Hmm - so that's now two WD Red Drives down in the last month.
You might want to consider the power supply or cabling, since drives can be faulted if they become unavailable to the OS even for a short time.
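It's also worth checking whether the drive itself is logging SMART problems or whether it just dropped off the bus. Something along these lines, assuming the faulted drive is still visible to the OS as da0:
Code:
# Quick pass/fail health check on the suspect drive
smartctl -H /dev/da0

# Full attributes and error log - look at Reallocated_Sector_Ct,
# Current_Pending_Sector and the ATA error log section
smartctl -a /dev/da0

If SMART comes back clean, cabling or power becomes the more likely suspect.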
 

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
@sretalla
I replaced all the cabling just last week. I didn't replace the power supply (yet!!).

So, just to save time with the replacement process: can I just promote my spare to the main pool and then add my new drive as the hot spare? I'm not sure how to tell FreeNAS to accomplish this. It seems my spare is still listed as spare-0.

Another possibility would be to replace the bad drive with the new drive, let things resilver, and then put the spare drive back into "hot spare" designation. Again, I'm not sure of the specifics of how to do this either. Do I need to reformat the spare drive in this case?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Can I just promote my spare to the main pool and then add my new drive as the hot spare? I'm not sure how to tell FreeNAS to accomplish this. It seems my spare is still listed as spare-0.
Sounds like you're asking about example 4-10 in this article: https://docs.oracle.com/cd/E19253-01/819-5461/gcvcw/index.html

You can use the GUI to wipe the new drive before adding it as a spare... I'm not sure if that's actually needed though.
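The general idea in that example, if you go that route, is to detach the faulted disk so the activated spare is promoted to a permanent pool member, then add your new drive as the spare. A rough sketch using the gptid from your output (the new drive's gptid is just a placeholder, and the GUI/middleware is generally the preferred way to do pool surgery on FreeNAS):
Code:
# Detach the faulted disk; the activated spare then becomes a permanent member
zpool detach tank gptid/2e48e04a-d2f0-11e6-8e60-0cc47a84a594

# Add the replacement drive as the new hot spare (placeholder gptid)
zpool add tank spare /dev/gptid/<new-drive-gptid>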
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
You don't need to wipe the spare manually; FreeNAS does the necessary work when adding a drive as a spare. And after use, it returns the drive to spare status once you go through the Replace operation with a newly inserted drive. That's how I've been handling ejected drives and it works fine.

That said, over in https://www.ixsystems.com/community/threads/what-are-spare-drives-in-pools-useful-for.82778/ you can see that I'm somewhat bemused by the actions of spares in FreeNAS; there are probably a few nuggets of info that will interest you there.

Perhaps you're having the same possibly-spurious disk ejections as I am - I've built up a stack of ejected disks, some truly dead/badblocked, but many of them appear to be perfectly fine. Thread over at https://www.ixsystems.com/community...-or-sas-disk-shelf-failing.82205/#post-569531

(And many thanks to @sretalla for being my only correspondent on all these issues! Hopefully we'll get to the bottom of things eventually.)
 

KevDog

Patron
Joined
Nov 26, 2016
Messages
462
OK, I wanted to give a quick follow-up for anyone who runs into this type of problem. It's not as scary as it seems. I ran a degraded RAIDZ2 configuration for about a week with a hot spare that FreeNAS had activated automatically, which then kicked off a resilver after the swap. Pretty cool to actually see things work like they're supposed to.

As far as replacing the drive goes, you'll need to visit the Storage -> Pools -> Pool Status page, which can be found as shown in the picture below:
Screen Shot 2020-03-20 at 11.41.33 PM.png

Once you switch to the Status page, you'll see the status of your drives:


Screen Shot 2020-03-02 at 11.24.25 PM.png


In my pool above, you can see that da0p2 is the drive that is faulted. Unfortunately, for replacing the drive, this designation needs to be cross-referenced with the drive's serial number. This is best done with the "geom disk list" command and then examining the output for the drive's designation.

A snippet of the geom disk list output is shown below:
Code:
#geom disk list
...
...
Geom name: da0
Providers:
1. Name: da0
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e5
   descr: ATA WDC WD60EFRX-68L
   lunid: 50014ee2b894b316
   ident: WD-WX11D86KC68Y
   rotationrate: 5700
   fwsectors: 63
   fwheads: 255


The ident field lists the serial number.
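If you don't want to scroll through the whole listing, you can narrow it down to just the suspect disk - something like:
Code:
# Show only da0 and pull out the serial number (ident) line
geom disk list da0 | grep ident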

First, take the disk offline from the GUI:
Screen Shot 2020-03-20 at 11.50.48 PM.png


Once the disk is offline, power down FreeNAS, pull the power cord, and go hunting for the drive with the correct serial number. Luckily, I had put stickers on the sides of my drives with the serial numbers printed on them, so the identification was fairly easy. I recommend doing this to save time in case of a drive failure.
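(For reference, I believe the CLI equivalent of the GUI's Offline button is just the standard zpool command below, using the faulted member's gptid from zpool status - although the GUI is the recommended way on FreeNAS.)
Code:
# Take the faulted member offline before physically pulling it
zpool offline tank gptid/2e48e04a-d2f0-11e6-8e60-0cc47a84a594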

Once the drive is replaced, reboot back into FreeNAS and revisit the menu above. The new drive should be identified; put the drive online and then select the Replace option. The identifier da0 was displayed in the GUI, and this was the drive I was attempting to replace. I hit return and then didn't see any indication in the GUI of what was actually happening (possibly this could be an area of improvement).
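(Again for reference, I believe the GUI's Replace option corresponds to a standard zpool replace under the hood - something like the sketch below, where the new drive's gptid is a placeholder that the GUI normally creates and fills in for you.)
Code:
# Replace the faulted member with the new drive; this kicks off the resilver
zpool replace tank gptid/2e48e04a-d2f0-11e6-8e60-0cc47a84a594 /dev/gptid/<new-drive-gptid>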

Anyway, from the command line I typed zpool status and this gave an update of what was happening. The resilver process had been initiated; the drive isn't going to be listed as available until the process is completed:
Screen Shot 2020-03-19 at 7.16.07 PM.png


The picture above shows the progress of the resilver -- 7.57% completed, with a long way to go.

Once completed, the status appeared as:
Screen Shot 2020-03-20 at 11.59.42 PM.png


Interestingly, the hot spare drive was automatically removed from the pool and placed back into spare status. I would have been OK with the new drive being designated as the new hot spare, keeping the old spare as part of the pool; however, that's not what automatically transpired.

Anyway, I hope this helps someone. The replacement process was pretty quick, and it's not all that scary either.
 
