Help replacing failed drive, confusing status display

oldnewby

Cadet
Joined
Dec 11, 2022
Messages
3
My FreeNAS 11.3-U5 system with a 5-drive RAIDZ2 has changed to DEGRADED state, but no drives are marked failed. However, one drive did fail sometime in the last few months (yes, I should pay more attention to it). It failed by powering itself off and refusing to power on, so it is just not there at all. I think it was the hot spare (yes, I have since read the posts which suggest a hot spare is not a good idea), not a data drive, but I am not sure. Here is the zpool status output:

Code:
zpool status -x
  pool: MainRaid6
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 256K in 0 days 10:48:53 with 0 errors on Sun Nov 27 10:48:56 2022
config:

        NAME                                              STATE     READ WRITE CKSUM
        MainRaid6                                         DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/613abf2f-ad3f-11e7-8e64-fcaa1427b384    ONLINE       0     0     0
            gptid/61e338a2-ad3f-11e7-8e64-fcaa1427b384    ONLINE       0     0     0
            gptid/62a5a502-ad3f-11e7-8e64-fcaa1427b384    ONLINE       0     0     0
            gptid/63652114-ad3f-11e7-8e64-fcaa1427b384    ONLINE       0     0     0
            spare-4                                       DEGRADED     0     0     0
              3550455103967566402                         UNAVAIL      0     0     0  was /dev/gptid/64276aca-ad3f-11e7-8e64-fcaa1427b384
              gptid/70ea7543-eece-11ea-9987-fcaa1427b384  ONLINE       0     0     0
            gptid/64e4d84c-ad3f-11e7-8e64-fcaa1427b384    ONLINE       0     0     0
        spares
          16295930666946285623                            INUSE     was /dev/gptid/70ea7543-eece-11ea-9987-fcaa1427b384

errors: No known data errors


and the GUI shows:
[screenshot attached: Untitled.png]

I have since unplugged the failed drive and put in a new one, which I erased. I suspect what I need to do now is simply a replace, but the GUI display with ada0p2 listed twice is confusing me, as is the "INUSE" output from zpool status. The suggested action from zpool status is impossible, since the missing device is totally dead. Thank you very much for any assistance.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Your array is degraded because a component is in a degraded state.

One drive in the RAIDZ2 became "unavailable", meaning ZFS could not see it.

Code:
            spare-4                                       DEGRADED     0     0     0
              3550455103967566402                         UNAVAIL      0     0     0  was /dev/gptid/64276aca-ad3f-11e7-8e64-fcaa1427b384

TrueNAS saw that you had a spare drive available and initiated an automatic replacement operation. This operation looks like it completed successfully.

Code:
            spare-4                                       DEGRADED     0     0     0
              3550455103967566402                         UNAVAIL      0     0     0  was /dev/gptid/64276aca-ad3f-11e7-8e64-fcaa1427b384
              gptid/70ea7543-eece-11ea-9987-fcaa1427b384  ONLINE       0     0     0


That drive came from your spares pool (which only had the one spare), so it is now tagged as INUSE even though it is still listed in the spares pool.

Code:
        spares
          16295930666946285623                            INUSE     was /dev/gptid/70ea7543-eece-11ea-9987-fcaa1427b384


Because the replaced drive isn't automatically detached upon replacement (which could cause various problems if this was not successful), the system is waiting for you to do ... something ... to remove the unavailable drive and return that disk to a happy state. I don't remember the exact thing that you're supposed to do here. It's presumably in the manual, or maybe some kind forum member who's had a drive failure in recent memory can tell us. It's probably "remove" or something like that.

Once you do that, ZFS should return to its happy ONLINE state. Which you're already sorta in, except that ZFS is seeing that UNAVAIL drive and it Does Not Like That.

The suggested action from zpool status is impossible since the missing device is totally dead.

I agree this is a little confusing. What's happened here is that the designers of ZFS assumed that a sysadmin would be manually replacing a failed drive and would be aware of the steps involved if the drive was indeed failed. In such a case, you would ignore the conservative zpool status advice and proceed right to drive replacement (the need for which is unknowable to the system). Instead, TrueNAS noticed the failed drive and initiated a drive replacement on your behalf. ZFS still has no way of knowing the drive is truly toast, and still offers the conservative advice, but meanwhile TrueNAS is replacing the device since that option is available to it. Make sense?
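For illustration, a purely manual replacement at the CLI would look roughly like the sketch below (the new disk's gptid is a placeholder; on FreeNAS/TrueNAS you would normally let the GUI drive the replacement, since it also handles the partitioning and gptid labels for you).

Code:
# Sketch only: hand-driven ZFS replacement of a dead disk.
# The dead device can be referenced by the numeric guid zpool status shows for it.
zpool replace MainRaid6 3550455103967566402 gptid/NEW-PARTITION-UUID

# Then watch the resilver until it finishes:
zpool status MainRaid6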

I have since read the posts which suggest a hot spare is not a good idea

A hot spare is a great idea... in certain cases. For example, an 11-drive RAIDZ3 in a 12-bay chassis means that a single drive failure will always be acted on immediately. A bunch of RAIDZ2's with a single spare shared amongst them is also very efficient.
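To make that concrete, a shared spare is added to the pool once and ZFS will grab it for whichever vdev loses a disk. Roughly (sketch only, the device name is a placeholder; the TrueNAS GUI can do the same thing):

Code:
# Sketch only: add one hot spare that is shared by every vdev in the pool.
zpool add poolname spare gptid/SPARE-PARTITION-UUID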
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I don't remember the exact thing that you're supposed to do here. It's presumably in the manual, or maybe some kind forum member who's had a drive failure in recent memory can tell us. It's probably "remove" or something like that.
Code:
zpool detach poolname <drive to detach>

(That would usually be the gptid of the spare if you want it to go back to being a spare.)

You could also have just detached the faulted disk rather than replacing it, which permanently co-opts the spare into the pool-member role; then add another drive as a new spare.
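Roughly, using the guid from your zpool status output (sketch only; the new spare's device name is a placeholder, and the TrueNAS GUI is usually the safer way to add it):

Code:
# Detach the dead disk by its guid; the in-use spare then becomes a
# permanent member of raidz2-0 and drops off the spares list.
zpool detach MainRaid6 3550455103967566402

# Add the fresh disk back as the new hot spare (placeholder device name).
zpool add MainRaid6 spare gptid/NEW-SPARE-UUID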
 

oldnewby

Cadet
Joined
Dec 11, 2022
Messages
3
Thank you jgreco, your explanation makes sense to me, is very helpful, and answered all my questions, except the one you said you didn't remember, of course.

Likewise, thank you sretalla. But you leave me with a further quandary. From what I read in the manual, I think the <drive to detach> should be the gptid of the UNAVAIL failed drive, which in this case is /dev/gptid/64276aca-ad3f-11e7-8e64-fcaa1427b384. Then I would want to remove the old spare and add my new drive as a new hot spare.

BTW, I was planning to upgrade to TrueNAS CORE 13.0 but it seemed to me I should get my array into a good state first. I also updated my signature because I realized I counted wrong and the array contains 6 disks, not 5.
 

oldnewby

Cadet
Joined
Dec 11, 2022
Messages
3
So I did what sretalla said and detached the faulted disk, i.e.
Code:
zpool detach MainRaid6 3550455103967566402

and now my pool is healthy. Thank you!!!
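For anyone else hitting this thread, re-running the same status check is the quick way to confirm the fix took (sketch):

Code:
zpool status -x
# Expected once everything is resolved: "all pools are healthy"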
 