Remove a failed disk from a boot-pool mirror

Mark Stega

Dabbler
Joined
Dec 29, 2014
Messages
24
On SCALE 22.12.3.1 I have a mirrored boot pool with two SSDs; one of the two is now marked as 'Faulted' and shows a number of read and write errors. In the boot pool status screen, the triple-dot menu for the faulted drive only shows a 'Replace' option. I can't do that, as I am out of SATA ports. If I remember correctly, CORE had the option to detach (or was it remove?) a drive. How do I do this with SCALE?

[EDIT] As soon as I posted this I found a reference to https://www.truenas.com/community/threads/how-to-replace-disk-in-mirrored-boot-pool.108123/

That solution doesn't work for me due to the lack of SATA ports.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @Mark Stega

Seems as if there's a bug in SCALE that isn't allowing removal/detach of boot-pool devices.

Do you have a replacement drive you can use now, or are you just seeking to remove the one failed device so that it isn't showing as FAULTED?
 

Mark Stega

Dabbler
Joined
Dec 29, 2014
Messages
24
No replacement available right now, so I'd just like to remove the drive. And even with a replacement (which should be here tomorrow), I can't do a 'Replace' since I am out of SATA ports.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi Mark,

For now, we'll need to dig into the command line a little bit. This is easiest done over SSH, which you can set up by going to System Settings -> Services, clicking the pencil beside SSH, and enabling Log in as Admin with Password. Save the settings, return to the previous screen, and toggle the Running status to On. From there you can use any SSH-compatible client (e.g. PuTTY under Windows) to connect to TrueNAS as admin with your password.
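If it helps, here's what the connection looks like from a Linux or macOS terminal; the address below is just a placeholder, so substitute your NAS's hostname or IP:

Code:
ssh admin@192.168.1.100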

I've simulated a failure on my side here to make it easier to follow along. I'm assuming that there's no hot-swap ability on your system for safety's sake.

Start with sudo zpool status boot-pool

Code:
admin@scale01[~]$ sudo zpool status boot-pool
[sudo] password for admin:
  pool: boot-pool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 6.87G in 00:02:24 with 0 errors on Tue Jul  4 15:29:54 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda3    ONLINE       0     0     0
            sde3    REMOVED      0     0     0

errors: No known data errors


In my example, the "healthy" drive is sda3 and the other, sde3, is the failed/removed one. Look at the SMART output of your "healthy" drive to find the serial number (nb: "Serial" is case-sensitive in grep; try lowercase if you get no results)

sudo smartctl -a /dev/sda3 | grep Serial

Code:
admin@scale01[~]$ sudo smartctl -a /dev/sda3 | grep Serial
Serial number:        6000c295f0d47a00989c983cd95b1d25


Since your device is "still alive, but faulted," you can also look at the same value for your FAULTED device. Make a note of these numbers for later, when you're physically removing the failed device.
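For instance, assuming the faulted device shows up as sde as in my simulated output, the same grep works; the -i flag makes the match case-insensitive so either capitalisation is caught:

Code:
sudo smartctl -a /dev/sde3 | grep -i serial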

Now, let's logically remove the failed device from the boot pool with sudo zpool detach boot-pool FAULTED_DEVICE_ID

Code:
admin@scale01[~]$ sudo zpool detach boot-pool sde3
admin@scale01[~]$ sudo zpool status -v boot-pool
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 6.87G in 00:02:24 with 0 errors on Tue Jul  4 15:29:54 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors


From here, shut down TrueNAS from the UI, locate your boot devices, and remove the one with the serial number matching the FAULTED device.
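If you'd like to double-check which physical device carries which serial number before powering off, lsblk can print them side by side (a generic sketch, not specific to any particular system):

Code:
sudo lsblk -o NAME,MODEL,SERIAL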

Once that's done, you can install the new blank unit, boot up, and ATTACH from within the boot pool status page.
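If you want to keep an eye on the resilver onto the new device afterwards, you can re-run the status command from the same SSH session, or wrap it in watch (assuming watch is present, as it typically is on SCALE's Debian base):

Code:
sudo watch -n 5 zpool status boot-pool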
 

Mark Stega

Dabbler
Joined
Dec 29, 2014
Messages
24
@HoneyBadger

Thanks for the detailed instructions; I am away from my office for a couple of days and will follow them when I get back. It looks pretty straightforward.
 

Mark Stega

Dabbler
Joined
Dec 29, 2014
Messages
24
@HoneyBadger

Bizarre -- I just started my NAS to remove the failed drive and the boot pool now shows a functional mirror with no errors. I'll keep a bookmark to your directions should I need them in the future.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hopefully, if you do find a failed device again, it's far enough in the future that I'll have submitted a PR to let you remove the device from the UI. :wink:

I'd suggest pulling the SMART data from your boot device(s) to ensure they aren't logging any unexpected faults. Maybe also check into the cabling and the condition of the connectors on both ends, looking for any signs of fraying or bends in the cables and corrosion/oxidization on the connectors. Those can show up as "non-media errors" in SMART data, or as CKSUM (checksum) errors in a pool.
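As a rough sketch of that check (assuming the boot devices show up as /dev/sda and /dev/sdb; adjust to your layout), the extended SMART report can be skimmed for error-related lines:

Code:
sudo smartctl -x /dev/sda | grep -iE 'error|fail'
sudo smartctl -x /dev/sdb | grep -iE 'error|fail'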
 

jr-m4

Cadet
Joined
Jan 11, 2024
Messages
1
Hi,

Sorry-NotSorry for necroposting this one, but a heads-up: it is still not possible to remove a disk from the boot-pool via the GUI, even in Dragonfish-24.04-RC.1.

I did get it solved through the CLI by following your instructions.
Thanks :smile:
 