Disk offlining in Scale 22.02.4 does not "offline" it

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
Dear team and community,
the dreaded moment in every data hoarder's life has finally arrived for me, on TrueNAS SCALE 22.02.4. In my HPE Microserver G10+ one of the drives kept failing the read phase of SMART tests (failure at 90% remaining, 8 unreadable sectors), and as a result TrueNAS marked the device as FAULTED and the pool as DEGRADED:
pool_status_webUI.png

The same is shown in the TrueNAS CLI:
pool_status_tnCLI.png

and directly in the zpool output:
pool_status_CLI.png
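For reference, the same can be checked from a shell, roughly like this (just a sketch; /dev/sda and the pool name moria are what applies to my box):

    # SMART state of the failing drive
    smartctl -a /dev/sda
    # pool/vdev state as reported by ZFS itself
    zpool status -v moria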

However, since I already have a replacement drive on hand, I followed the Docs Hub instructions for drive replacement (https://www.truenas.com/docs/scale/scaletutorials/storage/disks/replacingdisks/) and went the usual/expected route: the vertical ellipsis menu of the drive -> Offline:
offline.png

So far so good, but this does not seem to do anything: a pop-up pane with a spinning "Please wait..." wheel shows for a moment (~2 seconds), then disappears without any message in the web UI and without the drive actually being set to OFFLINE. The Docs Hub entry suggests running a scrub in case offlining fails with an error (I haven't seen one yet), so I did, as shown in the web UI pool pane, but the action/command still has no effect.
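For completeness, the shell-level equivalent I would expect to work looks roughly like this (a sketch only; the member name or GUID has to be copied exactly from what zpool status prints for the faulted drive):

    # try to offline the faulted member directly via ZFS
    zpool offline moria <faulted-member-name-or-guid>
    # the Docs Hub fallback: scrub the pool and retry
    zpool scrub moria
    zpool status moria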
What's more, upon OS installation TrueNAS automatically created a SWAP partition on this drive, 2GB in size (actually across all the drives in the storage pool used for system_dataset; I think this might have something to do with the "system dataset" option during installation, where one could pick either the boot-pool or one of the storage pools):
lsblk.png

fdisk.png

gdisk.png

swap.png

I don't think this could be impacting the disk offlining in the moria pool directly, since the pool uses the sda[1] partition while sda[0] is the SWAP partition set up by TrueNAS, but it might still be problematic during the overall offlining. Or at least it seems so, because when the offline command is issued, the logs show that the MDADM setup responsible for the SWAP created by TrueNAS at installation effectively blocks/reverts the action by reinstating the SWAP MD devices:
md_swap.png
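For anyone wanting to poke at the swap side of this from a shell, these are the kinds of commands involved (md127 is just an example name; the md device numbers will differ per system):

    # block devices and what they are used for
    lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
    # which md arrays exist and which partitions they claim
    cat /proc/mdstat
    # details of the swap mirror that keeps grabbing the partition
    mdadm --detail /dev/md127
    # what is actually in use as swap right now
    swapon --show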

Please bear in mind that upon installation SWAP was supposed to be placed on the boot-pool flash device; TrueNAS even created a 16GB partition there, so I'm not sure why these SWAP devices on the storage drives are used at all... Again, I can only guess it's related to that system_dataset config. Still, they're there, and I'm not swapping in or out most of the time, so usually no harm done. Until now, it seems...
Regardless, the Docs Hub approach fails for me, as I cannot offline the FAULTED drive using the UI right now.
So how do I proceed? How do I replace the "faulted" disk (both in the pool and in SWAP), meaning the whole sda (serial 3RG7AWNA):
cli_disks.png

with a new one? Am I supposed to power down the server, swap out the old "faulted" drive, swap in a new one, and expect TrueNAS to pick up the new drive and automatically apply its partitioning scheme, fix the memberships and resilver the pool? My guess is not, so how do I proceed?
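Side note: to map the serial number to the right /dev node before pulling anything, something like this does the job (sda is just how the failing drive shows up on my system):

    # match the physical drive (serial 3RG7AWNA) to its device node
    lsblk -o NAME,MODEL,SERIAL,SIZE
    smartctl -i /dev/sda | grep -i serial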
Please bear in mind this is a 4-slot device and all 4 slots are currently occupied by the members of the RAIDZ1 storage pool (and the aforementioned SWAP), so I cannot add my new extra disk inside and do the replacement through the UI (and even if I could, I'm still expected to offline the faulty drive first, which currently does not seem to work).
As such, how do I replace the drive?
Is there a UI (web UI or TrueNAS CLI) option I'm missing that needs to be set/reset? Or maybe some shell commands I need to run to get it working?
I really need your help with this.
 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
What I'm thinking of now is:
1) use sgdisk to back up the partition table to a file somewhere on the pool,
2) power off the server,
3) physically replace the old drive with a new one,
4) power on the server,
5) use sgdisk to restore the partition table from the backup saved on the pool (rough sgdisk sketch below),
6) reboot the server again, hoping that the partitions will then be properly re-added to the SWAP and the storage pool,
7) somehow initiate a resilver (though I don't know how, as I cannot find an option for it in the web UI).
I'd really appreciate some guidance.
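The sgdisk part of that plan would look roughly like this (the backup path is just a placeholder I'd pick on the pool; the GUID randomization is there so the new drive doesn't clash with the old one's identifiers):

    # step 1: save the partition table of the failing drive onto the pool
    sgdisk --backup=/mnt/moria/sda-partition-table.bak /dev/sda
    # ... power off, physically swap the drives, power on ...
    # step 5: restore the layout onto the new drive and give it fresh GUIDs
    sgdisk --load-backup=/mnt/moria/sda-partition-table.bak /dev/sda
    sgdisk --randomize-guids /dev/sda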
 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
OK, topic closed.
It seems that, contrary to the Docs Hub instructions for disk replacement, all you need for an in-place replacement of a faulted disk is to power off the device (as in, the server), replace the old faulted drive with a new one, start the system again, go to Storage > Pool > Status and choose Replace on the old device. This way both partitions, the one for SWAP and the one for actual data storage/pool membership, get their scheme recreated, the SWAP partition is reattached in MDADM, and the pool resilver completes successfully without any extra intervention.
So yes, for an in-place disk replacement all you actually need is a power cycle with the drive swapped, followed by a quick and easy "Replace" in the UI/CLI.
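For the CLI-inclined, my understanding is that the shell equivalent of that Replace button boils down to something like the following (a sketch only, and note that the web UI route also takes care of recreating the SWAP partition; the names are placeholders taken from zpool status and from however the new disk's data partition ends up named):

    # replace the faulted member with the new disk's data partition
    zpool replace moria <old-member-name-or-guid> /dev/sdX2
    # watch the resilver progress
    zpool status -v moria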
I guess the device replacement documentation on the site is more adequate for deployments with more disks and more professional options in place.
Please bear in mind that after the replacement and resilver are complete, if you connect the old drive through a USB enclosure it will still carry both the MDADM SWAP array flags/superblocks and the storage pool's ZFS superblocks/flags, so you might want to clear those and wipe any data traces before moving on to other uses... One could turn to the Wipe disk option, but it either needs to scan the whole disk in a "fetching data" stage before proceeding with the actual wipe, or the feature simply fails/hangs on that stage regardless... Anyhow, both shred and badblocks are present on the system, so for cleaning the drive we have the more unix-admin-preferred options anyway ;)
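If anyone prefers the shell route for that cleanup, the usual suspects are along these lines (sdX/sdX1/sdX2 are placeholders for however the old disk shows up in the enclosure, and all of these are destructive, so double-check the device name first):

    # clear the leftover md (SWAP) metadata from the old swap partition
    mdadm --zero-superblock /dev/sdX1
    # clear the leftover ZFS label from the old pool member partition
    zpool labelclear -f /dev/sdX2
    # wipe remaining filesystem/partition signatures from the whole disk
    wipefs -a /dev/sdX
    # or the heavier options mentioned above
    shred -v -n 1 /dev/sdX
    badblocks -wsv /dev/sdX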
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
The thing that's apparently missing, both from the docs and from your understanding, is that FAULTED implies OFFLINE--not all OFFLINE disks are FAULTED, but all FAULTED disks are OFFLINE. If you have a FAULTED disk in your pool, you're free to do exactly as you did: physically replace it (powering off the server to do so if necessary), and then use the Replace feature in the GUI.
 

mkarwin

Dabbler
Joined
Jun 10, 2021
Messages
40
The thing that's apparently missing, both from the docs and from your understanding, is that FAULTED implies OFFLINE--not all OFFLINE disks are FAULTED, but all FAULTED disks are OFFLINE. If you have a FAULTED disk in your pool, you're free to do exactly as you did: physically replace it (powering off the server to do so if necessary), and then use the Replace feature in the GUI.
Well, if you check the Docs Hub page (https://www.truenas.com/docs/scale/scaletutorials/storage/disks/replacingdisks/), it specifically says the following:
Click the icon for the degraded pool to display the Pool Actions dropdown list and then click Status to display the Pool Status screen to locate the failed disk.
then the steps to fix the situation are as follows:
To replace a failed disk:
  1. Offline the disk.
  2. Pull the disk from your system and replace with a disk of at least the same or greater capacity as the failed disk.
  3. Online the new disk.
where the actions for the first step are:
  1. Click more_vert next to the failed disk to display the Disk Actions dropdown menu of options.
  2. Click Offline. A confirmation dialog displays. Click Confirm and then Offline. The system begins the process to take the disk offline. When complete, the list of disks displays the status of the failed disk as Offline.
with example picture as follows:
DiskOfflineSCALE.png

where, as you can clearly see, the drive previously shown as problematic (though in the starting situation the disk was actually marked as degraded, not faulted, yet it is referred to throughout the text as "failed") is clearly required to be offlined first, and that new status is then reported both in the attached example status pictures and in the instruction text itself.
Heck, the instruction itself states as follows for the next step:
You can physically remove the disk from the system when the disk status is Offline.
so basically the instructions state that we begin with a failed device (shown as degraded in the example, but failure and fault are so close in meaning it might just as well have said faulted), that we need to offline it first, and that only once it is offlined can we physically remove the disk from the system.
Throughout the instructions there is no mention or hint that the TrueNAS UI/tools have such an implication anywhere in their logic or infer it by default; what's more, everything presented on the Docs Hub page about drive replacement on disk failure clearly points to no such underlying logic taking place. So not only is it missing from the docs, the docs pretty much state it is not there: after all, if it were, the instructions should have said so instead of stipulating that you need to offline failed disks before the actual removal and replacement. Perhaps this is because the instructions follow the steps for alleviating a "degraded" disk status or a similar failure, and not generally failed or "faulted" ones, and they should say so at the beginning. But alas, they do not. One might have thought this was some copy&paste from the Core tutorials, but there the pictures show a "removed" disk status... so I guess no, no copy&paste docs issue here ;)
 