Didn't read manual before trying to replace faulted drive, now stuck

Chuck Remes

Contributor
Joined
Jul 12, 2016
Messages
173
I should have known better. Don't beat me up, I've already kicked myself.

Details:
TrueNAS 12.0-U6
Supermicro X10SDV-8C-TLN4
Fractal Design R5 (white!)
128GB RAM
Seasonic 650-X
LSI 9211-8i
32GB SATA-DOM (boot)
7x 8TB Seagate IronWolf as RAIDZ2
1x 120GB SSD for VMs and docker

Yesterday one of my Seagate drives in the RAIDZ2 reported read errors. I went into the GUI and tried to OFFLINE it. The command returned no errors, but a REFRESH didn't update the status. Instead of stopping here to read the manual, I continued on, blissfully ignorant.

I rebooted. Disk came up as OFFLINE. I chose to REPLACE but it said it couldn't do it (I forget the message). I rebooted. Went into the GUI and tried REPLACE. It said the disk already had a partition table on it and couldn't be used. Again, instead of reading the manual I did something dumber... I googled "how to remove disk partitions freebsd" and ran
Code:
dd if=/dev/zero of=/dev/da2 bs=512 count=1
to blow away the partition table on the faulted disk. Then I rebooted.
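
A note on the `dd` above: with `bs=512 count=1` it zeroes only sector 0. On a GPT-partitioned disk (which is how TrueNAS normally lays out pool members) that sector holds just the protective MBR. The primary GPT header sits at LBA 1 and a backup copy sits in the last sectors of the disk, so a partition table can still be detected afterwards. `gpart destroy -F`, which comes up later in this thread, is the usual tool for clearing it. The sketch below only does arithmetic to show where the GPT structures live. It assumes 512-byte logical sectors and the default 128-entry table; `gpt_regions` and the sector count passed to it are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the LBA ranges that hold GPT metadata on a disk.
# Assumptions: 512-byte logical sectors, default 128 entries of 128 bytes
# each (16384 bytes = 32 sectors). gpt_regions is a hypothetical helper.
gpt_regions() {
    total=$1      # total number of logical sectors on the disk
    entries=32    # sectors used by the partition entry array
    echo "protective MBR: 0"
    echo "primary header: 1"
    echo "primary entries: 2-$((1 + entries))"
    echo "backup entries: $((total - 1 - entries))-$((total - 2))"
    echo "backup header: $((total - 1))"
}

# Illustrative sector count only, not the size of any drive in this thread.
gpt_regions 1000
```

Only the first line of that output (LBA 0) is touched by a one-sector `dd`; everything else listed stays on the disk.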

Here's the current pool status:
[Attached screenshot: 1639582512597.png]


Now when I try to replace through the GUI, it shows no available device in the drop down.

When I run
Code:
grep "ATA ST" /var/run/dmesg.boot
I get this back:
Code:
da2: <ATA ST10000VN0004-1Z SC60> Fixed Direct Access SPC-4 SCSI device
da3: <ATA ST10000VN0004-1Z SC60> Fixed Direct Access SPC-4 SCSI device
da4: <ATA ST10000VN0004-1Z SC60> Fixed Direct Access SPC-4 SCSI device
da0: <ATA ST10000VN0004-1Z SC60> Fixed Direct Access SPC-4 SCSI device
da1: <ATA ST10000VN0004-1Z SC60> Fixed Direct Access SPC-4 SCSI device
da5: <ATA ST10000VN0004-1Z SC60> Fixed Direct Access SPC-4 SCSI device


I had 7 disks but now 6 are showing up. And, of course, the devices have renumbered themselves so there is a /dev/da2 again but it's a different hardware device.

I have NOW looked at the manual but I have been unable to find assistance for this problem. I do not have a replacement drive yet (ordered). I would like to replace with the old faulted drive since it only had a few READ errors. What are the proper steps here to accomplish this? GUI or CLI...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What does "camcontrol devlist" tell you is out there for disks?

It is not unheard of for a disk to develop some bad blocks as part of the process of dying, then to experience a catastrophic failure if power cycled. That's less likely if you merely rebooted without a power-off.

If the system isn't seeing the disk, your options are really to check all the cabling, maybe pull the "failed" drive to see if it's seen in another PC, etc., and make a call as to whether or not the disk hard-failed.
 

Chuck Remes

Code:
% sudo camcontrol devlist
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 7 lun 0 (pass0,da0)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 8 lun 0 (pass1,da1)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 10 lun 0 (pass2,da2)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 11 lun 0 (pass3,da3)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 12 lun 0 (pass4,da4)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 13 lun 0 (pass5,da5)
<SuperMicro SSD SOB20R>            at scbus1 target 0 lun 0 (pass6,ada0)
<Samsung SSD 850 EVO 120GB EMT01B6Q>  at scbus4 target 0 lun 0 (pass7,ada1)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (pass8,ses0)

Doesn't see it there either. I didn't power cycle the unit. I will do that and pull the failed drive from the enclosure. Years back I'm pretty sure I put labels on all of these things so I could uniquely identify them in the chassis... here's hoping "past me" was smart enough to do that.

EDIT:
After a reboot, it now sees the 7th SATA disk:
Code:
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 7 lun 0 (pass0,da0)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 8 lun 0 (pass1,da1)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 9 lun 0 (pass2,da2)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 10 lun 0 (pass3,da3)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 11 lun 0 (pass4,da4)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 12 lun 0 (pass5,da5)
<ATA ST10000VN0004-1Z SC60>        at scbus0 target 13 lun 0 (pass6,da6)
<SuperMicro SSD SOB20R>            at scbus1 target 0 lun 0 (pass7,ada0)
<Samsung SSD 850 EVO 120GB EMT01B6Q>  at scbus4 target 0 lun 0 (pass8,ada1)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (pass9,ses0)
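
Because the `daN` names shift whenever a disk drops out, the scbus target number is a steadier handle for comparing the two listings above. The following is a minimal sketch, assuming the two `camcontrol devlist` outputs have been saved as text; `list_targets` and `new_targets` are hypothetical helper names, and the embedded data is copied from the listings in this thread.

```shell
#!/bin/sh
# Sketch: compare two saved `camcontrol devlist` outputs by scbus target.
# list_targets and new_targets are hypothetical helper names.

# Read devlist output on stdin; print "target device" for each da disk.
list_targets() {
    awk '/,da[0-9]+\)/ {
        for (i = 1; i < NF; i++) if ($i == "target") t = $(i + 1)
        dev = $NF
        sub(/^.*,/, "", dev)   # "(pass2,da2)" -> "da2)"
        sub(/\)$/, "", dev)    # "da2)" -> "da2"
        print t, dev
    }'
}

# $1 = earlier output, $2 = later output.
# Report every target present in the later output but not the earlier one.
new_targets() {
    old=$(printf '%s\n' "$1" | list_targets | awk '{ print $1 }')
    printf '%s\n' "$2" | list_targets | while read -r target dev; do
        printf '%s\n' "$old" | grep -qx "$target" ||
            echo "target $target ($dev) was absent before"
    done
}

before='<ATA ST10000VN0004-1Z SC60> at scbus0 target 7 lun 0 (pass0,da0)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 8 lun 0 (pass1,da1)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 10 lun 0 (pass2,da2)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 11 lun 0 (pass3,da3)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 12 lun 0 (pass4,da4)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 13 lun 0 (pass5,da5)
<SuperMicro SSD SOB20R> at scbus1 target 0 lun 0 (pass6,ada0)'

after='<ATA ST10000VN0004-1Z SC60> at scbus0 target 7 lun 0 (pass0,da0)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 8 lun 0 (pass1,da1)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 9 lun 0 (pass2,da2)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 10 lun 0 (pass3,da3)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 11 lun 0 (pass4,da4)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 12 lun 0 (pass5,da5)
<ATA ST10000VN0004-1Z SC60> at scbus0 target 13 lun 0 (pass6,da6)
<SuperMicro SSD SOB20R> at scbus1 target 0 lun 0 (pass7,ada0)'

new_targets "$before" "$after"
# prints: target 9 (da2) was absent before
```

This suggests, without proving, that the disk which went missing is the one on target 9, and that it came back as `da2` while the other disks shifted up by one.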

BTW, past me did NOT put external labels on the drives, so matching them will take a little work. Luckily, there ARE serial number stickers on the side so with a little `diskinfo -v` action I can match them all up pretty quickly.
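
The serial matching could be scripted roughly like this. It is a sketch, assuming `diskinfo -v` prints the serial on a line ending in `# Disk ident.`; `disk_ident` is a hypothetical helper name and the sample serial is invented.

```shell
#!/bin/sh
# Sketch: pull the serial ("Disk ident.") out of `diskinfo -v` output.
# disk_ident is a hypothetical helper name.
disk_ident() {
    awk '/# Disk ident\./ { print $1 }'
}

# Hypothetical sample in the shape diskinfo -v prints; the serial is made up.
sample='/dev/da2
    512                     # sectorsize
    ATA ST10000VN0004-1Z    # Disk descr.
    EXAMPLE123              # Disk ident.'
printf '%s\n' "$sample" | disk_ident

# On the NAS itself (assumes single-digit da devices, as in this thread).
# Skipped automatically on machines without diskinfo.
if command -v diskinfo >/dev/null 2>&1; then
    for d in /dev/da[0-9]; do
        printf '%s ' "$d"
        diskinfo -v "$d" | disk_ident
    done
fi
```

The loop prints one `device serial` pair per line, which can then be checked against the stickers on the drives. It will likely need to run as root or via `sudo`.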
 

Chuck Remes

Went into the GUI and did a REPLACE now that the disk is recognized by the system again. Got an error.
[Attached screenshot: 1639591008663.png]

`ls /dev` prior to this showed a "da2" listed. Now it's not there anymore. Presumably the `gpart` command destroyed that file as part of its work.

What's the next thing to try?
 

jgreco

Mmmm. "gpart destroy -F /dev/da2" failed?

That's really a non-fail-y command in the overall scheme of things. I'm not sure what the exact set of steps in FreeNAS is, but in my own tools, this is really the last major step in clearing a GEOM disk partition setup.

Are you sure that da2 hasn't disappeared again? That would be my suspicion. It started working on the disk, the disk gave up the ghost, and then the NAS is like "huh wha?" ...
 

Chuck Remes

Thanks, @jgreco, I think you are right that the drive itself is trashed. I removed it and put it into an external chassis to mess with it and it grinds and goes offline. Luckily, these drives are still under warranty through March 2022 (by the skin of my teeth). I've already processed the RMA and it's going out today.

Thanks all for your help. I may revive this thread or start a new one when my replacement drive arrives and I need a hand replacing it.
 