A little confused regarding spares...

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Once you have detached the spare (do that first) then you can use the command,
zpool detach SM2_stable_pool gptid/155dee1b-4568-11e7-be80-002590c44bba
to remove the failed disk.
Again, you can relate the da# to the gptid with the glabel status command. Be sure you know the da# first, because the sesutil command uses the da# to illuminate the bay that the drive is in. You don't want to remove the spare, just the failed drive.
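As a rough sketch of the order described above (spare first, then the failed disk), here is a dry run that only prints the two commands. plan_detach is a made-up helper name, not part of FreeNAS; the pool name and gptids are the ones quoted in this thread, so check them against your own zpool status before running anything for real.

```shell
# Dry-run sketch: print the detach commands in the order described above.
# plan_detach is a hypothetical helper; it runs nothing against the pool.
plan_detach() {
    pool=$1
    spare=$2
    failed=$3
    # Step 1: detach the spare so it returns to the spare group.
    echo "zpool detach $pool $spare"
    # Step 2: detach the failed disk from the vdev.
    echo "zpool detach $pool $failed"
}

plan_detach SM2_stable_pool \
    gptid/5e600107-cb7b-11e5-8fa9-002590c44bba \
    gptid/155dee1b-4568-11e7-be80-002590c44bba
```

Printing first and pasting the commands by hand afterwards gives you a chance to compare each gptid with glabel status before anything changes.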
 

Stephen Hayne

Dabbler
Joined
May 27, 2016
Messages
14
Hi Stephen,

Unfortunately, the output of your zpool status is incomplete... Can you please post the complete output?

Also, one of your vdevs is named mirror-6 but already contains 3 drives as Online. Usually, a mirror is made of 2 drives. It is possible to do 3-way mirrors (or even more), but this is not very common. Also, none of your other mirrors are like that. Even when 3-way mirrors are used, they are normally used for the entire pool and not for a single vdev in a larger pool.

So please, confirm the exact structure of your pool before we can figure out what is problematic and what can be done to fix it safely.

OK - I updated the above post to reflect the complete output of the entire JBOD. I thought it was enough to just show the volume of interest...
 

Stephen Hayne

The thing you should have done is to start a new thread.

I think we are going to need to do some CLI (command line) things to get this fixed.

Hot-Spares are only activated by the complete failure of a drive that is having problems. I don't know what you did here, but you appear to have activated a spare, and also somehow added another mirror. It is a really strange looking situation.

I don't know how this happened, because I don't know exactly what steps you took, but I think we can clear this up. Just don't worry about it too much, because you have fully three disks holding the data for that vdev right now, so you shouldn't have any risk of data loss.
If you give the model number of the system, I could tell you for sure, but the 48-bay chassis I am familiar with should do this. You need to be able to SSH in from a terminal. I like Cygwin, but you can use PuTTY if you like; it is just so you can use the command line. The command is sesutil locate da10 on to start the light blinking, then sesutil locate da10 off when you are done. It works in systems that have a SAS expander backplane. I have (at work) systems from Supermicro, Chenbro and QNAP that all work with that.
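If you tend to forget to switch the light off again, a small wrapper can do both steps. This is only a sketch: blink_bay and the SESUTIL override are names I made up for illustration, and it assumes the chassis answers sesutil at all. If the first call fails, it stops without waiting.

```shell
# Hypothetical wrapper: light the locate LED for a while, then turn it off.
# SESUTIL can be set to "echo" for a dry run; it defaults to the real sesutil.
blink_bay() {
    dev=$1
    secs=${2:-30}
    # Turn the locate LED on; give up if the enclosure does not respond.
    ${SESUTIL:-sesutil} locate "$dev" on || return 1
    sleep "$secs"
    # Always turn the LED off again once the wait is over.
    ${SESUTIL:-sesutil} locate "$dev" off
}

# Dry run: substitute echo for sesutil so nothing touches the enclosure.
SESUTIL=echo blink_bay da10 0
```

On a live system you would call it as blink_bay da10 60 and walk to the rack while the light is on.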

Don't need the serial number, just the da#...
If you need to get the device number from the gptid, you can use glabel status to show the gptids and the da# for each.
The drive
Code:
 gptid/5e600107-cb7b-11e5-8fa9-002590c44bba 
is the spare, and the GUI doesn't make it clear but you say,
I would use glabel status to be certain.
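To make the gptid and da# lookup less error-prone than reading the table by eye, something like the following could filter the glabel status output. dev_for_gptid and gptid_for_dev are hypothetical helper names; both read the three-column glabel status listing (Name, Status, Components) on stdin. The two sample rows are taken from the listing posted later in this thread.

```shell
# Hypothetical helpers; both read `glabel status` output on stdin.

# Print the partition (e.g. da5p2) that carries the given gptid label.
dev_for_gptid() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Print every gptid label that sits on a partition of the given disk (e.g. da5).
gptid_for_dev() {
    awk -v dev="$1" '$3 ~ ("^" dev "p[0-9]+$") { print $1, $3 }'
}

# Demo on two rows copied from this thread; on a live system use:
#   glabel status | dev_for_gptid gptid/5e600107-cb7b-11e5-8fa9-002590c44bba
dev_for_gptid gptid/5e600107-cb7b-11e5-8fa9-002590c44bba <<'EOF'
gptid/317a924e-a9c7-11e5-9ff8-002590c44bba     N/A  da4p2
gptid/5e600107-cb7b-11e5-8fa9-002590c44bba     N/A  da5p2
EOF
```

The pattern in gptid_for_dev is anchored on both ends so that asking for da5 does not also match da50 or ada5.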

From the command line, you can't address the drive by the da# for this, so you must be able to relate the da# you are working on back to the gptid, because the pool was formed using the gptid, not da#. You should be able to give the command,
zpool detach SM2_stable_pool gptid/5e600107-cb7b-11e5-8fa9-002590c44bba
which should return the spare to the spare group.

Please do that and let me know what the result is.
Thank you for the advice about starting another thread. Will do next time.

I didn't want to wait until the drive failed completely, because I have never experienced the "hot spare" process, so was trying to be proactive. I am baffled as to how the spare got activated with only 15 or so errors, but so be it. Also, don't understand why when I canceled out of the GUI, it seemed to put in a drive anyway (v. FreeNAS-9.10.2-U6 (561f0d7a1) ). And, I am not that worried, given the mirror now has 4 drives... when it only needs 2! :)

When running the sesutil, I get this msg, and no lights on. Is there a cmd to give you more info about my system?
Code:
[root@sm2] ~# sesutil locate da10 on
sesutil: No SES devices found


Tomorrow morning, I will go in and try the detach commands, and let you know what happens. I will insert a couple new drives too. I sincerely appreciate your time! I think I am understanding the technique, following the gptid...

I'm just a lowly professor, with no sysAdmin support, trying to manage almost a petabyte of storage for education/research purposes.
--
Dr. Stephen C. Hayne, Professor, CIS, Colorado State University
__!__ (970)491-7511(w) (970)491-5205(f) (970)204-4040(h)
___(_)___ "I love to fly AngelFlights! 310I - N8109M
"http://selfsynchronize.com/hayne/"
 

Chris Moore

sesutil: No SES devices found
That is a bummer. It means there is no SAS expander backplane in the chassis. I wish you had the model of the chassis, but I don't think there is a command that will give us that.
Not knowing the drive bay may pose a problem unless your chart of serial numbers tells you which drive is in which bay.
 

Stephen Hayne

Once you have detached the spare (do that first) then you can use the command,
zpool detach SM2_stable_pool gptid/155dee1b-4568-11e7-be80-002590c44bba
to remove the failed disk.
Thank you, Chris! This all worked perfectly as advised. I must say I find the process confusing from the GUI, but much easier to understand from the CLI.

Two more questions:

1) is it best practice to just let the hot spare take over when freeNAS has seen enough errors (whenever that is), or is it better to notice the errors, and pro-actively replace the drive?

2) Since my SuperMicro boxes don't support blinking the LED from the CLI (even though the LEDs blink on disk access), is there a better way for me to manage which drive is in which bay, other than maintaining a spreadsheet of SerialNo?

Stephen
 

Chris Moore

is it best practice to just let the hot spare take over when freeNAS has seen enough errors (whenever that is), or is it better to notice the errors, and pro-actively replace the drive?
I don't usually use hot-spares, but on the occasions when I have, I was proactive about activating the hot-spare so I could remove the defective drive and replace it. I have some reporting scripts running that email me drive health statistics every morning, and if you have email notifications set up, you will get an email from the FreeNAS system when certain thresholds are met. My thought has always been that I want to get the malfunctioning drive out of the system before it causes data errors.
Since my SuperMicro boxes don't support blinking the LED from the CLI (even though the LEDs blink on disk access), is there a better way for me to manage which drive is in which bay, other than maintaining a spreadsheet of SerialNo?
I have a few systems that are like that and I keep a chart of which serial number is in which bay as a backup, but I have printed the last four digits of the serial number on little stickers that I put on the drive trays. It is a bit of a pain. All the servers we have bought in the last four or five years support the locate LED and I hope to get the older ones replaced in the next couple years.
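For the sticker approach, a one-liner can turn a list of serial numbers into the short labels. This is just a sketch; sticker_labels is a made-up name, and it assumes one serial per line on stdin. The two serials in the demo are ones posted elsewhere in this thread.

```shell
# Hypothetical helper: for each serial on stdin, print the serial and its last
# four characters (the whole serial if it is four characters or shorter).
sticker_labels() {
    awk 'NF { s = $1; n = length(s); print s, (n > 4 ? substr(s, n - 3) : s) }'
}

# Demo with two serials from this thread.
printf '%s\n' W6A0W5QK S21CNXAG551092F | sticker_labels
```

It is worth checking the output for duplicates, since two drives from the same batch can share their last four characters.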
Another thing that I have done, which works if the drive is still working at all, is to take the drive out of the pool and then use dd to write a bunch of data to the drive. You can then look for the drive activity light that is going crazy. Alternatively, if the drive has fully died and is not working at all, it should show no activity, so you can start a scrub on the pool: all the other drives will be active and the bad drive's light will stay off.
Many ways to get there.
 

Stephen Hayne

Many ways to get there.
I've tried all these to locate a drive that freeNAS says has "0" disk size, and I've accounted for all the SerialNo's in the box (matching my spreadsheet). "dd if=/dev/da28 of=/dev/null" reports: "dd: /dev/da28: Device not configured"

da28 doesn't show up in glabel status, nor is it in any of my volumes...

Ideas?

Screen Shot 2019-04-26 at 4.30.02 PM.png


Code:
[root@sm2] ~# sas2ircu 4 display | grep Serial
  Channel description                     : 1 Serial Attached SCSI
  Serial No                               : WDWCC136FZXU86
  Serial No                               : S21CNXAG551092F
  Serial No                               : Z1Z9N12Q
  Serial No                               : S2JFNWAG710510B
  Serial No                               : W7306V13
  Serial No                               : W7307L0J
  Serial No                               : W6A0WEZZ
  Serial No                               : W6A0W54H
[root@sm2] ~# sas2ircu 3 display | grep Serial
  Channel description                     : 1 Serial Attached SCSI
  Serial No                               : W6A0W5QK
  Serial No                               : W6A0W8AG
  Serial No                               : W7307KYK
  Serial No                               : S21CNXAG551100L
[root@sm2] ~# sas2ircu 2 display | grep Serial
  Channel description                     : 1 Serial Attached SCSI
  Serial No                               : S21CNXAG551098P
  Serial No                               : W6A0WF33
  Serial No                               : S21CNXAG713369T
  Serial No                               : S2JFNWAG711717W
  Serial No                               : Z1Z9MEFB
  Serial No                               : W6A0W8G6
  Serial No                               : W6A0W5DQ
  Serial No                               : Z1Z9N14M
[root@sm2] ~# sas2ircu 1 display | grep Serial
  Channel description                     : 1 Serial Attached SCSI
  Serial No                               : Z1Z9MSYG
  Serial No                               : MK0371YHK5H9VA
  Serial No                               : S2JFNWAG708729W
  Serial No                               : S21CNXAG713372L
  Serial No                               : W6A0W4Y0
  Serial No                               : W6A0W35Q
  Serial No                               : Z1Z9MT02
  Serial No                               : ZAD7ZS09
[root@sm2] ~# sas2ircu 0 display | grep Serial
  Channel description                     : 1 Serial Attached SCSI
  Serial No                               : W6A0WE9C
  Serial No                               : S2JFNWAG705816N
  Serial No                               : Z1Z9MT0X
  Serial No                               : WDWCC134LLA2JE
  Serial No                               : W6A0W55V
  Serial No                               : S21CNXAG713367M
  Serial No                               : Z1Z9MSZZ
  Serial No                               : WDWCC130DXYNHE
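Comparing those serials against the spreadsheet by hand is tedious, so here is a sketch of how it could be scripted. serials_from_sas2ircu and missing_from_controllers are hypothetical names, and the chart file (one serial per line, exported from the spreadsheet) is an assumption on my part. The parser relies only on the "Serial No : ..." lines shown above.

```shell
# Hypothetical helpers for checking a serial-number chart against the controllers.

# Read `sas2ircu N display` output on stdin; print one drive serial per line.
# Unlike `grep Serial`, this skips the "Channel description" line.
serials_from_sas2ircu() {
    awk -F: '/^[ \t]*Serial No[ \t]*:/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Print serials listed in the chart file ($1) that no controller reported (stdin).
missing_from_controllers() {
    chart=$1
    found=$(serials_from_sas2ircu)
    while read -r serial; do
        [ -n "$serial" ] || continue
        printf '%s\n' "$found" | grep -qxF -- "$serial" || printf '%s\n' "$serial"
    done < "$chart"
}

# Demo on three lines copied from the output above.
serials_from_sas2ircu <<'EOF'
  Channel description                     : 1 Serial Attached SCSI
  Serial No                               : W6A0W5QK
  Serial No                               : W6A0W8AG
EOF
```

On the live box, the five controllers could be fed in together, for example: for c in 0 1 2 3 4; do sas2ircu "$c" display; done | missing_from_controllers chart.txt (chart.txt being whatever you export from the spreadsheet). Any serial it prints is a drive the chart expects but no controller currently reports.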
 

Chris Moore

I've tried all these to locate a drive that freeNAS says has "0" disk size, and I've accounted for all the SerialNo's in the box (matching my spreadsheet). "dd if=/dev/da28 of=/dev/null" reports: "dd: /dev/da28: Device not configured"

da28 doesn't show up in glabel status, nor is it in any of my volumes...
That is very strange. I would say that it is a drive where the internal controller has failed so that it isn't responding but it is still being recognized partially. I have seen that happen two or three times over the years. If you don't have any record of where it is in the chassis, you may need to schedule a maintenance window to shut the server down and take a peek at the drives physically. I would start by looking at the drive bays that have no activity on their light, but there are no guarantees when a drive behaves unusually like this.

It might even be that the drive was already removed and it is just still showing in the GUI because something got a little out of whack. I have seen that before too. A drive physically removed from the server but it is still listed in the GUI.
 

Stephen Hayne

That is very strange.
<snip>
It might even be that the drive was already removed and it is just still showing in the GUI because something got a little out of whack. I have seen that before too. A drive physically removed from the server but it is still listed in the GUI.
Yes, I'll bet this is it. I now have 3 or 4 "disconnects" between glabel and the GUI (da8, da10, da28) and glabel shows gptids da37 and da38 that don't show in the GUI at all (View Disks), AND have p1 and p2 - which I don't think they should.

Dang - I guess I will have to bring down a system that has been "up" for 614 days... :)

Code:
 glabel status | sort -k 3 -V
                                      Name  Status  Components
gptid/35476667-a9c7-11e5-9ff8-002590c44bba     N/A  ada0p1
gptid/350a3d9c-a9c7-11e5-9ff8-002590c44bba     N/A  ada1p1
gptid/34ba9c48-a9c7-11e5-9ff8-002590c44bba     N/A  da0p2
gptid/97213b77-a9c6-11e5-9ff8-002590c44bba     N/A  da1p2
gptid/030deda1-eb51-11e8-a953-002590c44bba     N/A  da2p2
gptid/98231417-a9c6-11e5-9ff8-002590c44bba     N/A  da3p2
gptid/317a924e-a9c7-11e5-9ff8-002590c44bba     N/A  da4p2
gptid/5e600107-cb7b-11e5-8fa9-002590c44bba     N/A  da5p2
gptid/0d24df0b-59d1-11e7-be80-002590c44bba     N/A  da6p2
gptid/0df51703-59d1-11e7-be80-002590c44bba     N/A  da7p2
gptid/9760c3e1-a9c6-11e5-9ff8-002590c44bba     N/A  da9p2
gptid/2dc44292-a9c7-11e5-9ff8-002590c44bba     N/A  da11p2
gptid/986614ad-a9c6-11e5-9ff8-002590c44bba     N/A  da12p2
gptid/32516de7-a9c7-11e5-9ff8-002590c44bba     N/A  da13p2
gptid/490dc8cb-59d1-11e7-be80-002590c44bba     N/A  da14p2
gptid/49cf2831-59d1-11e7-be80-002590c44bba     N/A  da15p2
gptid/97a07b97-a9c6-11e5-9ff8-002590c44bba     N/A  da16p2
gptid/2e84ea5f-a9c7-11e5-9ff8-002590c44bba     N/A  da17p2
gptid/98aa97b9-a9c6-11e5-9ff8-002590c44bba     N/A  da18p2
gptid/99765dc8-a9c6-11e5-9ff8-002590c44bba     N/A  da19p2
gptid/30a73711-a9c7-11e5-9ff8-002590c44bba     N/A  da20p2
gptid/11fddb29-c613-11e5-8fa9-002590c44bba     N/A  da21p2
gptid/7618aed4-59d1-11e7-be80-002590c44bba     N/A  da22p2
gptid/76f4dec8-59d1-11e7-be80-002590c44bba     N/A  da23p2
gptid/2f3a11de-a9c7-11e5-9ff8-002590c44bba     N/A  da24p2
gptid/99323483-a9c6-11e5-9ff8-002590c44bba     N/A  da25p2
gptid/340b05fb-a9c7-11e5-9ff8-002590c44bba     N/A  da26p2
gptid/1631a160-4568-11e7-be80-002590c44bba     N/A  da27p2
gptid/2fe8611d-a9c7-11e5-9ff8-002590c44bba     N/A  da29p2
gptid/97e0dc43-a9c6-11e5-9ff8-002590c44bba     N/A  da30p2
gptid/98ee89f7-a9c6-11e5-9ff8-002590c44bba     N/A  da31p2
gptid/332cce85-a9c7-11e5-9ff8-002590c44bba     N/A  da32p2
gptid/12f1687a-c613-11e5-8fa9-002590c44bba     N/A  da33p2
gptid/99b7f4e5-2921-11e7-8bbe-002590c44bba     N/A  da34p2
gptid/2173a885-4568-11e7-be80-002590c44bba     N/A  da35p2
gptid/e234cc5c-62c3-11e9-a953-002590c44bba     N/A  da36p2
gptid/e68c0a6b-61fb-11e5-ba41-002590c44bba     N/A  da37p1
gptid/e69d8811-61fb-11e5-ba41-002590c44bba     N/A  da37p2
gptid/e5296c93-92f2-11e5-8c9d-002590c44bba     N/A  da38p1
gptid/e5436266-92f2-11e5-8c9d-002590c44bba     N/A  da38p2
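The gaps in that listing can also be found mechanically. The sketch below (missing_da_numbers is a made-up name) reads glabel status output on stdin and prints every daN number between da0 and the highest one seen that has no labelled partition. Run against the listing above, it should report da8, da10 and da28, which matches the three "disconnects" mentioned.

```shell
# Hypothetical helper: read `glabel status` output on stdin and print the daN
# devices that are absent between da0 and the highest da number present.
missing_da_numbers() {
    awk 'BEGIN { max = -1 }
         $3 ~ /^da[0-9]+p[0-9]+$/ {
             n = $3
             sub(/^da/, "", n)        # drop the "da" prefix
             sub(/p[0-9]+$/, "", n)   # drop the partition suffix
             seen[n + 0] = 1
             if (n + 0 > max) max = n + 0
         }
         END { for (i = 0; i <= max; i++) if (!(i in seen)) print "da" i }'
}

# Live usage would be:  glabel status | missing_da_numbers
# Demo on a shortened sample:
missing_da_numbers <<'EOF'
gptid/0df51703-59d1-11e7-be80-002590c44bba     N/A  da7p2
gptid/9760c3e1-a9c6-11e5-9ff8-002590c44bba     N/A  da9p2
EOF
```

A missing number only means no labelled partition was found for that device; the disk may be gone, unpartitioned, or simply not numbered contiguously, so treat the result as a list of things to check rather than a verdict. The ada boot devices are ignored on purpose.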
 

Chris Moore

Dang - I guess I will have to bring down a system that has been "up" for 614 days... :)
A reboot once in a while is probably a good thing. Longest up-time I ever had was 700 some odd days but I don't remember exactly.
Yes, I'll bet this is it. I now have 3 or 4 "disconnects" between glabel and the GUI (da8, da10, da28) and glabel shows gptids da37 and da38 that don't show in the GUI at all (View Disks), AND have p1 and p2 - which I don't think they should.
The default for FreeNAS is to create two partitions per data disk. Partition 1 (p1) is the swap partition and partition 2 (p2) is the ZFS partition for your data. I do some custom fiddling with my systems and put the swap on a mirrored pair of SSDs instead of having swap partitions on my data disks.
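To read a glabel status listing with that layout in mind, the rows could be annotated as in the sketch below. label_partition_roles is a hypothetical name, and the p1 = swap, p2 = ZFS mapping is only the default layout described above, applied to da data disks; running gpart show on the disk itself is the way to confirm the actual partition types.

```shell
# Hypothetical helper: read `glabel status` output on stdin and annotate each
# da partition with its role under the default FreeNAS data-disk layout.
# Assumption: p1 is swap and p2 is ZFS data; boot devices (ada*) are skipped.
label_partition_roles() {
    awk '$3 ~ /^da[0-9]+p[0-9]+$/ {
             part = $3
             sub(/^da[0-9]+p/, "", part)   # keep only the partition number
             role = (part == "1") ? "swap" : ((part == "2") ? "zfs data" : "other")
             print $3, role
         }'
}

# Demo on the da37 rows from the listing posted earlier in this thread.
label_partition_roles <<'EOF'
gptid/e68c0a6b-61fb-11e5-ba41-002590c44bba     N/A  da37p1
gptid/e69d8811-61fb-11e5-ba41-002590c44bba     N/A  da37p2
EOF
```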
 