
Disk failure LEDs for Supermicro SAS backplanes 0.3

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Still though, I think the sesutil command brings the FreeNAS team one step closer to being able to put a button in place to light up specific drives for maintenance!
In an ideal scenario, the faulted drive would be marked automatically without you needing to click a button, but having a button to click would be a nice feature so you can identify a specific drive if you wanted to replace it when it isn't faulted. Several of the hardware RAID controllers (and some other NAS vendors) have a feature for that, in the event you want to replace a drive with a larger drive.
I took a look at the link for your chassis, not pretty from the front but it looks to be well utilized space! (And better priced!)
If this guy ever gets some more of them, he was accepting a best offer price of $350.
https://www.ebay.com/itm/Server-Chenbro-48-Bay-Top-Loader-4U-Chassis-NEW-in-Box-/253503074088
 

Visseroth

Guru
Joined
Nov 4, 2011
Messages
546
I would think it would be a standard "available" feature. In the case of FreeNAS, I understand it won't always work, but why not have it available, with a place in the GUI to change the command? That way, if you have a SAS2 backplane, a SAS3 backplane, or some other backplane, you could adjust the command to match your system's requirements.

If this guy ever gets some more of them, he was accepting a best offer price of $350.
I wouldn't mind, but I'm happy with what I have so far. It works and isn't even full yet, with no plans to fill it at this point.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Guess I will contribute a little bit on this topic before sleep. I found that sas2ircu does not work for all of my backplanes while sesutil does, so I would love to make sesutil work.

Code:

zpool status | grep -E "(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|ONLINE)" | grep -vE "($pool|NAME|mirror|raidz|stripe|logs|spares|state)" | ./fixer.sh

I included "ONLINE" for easy testing; obviously it should be removed in the final script. The output should look like this:
Code:

			mfisyspd0p2  ONLINE       0     0     0
			mfisyspd1p2  ONLINE       0     0     0
			da6p2        ONLINE       0     0     0
			da7p2        ONLINE       0     0     0
			da8p2        ONLINE       0     0     0
			da2p2        ONLINE       0     0     0
			da10p2       ONLINE       0     0     0
			da11p2       ONLINE       0     0     0
			da3p2        ONLINE       0     0     0
			da5p2        ONLINE       0     0     0
		  nvd0           ONLINE       0     0     0


Obviously you need to set the $pool variable and have fixer.sh available, as in the OP. The next step is to strip off everything after the state word (DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|ONLINE) and the trailing pX partition suffix (X being any number). The last step would be to issue sesutil locate <diskname> on for each remaining disk.
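
Putting those steps together, a rough and completely untested sketch might look like this (fixer.sh and the $pool variable set up as in the OP; ONLINE left out since it was only for testing):

Code:

#!/bin/sh
# Untested sketch: light the locate LED of every non-healthy member disk.
pool="m8"   # example pool name - set to your own

zpool status "$pool" \
    | grep -E "(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED)" \
    | grep -vE "($pool|NAME|mirror|raidz|stripe|logs|spares|state)" \
    | ./fixer.sh \
    | awk '{print $1}' \
    | awk -F'p[0-9]' '{print $1}' \
    | while read disk; do
        sesutil locate "$disk" on
      done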
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
OK, I think I made a bit of progress here.
For disks that are REMOVED|OFFLINE|UNAVAIL, zpool status replaces the device name in the first column with a number and appends something like "was /dev/xxxxxxxxx", like:
Code:

		NAME                                            STATE     READ WRITE CKSUM
		m8                                              DEGRADED     0     0     0
		  mirror-0                                      DEGRADED     0     0     0
			7022280606660962355                         REMOVED      0     0     0  was /dev/gptid/1956925c-c9e5-11e8-b164-ecf4bbcb7e90
			gptid/1a38d15c-c9e5-11e8-b164-ecf4bbcb7e90  ONLINE       0     0     0


this can be processed with
Code:

zpool status | grep -E "was /dev/" | awk -F'was /dev/' '{print $2}' | ./fixer.sh | awk -F'p[0-9]' '{print $1}' | awk 'NF'

Unfortunately this cannot work for a removed disk, because it has disappeared from the geom list.

With a disk in the FAULTED state, it should look like this (according to this example: https://forums.freebsd.org/threads/zpool-clear-doesnt-affect-faulted-disk.57347/):
Code:
	   NAME                                            STATE     READ WRITE CKSUM
	   stuff                                           DEGRADED     0     0     0
		 raidz1-0                                      DEGRADED     0     0     0
		   gptid/54e55c16-5275-11e5-bf1a-10c37b9dc3be  ONLINE       0     0     0
		   ada2p3                                      FAULTED      0     0     0  too many errors
		   ada1p8                                      ONLINE       0     0     0

this can be processed with
Code:

zpool status | grep -E "(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|ONLINE)" | grep -vE "($pool|NAME|mirror|raidz|stripe|logs|spares|state|was /dev/)" | awk -F'(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|ONLINE)' '{print $1}' | ./fixer.sh | awk -F'p[0-9]' '{print $1}' | awk 'NF'

I am not sure how other states like DEGRADED|FAIL|DESTROYED are displayed for disks, or whether they are applicable. If anyone can give me an example of that output, it would be helpful.
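
For reference, an untested sketch that chains both cases together (fixer.sh and $pool as before; the second half will still come up empty for a removed disk, as noted above):

Code:

#!/bin/sh
# Untested sketch combining both processing paths above.
pool="stuff"   # example pool name

# Disks still listed by name but in an error state (e.g. FAULTED)
zpool status "$pool" \
    | grep -E "(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED)" \
    | grep -vE "($pool|NAME|mirror|raidz|stripe|logs|spares|state|was /dev/)" \
    | awk -F'(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED)' '{print $1}' \
    | ./fixer.sh | awk -F'p[0-9]' '{print $1}' | awk 'NF' \
    | while read disk; do sesutil locate "$disk" on; done

# Disks replaced by a GUID, carrying a "was /dev/..." note
zpool status "$pool" \
    | grep -E "was /dev/" \
    | awk -F'was /dev/' '{print $2}' \
    | ./fixer.sh | awk -F'p[0-9]' '{print $1}' | awk 'NF' \
    | while read disk; do sesutil locate "$disk" on; done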
 
Last edited:

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
@danb35 I have installed this script on my system and it is giving me some messages that I am guessing might be errors. Is the version at github.com/danb35/zpscan current?
This is the output when I run it from the terminal:
Code:
Saving drive list.
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
I am not sure if this actually represents an error. Thought I would check.

Update - I may have answered my own question. It only happens on the pool that has a partitioned PCI-E NVMe drive for SLOG and L2ARC.
 
Last edited:

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Lemme guess, it's trying to run something on an nvd device it got from the list of disks in the pool and stuff breaks from there in really weird ways.

For extra fun, I still don't know of any way to associate nvd devices with their nvme parents. And I went digging in the source for both drivers - somehow, there's no blob of code obviously responsible for the association, and none of the structures seem to have a way of keeping track of it.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Is the version at github.com/danb35/zpscan current?
Yes.
I am not sure if this actually represents an error.
It does. Though I'm curious--you say it's only happening on a pool with an NVMe device. Does it work properly (including lighting the appropriate lights) on the other pool(s)? Because IIRC, you aren't using Supermicro hardware--it'd be interesting to see that this works (to a degree) on other kit.
 

Sjöhaga

Dabbler
Joined
Apr 17, 2016
Messages
41
I am no whizz, but I did find a few things. Not sure if I fixed them or broke them some more, but here it is :)

The condition test is made case-insensitive, which means it will trigger on a healthy pool that has not had its feature flags updated (the zpool status output contains the word "unavailable" as part of that message). Removing the -i solves that; I'm not sure whether that introduces another problem, but AFAIK all error statuses from zpool are in upper case, as listed in the script.
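
Paraphrasing from memory (the exact line in the script may differ), the change is just dropping -i from the condition test:

Code:

# Before: case-insensitive, so it also matches "unavailable" in the
# "Some supported features are not enabled" status message
condition=$(zpool status $pool | grep -iE "(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED)")

# After: case-sensitive, only matches the upper-case error states
condition=$(zpool status $pool | grep -E "(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED)")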

The error
Code:
 rm: glabel-lookup.sed: No such file or directory
is caused by running the script twice from cron (two instances try to remove the same file, and the second one fails).

The grep error above happens for any drive not connected to a SAS controller, i.e. sg_vpd does not return a valid response, so $sasaddr is empty.

I made a quick'n'dirty mod for my own use where I loop over all pools in one script, plus added a test on $sasaddr, as I have SSDs connected to the motherboard SATA ports that would give the same error as Chris got above.
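
The $sasaddr test is just an early bail-out inside the per-drive loop, roughly like this (paraphrased; the variable name is the one the original script already uses):

Code:

# After the script derives $sasaddr from sg_vpd, skip drives that
# didn't return one (e.g. SSDs on the motherboard SATA ports)
if [ -z "$sasaddr" ]; then
    continue
fi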

Attached for everyone's amusement ;)
 

Attachments

  • zpscan2.txt
    2.9 KB · Views: 508

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Does it work properly (including lighting the appropriate lights) on the other pool(s)? Because IIRC, you aren't using Supermicro hardware--it'd be interesting to see that this works (to a degree) on other kit.
I would say that my testing is incomplete, but I have been looking at the script and it is using the same command I used manually:
Line 38: sas2ircu 0 locate $location ON
Line 60: sas2ircu 0 locate $loc OFF
I expect it to work on my Chenbro chassis at home because of my experience when I put together the resource I recently posted.
I have a system at work that is using a SAS3 controller for two enclosures and a SAS2 controller for four more enclosures, 122 drives in all, and I don't think this script will handle both. I will be doing some testing to see what I can figure out. I want to make it where someone other than me can support these systems at work, so I need to test and document the setup and support processes.
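
If I do end up scripting it, my rough (untested) thinking is to pick the utility based on which controller generation the enclosure hangs off of. The controller indexes here are hypothetical; check the output of sas2ircu LIST and sas3ircu LIST:

Code:

#!/bin/sh
# Untested sketch: light a drive behind either the SAS2 or SAS3 controller.
light_drive() {
    gen="$1"        # 2 or 3, depending on the controller generation
    enc_slot="$2"   # Enclosure:Slot for the drive

    case "$gen" in
        2) sas2ircu 0 locate "$enc_slot" ON ;;
        3) sas3ircu 0 locate "$enc_slot" ON ;;
    esac
}

# Example: light the drive in enclosure 2, slot 5, behind the SAS2 controller
light_drive 2 2:5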

Edit - The four enclosures on the SAS2 controller are roughly three-year-old 16-bay QNAP enclosures that connect by external SAS, and the locate method from sas2ircu works on those. I had a drive that I wanted to change out on Friday and had to remember how to locate it, hence the resource. It has been a while and I couldn't remember the easy way with sesutil. I blame it on age and having too many other things to remember.
 
Last edited:

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Attached for everyone's amusement ;)
Looks like a nice mod to the script.
What kind of SAS backplane are the drives connected to in your system?
Have you manually tested the locate method to ensure that the LED comes on?
 

Sjöhaga

Dabbler
Joined
Apr 17, 2016
Messages
41
I have a Supermicro SAS2 backplane with no markings or serial number on it :) But it does match all the pictures of their dual-channel SAS2 backplane S-846EL2; sesutil lists it as LSI CORP SAS2X36 0717.

And yes, I can turn the LEDs off and on using both the sas2ircu method and the sesutil method.
 
Joined
Mar 6, 2019
Messages
1
Thank you for this script and the work associated with it. I always feel bad reporting bugs/issues without solutions, and I am definitely a fan of anything that creates/contributes things like this. I also want to point out that I'm not a developer and haven't spent all that much time looking at this, so my apologies if I'm way off base here or have anything wrong.

I've been using this script for a while now, and in testing it worked great, but I just had another drive fail and have now noticed a few flaws. In my most recent drive failure, the drive completely disappeared from the system. Unfortunately this means the drive isn't present in the glabel output, so there is no way to convert the gptid listed in "zpool status" to the device name, and therefore no way for the script in its current form to turn on the failure LED.

I believe the script could still function in this situation if it relied on glabel output only while the pool is healthy (in the else portion of the main if statement) and also recorded the gptid of each drive in $drivesfile. That way, when there is an issue with one of the pools (the matching/first portion of the main if statement), it could search $drivesfile for the gptid of the missing drive to grab the enclosure address directly; a rough sketch follows. There may be other ways to handle this, but in my head that seems to make the most sense. For the time being, I think this caveat should be noted in the script header so people are aware of it.
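
To make the idea concrete, here is the sort of lookup I have in mind (untested; the file format is made up, and $drivesfile is the variable the script already uses):

Code:

#!/bin/sh
# Untested sketch. Assumed (made-up) file format: one "gptid enclosure:slot"
# pair per line, rewritten on every run where the pool is healthy.
drivesfile=/root/.zpscan-drives   # hypothetical path

# Degraded path: resolve a vanished disk through the cached mapping
# instead of glabel, which no longer lists it.
missing_gptid="$1"
sasloc=$(grep "^${missing_gptid} " "$drivesfile" | awk '{print $2}')
if [ -n "$sasloc" ]; then
    sas2ircu 0 locate "$sasloc" ON
fi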

I personally don't see myself having time to work this out in the near future (next few weeks) though I'm definitely adding it to my todo list. If I do get this working I will report back here with any changes I've come up with.

The other weakness I can see with this script is multiple drive failures in the same pool. If you replace one of the failed drives, I don't believe it would turn off that LED, because the else portion of the main if statement (where LEDs are turned off) is never evaluated until the pool is healthy again. I believe fixing this will require a bit more of a rework. I would approach it by removing the if statement entirely: each run, generate a new list of failed drives, compare it with the list from the last run, and then enable/disable LEDs for drives that appeared in or disappeared from the list, roughly as sketched below. I personally don't care too much about this, as I intend to keep a healthy pool with at most one failed drive at a time. Again, it would be best if this caveat were noted in the script header.
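
In pseudo-shell, the rework might look something like this (completely untested; list_failed_drives is a made-up helper standing in for however the script derives the failed list):

Code:

#!/bin/sh
# Untested sketch: diff the current failed-drive list against the previous
# run's, so LEDs follow multiple failures and replacements without waiting
# for the whole pool to be healthy again.
statedir=/root/.zpscan   # hypothetical state directory
mkdir -p "$statedir"
touch "$statedir/failed.old"

# list_failed_drives is a stand-in; comm(1) needs sorted input
list_failed_drives | sort > "$statedir/failed.new"

# Newly failed since the last run: LEDs on
comm -13 "$statedir/failed.old" "$statedir/failed.new" | while read loc; do
    sas2ircu 0 locate "$loc" ON
done

# Recovered or replaced since the last run: LEDs off
comm -23 "$statedir/failed.old" "$statedir/failed.new" | while read loc; do
    sas2ircu 0 locate "$loc" OFF
done

mv "$statedir/failed.new" "$statedir/failed.old"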
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I personally don't see myself having time to work this out in the near future (next few weeks) though I'm definitely adding it to my todo list. If I do get this working I will report back here with any changes I've come up with.
Thank you for taking a look at it.
 
Joined
Jan 18, 2017
Messages
525
Oh dear, it seems my update to 12.0-U5 has broken the script:
Code:
./zpscan.sh: line 21: /sbin/zpool: No such file or directory
Saving drive list.
Usage: grep [OPTION]... PATTERN ...
Try `grep --help' for more information.
Usage: grep [OPTION]... PATTERN ...
Try `grep --help' for more information.
Usage: grep [OPTION]... PATTERN ...
Try `grep --help' for more information.
Usage: grep [OPTION]... PATTERN ...
Try `grep --help' for more information.
Usage: grep [OPTION]... PATTERN ...
Try `grep --help' for more information.
Usage: grep [OPTION]... PATTERN ...
Try `grep --help' for more information.
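
If it's just the hard-coded /sbin/zpool path that no longer exists after the update, a hedged guess at a fix (untested) is to resolve the binary from $PATH instead; the condition line below is only a stand-in for whatever the script actually runs:

Code:

# Untested: replace the hard-coded /sbin/zpool with a runtime lookup
zpool_bin=$(command -v zpool)
if [ -z "$zpool_bin" ]; then
    echo "zpool not found in PATH" >&2   # note: cron runs with a minimal PATH
    exit 1
fi

condition=$("$zpool_bin" status "$pool" | grep -E "(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED)")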
 