Identify a failed drive.

Status
Not open for further replies.

kspare

Guru
Joined
Feb 19, 2015
Messages
508
I have a Supermicro storage chassis with 24 slots.


Slot 18 should be da18, as labeled on the chassis, but it actually comes up as da23.

So in the event of a drive failure, how do you know which drive to pull and replace?

There doesn't seem to be a good way to positively identify the drive with a light, the way a RAID controller does.

All we have come up with is to remove or offline the drive and then put some I/O load on the array, so you can see the single drive that isn't doing anything.

I tried the sas3ircu tool for our card, but it doesn't detect a controller in the server.

Looking for ideas. What do others use?
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
As in a label on the front of the drive tray, or in FreeNAS?
 

ser_rhaegar

Patron
Joined
Feb 2, 2014
Messages
358
Front of the drive tray. Then you can use FreeNAS to look up which drive has failed and its serial number, then match that to the labels on the front of the drive trays.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
After a reboot, it seems like all the drives match now. Frustrating. I'm still going to label the drives; it just seems to make things easier.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
...or make up a chart correlating ada*/da* with disk serial numbers, and/or disk serial numbers with physical locations in your server.
 

ser_rhaegar

Patron
Joined
Feb 2, 2014
Messages
358
da## can change on boot. FreeNAS identifies pool members by the gptid label on each drive, not by da## (you can see this in the output of zpool status). I'd go purely by serial number: look up the serial of the failed drive in FreeNAS and match it to the label on the physical drive or on the disk tray.
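
For illustration, the gptid-to-serial lookup can be sketched as a small shell script. This is a sketch under assumptions, not what FreeNAS does internally: it assumes `glabel status` and `smartctl` (from smartmontools) are available on the host, and the function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of FreeNAS.

# Read `glabel status` output on stdin and print "<gptid> <disk>" pairs,
# stripping the partition suffix (da23p2 -> da23).
gptid_to_disk() {
    awk '$1 ~ /^gptid\// { dev = $3; sub(/p[0-9]+$/, "", dev); print $1, dev }'
}

# Read `smartctl -i` output on stdin and print the serial number.
# SATA drives report "Serial Number:", SAS drives "Serial number:".
serial_from_smartctl() {
    awk -F': *' 'tolower($1) == "serial number" { print $2 }'
}

# Print one "gptid <TAB> disk <TAB> serial" row per gptid-labelled partition.
# Needs root on the storage host.
print_disk_table() {
    glabel status | gptid_to_disk | while read -r gptid disk; do
        serial=$(smartctl -i "/dev/$disk" | serial_from_smartctl)
        printf '%s\t%s\t%s\n' "$gptid" "$disk" "$serial"
    done
}
```

Running `print_disk_table` after each reboot shows which da## currently sits behind each gptid, while the gptid and serial columns stay the same.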
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Loading up my label maker...
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Actually I made a script to do the table for you (GPTID, label, serial), look at the "Useful Scripts" link in my sig if you want ;)
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
I do use that script! But I had a drive mismatched after a reboot. It's just a little worrisome that I might pull the wrong drive. The idea to label them works for me!
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Never use the device name to identify a drive ("da1" or "ada1", for example), because it can change from reboot to reboot. Use only the GPTID (which ZFS uses to identify each drive; see the output of zpool status, for example) and the serial number (it's a good idea to put it somewhere on the drive where you can easily see it without unplugging the drive) ;)
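
As a rough sketch of how the GPTID of a troubled drive could be pulled out of `zpool status`, the filter below prints every gptid-labelled pool member that is not ONLINE. The function name is hypothetical. It also assumes the drive still appears under its gptid: a disk that has vanished entirely may be listed by a numeric GUID with a "was /dev/gptid/..." note, which this simple filter does not catch.

```shell
#!/bin/sh
# Sketch only: the function name is hypothetical.

# Read `zpool status` output on stdin and print "<gptid> <state>" for every
# gptid-labelled member whose state is not ONLINE.
unhealthy_gptids() {
    awk '$1 ~ /^gptid\// && $2 != "ONLINE" { print $1, $2 }'
}

# Typical use on the server:
#   zpool status | unhealthy_gptids
```

The GPTID it prints can then be matched to a serial number, and the serial number to the sticker on the drive.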
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
I'm using the physical drive serial number to label the drives on the front of the drive sled. We shouldn't be able to screw that up!
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Unless someone swaps the stickers... :D
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
Who would do something so nasty!
 
Joined
Oct 2, 2014
Messages
925
I have a spreadsheet I keep on the server, on my laptop, and on a flash drive. It records which tray each hard drive is in, along with all its info: manufacturer, size, model, serial, purchase date, and warranty expiration date.

If and when a drive fails, I will refer to this spreadsheet, pull the drive, and go from there.
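
If such a spreadsheet is also exported as CSV, the lookup can be done from the shell. The sketch below is only an illustration: the column layout (tray, serial, model), the file name, and the function name are all assumptions, not anything described in this thread.

```shell
#!/bin/sh
# Sketch only: the file layout and function name are assumptions.
# Expected columns: tray,serial,model (the first line is a header).

# Print the tray recorded for a given serial number.
tray_for_serial() {
    serial=$1
    csv=$2
    awk -F, -v s="$serial" 'NR > 1 && $2 == s { print $1 }' "$csv"
}

# Example (hypothetical file name):
#   tray_for_serial WD-TEST123 drives.csv
```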
 

peterh

Patron
Joined
Oct 19, 2011
Messages
315
Another way is to use the /boot/device.hints file to actually number the controllers and drives.

This is a snippet from a FreeBSD system I use:
# added by genhints, paste into /boot/device.hints
# now with pass devices
hint.scbus.0.at="mvsch0"
hint.ada.0.at="scbus0"
hint.pass.0.at="scbus0"
hint.scbus.1.at="mvsch1"
hint.ada.1.at="scbus1"
hint.pass.1.at="scbus1"
hint.scbus.2.at="mvsch2"
hint.ada.2.at="scbus2"
hint.pass.2.at="scbus2"
hint.scbus.3.at="mvsch3"
hint.ada.3.at="scbus3"
hint.pass.3.at="scbus3"
hint.scbus.4.at="mvsch4"
hint.ada.4.at="scbus4"

and so on...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, no, do not do that. This becomes an unmaintainable nightmare. There's a good reason serial numbers are provided in the UI. In a crisis, inadvertently bonking the wrong disk means you are dropping a second disk out of the array. Does that seem like a good idea when your data has already lost some redundancy?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I have a spreadsheet I keep both on the server,
This is what I do, except that the spreadsheet is in Google Docs so I only maintain one copy. It does mean I need to maintain it when I change anything in the system (add/replace disks, move ports on a controller, etc.), but it's not much that needs upkeep.

Don't have gptid in that spreadsheet, though--might be worth adding...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
In theory you can use the enclosure management features to cause the drive to be identified, but this is complicated to set up and often isn't done, or isn't done correctly.
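
As a hedged illustration of what that can look like: on FreeBSD-based systems that ship the `sesutil` utility, and with a backplane that actually exposes SES to the HBA, the locate LED can be toggled from the shell. Whether this works on a given chassis is an assumption to verify on a healthy drive first. The wrapper function and its DRY_RUN switch are made up for this example.

```shell
#!/bin/sh
# Sketch only: locate_led is a hypothetical wrapper around sesutil(8).
# Set DRY_RUN=1 to print the command instead of running it.

# Usage: locate_led <disk> on|off
locate_led() {
    disk=$1
    state=$2
    case "$state" in
        on|off) ;;
        *) echo "usage: locate_led <disk> on|off" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "sesutil locate $disk $state"
    else
        sesutil locate "$disk" "$state"
    fi
}
```

Because da## names can move between reboots, the disk name passed in should be looked up fresh from the gptid or serial number each time.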

As previously noted, marking serial numbers on the sleds is a time-honored solution.

It may also be possible to have someone stand in front of the array and then cause artificial activity. In the event of a drive that has a hard error but is still generally accessible, you do something like "dd if=/dev/daXX of=/dev/null bs=1048576" and watch for the solid activity light. This is vaguely risky because your server might be heavily accessing some other disks at that same moment, so usually you do this yourself in person, or with a NOC monkey on the phone, and you do a "now it's on", "now it's off", "and now it's on again" just to be sure.

You can also do dd's on all the OTHER drives and look for the unlit drive.
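
The on/off routine with dd can be sketched as a small loop. This is an illustration, not a tested procedure: `blink_disk` is a hypothetical helper, the burst size and pause are arbitrary defaults, and although it only reads from the device it does add load to a disk that may already be failing.

```shell
#!/bin/sh
# Sketch only: blink_disk is a hypothetical helper. It only reads from the
# device, but it does add load to a disk that may already be failing.

# Usage: blink_disk <device> [cycles] [MiB per burst] [pause seconds]
blink_disk() {
    dev=$1
    cycles=${2:-3}
    mib=${3:-4096}
    pause=${4:-5}
    i=0
    while [ "$i" -lt "$cycles" ]; do
        echo "now it's on: $dev"
        # Sequential read, discarded; same idea as the dd command above.
        dd if="$dev" of=/dev/null bs=1048576 count="$mib" 2>/dev/null
        echo "now it's off: $dev"
        sleep "$pause"
        i=$((i + 1))
    done
}

# Example: blink_disk /dev/daXX 3
```

Repeating the burst several times with a pause in between makes it easier to tell the test pattern apart from ordinary pool activity on neighboring disks.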
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Don't have gptid in that spreadsheet, though--might be worth adding...

Yep, you should add them. Along with the serial, it's the only thing that doesn't change across reboots ;)
 