How to decode disk serial number

john60

Explorer
Joined
Nov 22, 2021
Messages
85
I unplugged 2 disks and got this error message.
disk not available.png

But these serial numbers do not match the serial numbers when all disks are attached.

Here is the image after I re-plugged in the 2 disks.
my disks.png


Also, an alert appeared after I plugged them back in, saying something did not recover.
what does this mean.png

Also my pools show an alert.
pool alert.png


pool status.png


Question 1: How do I convert the serial number in the disk failure alert to the serial numbers displayed in the disk window?
Question 2: How do I clear the red x in the pool window, or do I still have a problem? Are the checksum errors there because I disconnected the disks, or does it mean something really got messed up when I disconnected my disks for the experiment?
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
In another post, there was a response by joeschmuck, and his signature line linked to hard-drive-troubleshooting-guide-all-versions-of-freenas.
So I am running
smartctl -t long /dev/da1
and
smartctl -t long /dev/da2
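Once the long tests finish (they can take several hours on large drives), my understanding is that the results land in the drive's self-test log, which can be printed with:
smartctl -l selftest /dev/da1
or, for the full report (identity, attributes, and the self-test log):
smartctl -a /dev/da1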

I am confused as to whether there is even an issue. A red x in the Pools tab and a pool status showing checksum error counts of 2 and 1 are the only anomalous indications of something not right. I assume that if I were 1 disk failure away from total data loss there would be more "in your face" alarms.

I would be appreciative of some insight from someone with experience.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
glabel status
will get you a list of drive identifiers and their GPTIDs. From this you can match the affected disks to their serial numbers and find them in your system.
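As a sketch (the device name here is just an example): glabel status pairs each gptid/... label with its daX device, and then
smartctl -i /dev/da1 | grep -i serial
prints that drive's serial number, so you can go GPTID -> device -> serial -> physical disk.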
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
The web GUI reports both SMART tests as successful.
da1 smart test.png

da2 smart test.png


Pool still shows Unhealthy

pool still shows unhealthy.png


At this point, I am not sure whether the Unhealthy x is real. I would appreciate someone with experience commenting.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
In the other thread in which you posted, I suggested that you post here, in code tags, the full output of the long SMART tests that you ran.

Oh - I see I didn't - meant to but didn't. So, please post the full output here in code tags.
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
Tried a reboot and the unhealthy icon disappeared.

reboot and unhealthy gone.png


So my remaining questions are:
Q1: Is this how it is supposed to work, or am I missing something?
Q2: How do I map the serial number in the original alert to the serial number shown for the disk in the web GUI?
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
will get you a list of drive identifiers and their GPTIDs. From this you can match the affected disks to their serial numbers and find them in your system.
Wow, you are a trove of information. Thanks.

glabel.png


These numbers do not match the first alert dialog.
I tried converting 3dd917d2 from hex to decimal, but that does not match either.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
These numbers do not match the first alert dialog.
Yes, I noticed that after I posted - I have no idea what those numbers are that appeared in that first error message.

BTW - are you using TrueNAS Core or Scale - don't think you said and I didn't ask. I know nothing about Scale...

You didn't give us your system info yet - do you have a hot-plug drive arrangement? When you unplugged your drives (as your first post says) were they running or was the server powered down?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I have no idea what those numbers are that appeared in that first error message.
Those are the ZFS partition guids... I think you can look at them with zdb -l /dev/da1p2, but I would not recommend doing that if you didn't already know about it; it's unlikely to be helpful, as it's really only displayed when the disk is offline anyway.

If you run that on a pool member disk (data partition), you see that it has information about the other members of its VDEV and a little about the pool.

It's not really helpful to humans though.
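If you do want to peek, a minimal sketch (da1p2 assumes the data partition is the second partition, the usual TrueNAS layout):
zdb -l /dev/da1p2 | grep -i guid
which should pull out the pool guid plus the member guids - the kind of numbers that alert was showing.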
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Those are the ZFS partition guids...
Thanks for the info - "learning never ends"!

I wonder why the Error Message displays that info instead of the disk guid.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I unplugged 2 disks and got this error message.
First, I hope this was accidental. If not, then stop doing it.

Second, I feel that you should be running a SMART long test on every one of your drives, not just da1 & da2. Then inspect the results for any errors. Read that troubleshooting guide again; I think it's pretty clear on what to do. I doubt you have a drive failure related to it being unplugged, but you could have a drive failure for other reasons.

Lastly, I would run a SCRUB on your pools (you can do this first if you like). I suspect they will repair themselves, since you only removed 2 drives temporarily and you have a RAIDZ2. If you run a scrub and find nothing to repair, then you can clear the error message if you are still receiving it. I would shut down the system, wait a few minutes, then power it back on, run another scrub, and if all is good, then you are done.
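A sketch of those steps from the shell, assuming your pool is named tank (substitute your real pool name):
zpool scrub tank
zpool status -v tank
zpool clear tank
status shows scrub progress and the per-disk read/write/checksum counters; clear resets those counters once you're satisfied nothing is actually broken.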

That is what I'd do. You know why your pool went crazy: you removed two drives with the power on, which is not good, and the system yelled at you for it. If it had been 3 drives, odds are your data would be gone.
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
BTW - are you using TrueNAS Core or Scale - don't think you said and I didn't ask. I know nothing about Scale...
I have both, but this experiment was on core.
When you unplugged your drives (as your first post says) were they running or was the server powered down?
My goal was to prepare for what will happen when a real failure occurs in the future. Specifically, if I lose 2 drives, how do I figure out which drive failed so I can replace the right one?

I powered down the system, pulled the cables from 2 SATA drives, then powered the system back up.
I was hoping to (1) see the serial numbers of the drives I unplugged, and (2) confirm that RAIDZ2 meant the system still worked.
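One thing I plan to do before the next drill is record the device-to-serial mapping while everything is healthy. Something like this should work, assuming a Bourne-style shell (the da[0-7] glob is just an example; adjust it to your disks):
for d in /dev/da[0-7]; do echo $d; smartctl -i $d | grep -i serial; done
Then I can tape the serials to the drive bays so a real failure is less stressful.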
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
First, I hope this was accidental. If not, then stop doing it.

Second, I feel that you should be running a SMART long test on every one of your drives, not just da1 & da2. Then inspect the results for any errors. Read that troubleshooting guide again; I think it's pretty clear on what to do. I doubt you have a drive failure related to it being unplugged, but you could have a drive failure for other reasons.

Lastly, I would run a SCRUB on your pools (you can do this first if you like). I suspect they will repair themselves, since you only removed 2 drives temporarily and you have a RAIDZ2. If you run a scrub and find nothing to repair, then you can clear the error message if you are still receiving it. I would shut down the system, wait a few minutes, then power it back on, run another scrub, and if all is good, then you are done.

That is what I'd do. You know why your pool went crazy: you removed two drives with the power on, which is not good, and the system yelled at you for it. If it had been 3 drives, odds are your data would be gone.
I would like to do a fire drill so that I am ready when there is a real failure.
A real failure will be stressful, and learning while stressed does not work well for me.

I assume it is safe if I power down the system, unplug both the power and SATA cables from 2 drives, then power back up?

I repeated the experiment, but unplugged only 1 drive, and a different drive than before.
It now displays the serial number in the alert, unlike yesterday.
only 1 disk down, shows serial number.png


Now with 2 drives unplugged, again different drives from yesterday, I get serial numbers, unlike yesterday.
2 drives down, but different from yesterday.png



Now I powered down, re-inserted the 2 drives, powered back up, and am running a scrub to see if this will wipe the unhealthy flags from the GUI.
This will take some time, so until later.

scrub.png


Tomorrow I will try the same 2 drives as yesterday.
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
The GUI blanked out, so I logged back in, but the pool window does not show the scrub still running.
Maybe it completed far faster than predicted, so I started another and got this message.

stop the scrub.png


So obviously the scrub is still running.
Was there an indication somewhere in the GUI that I just missed, or is the user expected to try starting it again to see if it is still running?
Where will the scrub results appear?
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
From the Shell, run:
zpool status
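While a scrub is running, the scan: line of that output shows "scrub in progress" with a percent-complete estimate, e.g.:
zpool status | grep -A 2 'scan:'
Once it finishes, the same line reports how much data was repaired and how long it took.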
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Where will the scrub results appear?

Well this is actually a very good question. I'm reasonably positive the GUI used to indicate this somewhere. I just checked one of our filers known for "the long scrubs" and while it is clearly doing a scrub as indicated from the CLI, I am not seeing an indication of this in the GUI. Perhaps I am just not yet fully awake this morning...?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I just checked one of our filers known for "the long scrubs" and while it is clearly doing a scrub as indicated from the CLI, I am not seeing an indication of this in the GUI. Perhaps I am just not yet fully awake this morning...?
It's there if it was the last run activity on that pool...
1641908432701.png
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeesh, that's buried. But you're right, it's there if you dig.
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
It's there if it was the last run activity on that pool...

Yes, I found it now. For me it reports the scrub finished with zero errors.
scrub finished.png


But it is interesting that a critical alarm remains even though the disks were plugged back in
still critical alarm.png


and still the red x
still the x.png


I logged out of the GUI and logged back in but these alarms remain.


Yesterday a restart of TrueNAS Core cleared the red x, so I did a restart again today.
restart like yesterday.png


and now the red x is gone, just like yesterday
now the red x is gone.png


So it would appear that if you power down, remove a disk, power up (and see the alarms), then power down and put the disk back, you will get a red x; a scrub leaves the x in place, and another reboot is required to clear it.

Is this 2 bugs, or does TrueNAS really require a 2nd reboot to clear the 'red (unhealthy) x' in the Pool window while sometimes reporting funky serial numbers?
Recall that yesterday:
- smartctl -t long /dev/da1 failed to clear the 'red (unhealthy) x'.
- a funky serial number was reported for the removed disk.

I need to try unplugging the same 2 disks as yesterday to see if the funky serial numbers appear again.
 