Yellow alert notice for volume status.

HAL9000 · Feb 24, 2013

Current System setup
MB: P8P67-M
RAM: 4x 4GB DDR3 G.Skill Ripjaws
CPU: G530 Intel Celeron
HDD: 8x 2TB WD Green, 3x 3TB WD Red, 1x 4TB WD Green
HDD Controllers: LSI SAS 9207-8i and MB SATA

Problem: The volume Mango1 (ZFS) status is UNKNOWN:

Mango1 is a Raidz2 with 8x 2TB WD Green.

I ran zpool status -v Mango1 and got the following result.

NAME                                            STATE     READ WRITE CKSUM
Mango1                                          ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     0
    gptid/4dc18a2c-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
    gptid/4e3665c0-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
    gptid/4ee8eba6-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     2
    gptid/4fa92bcf-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
    gptid/5067db40-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
    gptid/50d737db-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
    gptid/513e1a1a-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
    gptid/51f9a6e1-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0

errors: No known data errors

Scrub of Mango1 is set for every 7 days.

Any suggestions on whether this is an issue and what I need to do to correct the problem?

Thank you,

fede2222 · Feb 25, 2013

HAL9000 said:
gptid/4ee8eba6-705c-11e2-97be-f46d04d6885d ONLINE 0 0 2

Scrub of Mango1 is set for every 7 days.

Any suggestions on whether this is an issue and what I need to do to correct the problem?

Thank you,

I assume the yellow mark comes from the "2" in the checksum column for this gptid. A scrub should repair it from the redundant copies. You can wait 7 days for the scheduled scrub or run one manually (personally, I don't wait with these things).

Was a device lost recently? Check dmesg. I had something like this in the past: an HDD disconnected and reconnected itself (shown as a lost device in dmesg) and got out of sync.
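As a rough sketch of those two checks from the shell: the pool name Mango1 is taken from the first post, while the helper names and the grep pattern are illustrative assumptions (the exact wording of a disconnect message in dmesg can vary between FreeBSD versions).

```shell
#!/bin/sh
# Sketch only: scrub_now and find_disk_drops are made-up helper names.

# Start a scrub on the given pool right away, then show its status.
scrub_now() {
    pool="$1"
    zpool scrub "$pool" && zpool status -v "$pool"
}

# Read dmesg-style text on stdin and print lines that suggest a disk
# dropped off the bus. The pattern is an assumption; adjust as needed.
find_disk_drops() {
    grep -i -E 'lost device|detached'
}
```

Typical use would be `scrub_now Mango1` followed later by `dmesg | find_disk_drops`.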

cyberjock · Feb 25, 2013

I'm 99% sure it wasn't the 2 checksum errors. Frankly, I'm kind of at a loss as to why you have the yellow sign. I'd check dmesg and the SMART disk info to see if anything is wrong. Other than that, I'd wait and see if it goes back to green on its own.

HAL9000 · Feb 25, 2013

results from dmesg

Code:

da5: <ATA WDC WD20EARS-22M AB51> Fixed Direct Access SCSI-6 device              
da5: 300.000MB/s transfers                                                      
da5: Command Queueing enabled                                                   
da5: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)                
da6 at mps0 bus 0 scbus0 target 6 lun 0                                         
da6: <ATA WDC WD20EARS-00M AB51> Fixed Direct Access SCSI-6 device              
da6: 300.000MB/s transfers                                                      
da6: Command Queueing enabled                                                   
da6: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)                
da7 at mps0 bus 0 scbus0 target 7 lun 0                                         
da7: <ATA WDC WD20EARS-00S 0A80> Fixed Direct Access SCSI-6 device              
da7: 300.000MB/s transfers                                                      
da7: Command Queueing enabled                                                   
da7: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)                
da8 at umass-sim0 bus 0 scbus9 target 0 lun 0                                   
da8: <Kingston DT Micro PMAP> Removable Direct Access SCSI-4 device             
da8: 40.000MB/s transfers                                                       
da8: 7498MB (15356160 512 byte sectors: 255H 63S/T 955C)                        
SMP: AP CPU #1 Launched!                                                        
GEOM: da8s1: geometry does not match label (16h,63s != 255h,63s).               
Trying to mount root from ufs:/dev/ufs/FreeNASs1a                               
ZFS filesystem version 5                                                        
ZFS storage pool version 28                                                     
[root@freenas ~]#

The upper half is missing because of the shell window size. Is there a way to export the dmesg results to a log?

This is going to sound stupid, but how do I check the smart disk info?

On the bright side, the light is back to green again after a reboot and a manual scrub. I will post again if this happens again.

Thanks for the advice and comments.

ProtoSD · Feb 25, 2013

HAL9000 said:
The upper half is missing because of the shell window size. Is there a way to export the dmesg results to a log?

dmesg > path-to-your-folder/name-of-logfile.txt

HAL9000 said:
This is going to sound stupid, but how do I check the smart disk info?

smartctl -a /dev/daX   (substitute your disk's device name, e.g. da5 from your dmesg output)

or google "freebsd man smartctl"
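Since `zpool status` lists the pool members by gptid rather than by device name, one possible way to get from a gptid to the disk that `smartctl` needs is `glabel status`. A minimal sketch, assuming GPT partitions named like da2p2; the helper names are hypothetical:

```shell
#!/bin/sh
# Sketch only: gptid_to_disk and smart_info_for are hypothetical helpers.

# Print the disk (e.g. da2) behind a gptid label as shown by `zpool status`.
# Assumes `glabel status` prints: Name  Status  Components
gptid_to_disk() {
    glabel status | awk -v id="$1" '$1 == id { print $3 }' | sed 's/p[0-9][0-9]*$//'
}

# Print the full SMART report for the disk behind a gptid label.
smart_info_for() {
    disk=$(gptid_to_disk "$1")
    [ -n "$disk" ] || { echo "no disk found for $1" >&2; return 1; }
    smartctl -a "/dev/$disk"
}
```

For example, `smart_info_for gptid/4ee8eba6-705c-11e2-97be-f46d04d6885d > smart.txt` would save the report for the drive with the checksum errors to a file.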

gpsguy · Feb 25, 2013

SSH to the box and capture the output.

HAL9000 said:
Is there a way to export the dmesg results to a log?

I use PuTTY on my PC. You can change the logging settings so that it captures, say, "printable output" to a file. Or use "copy all to clipboard" from the session window and paste it into another file.

HAL9000 · Feb 26, 2013

Thanks, ProtoSD and gpsguy.

The yellow light came back this morning and I now know why. I set up the system to send status emails and got the following result.

Code:

Disk status:
Filesystem             Size    Used   Avail Capacity  Mounted on
/dev/ufs/FreeNASs1a    926M    383M    468M    45%    /
devfs                  1.0k    1.0k      0B   100%    /dev
/dev/md0               4.6M    3.2M    977k    77%    /etc
/dev/md1               823k    2.5k    755k     0%    /mnt
/dev/md2               149M     10M    126M     8%    /var
/dev/ufs/FreeNASs4      19M    1.1M     17M     6%    /data
Kiwi2                   10T    4.3T    6.5T    40%    /mnt/Kiwi2
Mango1                  10T      3T    7.2T    29%    /mnt/Mango1

Last dump(s) done (Dump '>' file systems):

Checking status of zfs pools:
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
Kiwi2   10.9T  4.25T  6.63T    39%  1.00x  ONLINE  /mnt
Mango1  14.5T  4.14T  10.4T    28%  1.00x  ONLINE  /mnt

  pool: Mango1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Feb 24 10:05:02 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        Mango1                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/4dc18a2c-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
            gptid/4e3665c0-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
            gptid/4ee8eba6-705c-11e2-97be-f46d04d6885d  ONLINE       0     0   139
            gptid/4fa92bcf-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
            gptid/5067db40-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
            gptid/50d737db-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
            gptid/513e1a1a-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0
            gptid/51f9a6e1-705c-11e2-97be-f46d04d6885d  ONLINE       0     0     0

errors: No known data errors

Checking status of ATA raid partitions:

Checking status of gmirror(8) devices:

Checking status of graid3(8) devices:

Checking status of gstripe(8) devices:

Security check:
    (output mailed separately)

Checking status of 3ware RAID controllers:
Alarms (most recent first):
  No new alarms.

Looks like one of my HDDs has a checksum problem. Time to start troubleshooting. I am going to try the following unless anyone has a better suggestion.
1) Check cables.
2) Offline the drive, format it, put it back online and resilver.
3) Run a SMART test if the yellow light comes back.
4) Replace the drive if it fails the SMART test. Check the warranty on the failed drive for RMA potential.

As always, thanks for taking a look.
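To pick out the affected device from a long status report like the one above, a small filter can help. This is only a sketch; `nonzero_errors` is a made-up helper name, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout.

```shell
#!/bin/sh
# Sketch only: nonzero_errors is a hypothetical helper name.

# Read `zpool status` text on stdin and print every pool, vdev or device
# line whose READ, WRITE or CKSUM counter is not zero.
nonzero_errors() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ && NF >= 5 &&
         ($3 != "0" || $4 != "0" || $5 != "0") {
             print $1, "read=" $3, "write=" $4, "cksum=" $5
         }'
}
```

Usage would be `zpool status Mango1 | nonzero_errors`. Once the cause is fixed, `zpool clear Mango1` resets the counters, as the status message itself suggests.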

cyberjock · Feb 26, 2013

I've seen that when you get a bunch of checksum errors, that's usually an indicator of a drive that became detached and reattached to the system. I think your list is sufficient to find the problem. If you have a spare cable, I'd replace the one on that bad drive just because you can. It's not like SATA cables are expensive.

But to clarify, I'd run a short SMART test, then a long one, before you resilver. I'd also think that if you can complete the resilver without any issues, then it's probably not the drive itself but something else. Resilvering and scrubbing put a serious load on the drives. If they're flaky, that's when they'd be likely to fail.

Another idea: if that drive is on the LSI controller, you could try moving it to the motherboard SATA.
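The short-then-long sequence can be started from the shell with `smartctl`. A minimal sketch; the wrapper names are invented for illustration and the device path you pass in (for example /dev/da2) is a placeholder for the real disk:

```shell
#!/bin/sh
# Sketch only: the wrapper names are hypothetical; the smartctl options
# (-t short, -t long, -l selftest) are standard smartmontools options.

# Start a short self-test on the given device, e.g. /dev/da2.
start_short_test() { smartctl -t short "$1"; }

# Start a long (extended) self-test on the given device.
start_long_test() { smartctl -t long "$1"; }

# Show the self-test log once a test has had time to finish.
show_selftest_log() { smartctl -l selftest "$1"; }
```

The tests run in the background on the drive itself, so wait for the short test to finish before starting the long one, then read the results with `show_selftest_log`.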

HAL9000 · Feb 27, 2013

Thanks, cyberjock. I moved the drive to a motherboard SATA port and used a new cable. I'm not looking forward to replacing the old cable, as it is a SAS to 4x SATA breakout cable, and I hope the problem is not the LSI controller, though it is new and under warranty. I did not get the chance to run the short and long SMART tests after taking the drive offline. I wiped the drive and used the Replace command. I got a message about needing to detach, and then suddenly it was resilvering. After a couple of hours the status light was back to green.

I still intend to run the short and long SMART tests to check the drive, and I will give it a week to see whether it detaches itself again. After that I will switch back to the SAS cable and watch for errors again. In the interim, time to order a couple of spare SAS cables.

SMART Short test results.

Code:


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(39660) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 382) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0027   192   167   021    Pre-fail  Always       -       5400
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1578
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15610
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       333
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       318
193 Load_Cycle_Count        0x0032   186   186   000    Old_age   Always       -       42779
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   186   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15610         -
# 2  Extended offline    Completed without error       00%     15563         -
# 3  Extended offline    Completed without error       00%     15351         -
# 4  Extended offline    Completed without error       00%     15330         -
# 5  Extended offline    Completed without error       00%     15303         -
# 6  Extended offline    Completed without error       00%     15278         -
# 7  Extended offline    Completed without error       00%     15254         -
# 8  Extended offline    Completed without error       00%     15230         -
# 9  Extended offline    Completed without error       00%     15209         -
#10  Extended offline    Completed without error       00%     15185         -
#11  Extended offline    Completed without error       00%     15148         -
#12  Extended offline    Completed without error       00%      5830         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

HAL9000 · Feb 27, 2013

Just noticed this in the FreeNAS manual:

NOTE: to prevent problems, do not enable the S.M.A.R.T. service if your disks are controlled by a RAID controller as it is the job of the controller to monitor S.M.A.R.T. and mark drives as Predictive Failure when they trip.

I had set up a long SMART test for all drives once a month. If this was the cause it should have triggered more than one drive to detach, no?

HAL9000 · Feb 28, 2013

SMART long test results

Code:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(39660) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 382) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0027   192   167   021    Pre-fail  Always       -       5400
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1578
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15621
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       333
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       318
193 Load_Cycle_Count        0x0032   186   186   000    Old_age   Always       -       42780
194 Temperature_Celsius     0x0022   117   105   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   186   000    Old_age   Offline      -       3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     15617         -
# 2  Short offline       Completed without error       00%     15610         -
# 3  Extended offline    Completed without error       00%     15563         -
# 4  Extended offline    Completed without error       00%     15351         -
# 5  Extended offline    Completed without error       00%     15330         -
# 6  Extended offline    Completed without error       00%     15303         -
# 7  Extended offline    Completed without error       00%     15278         -
# 8  Extended offline    Completed without error       00%     15254         -
# 9  Extended offline    Completed without error       00%     15230         -
#10  Extended offline    Completed without error       00%     15209         -
#11  Extended offline    Completed without error       00%     15185         -
#12  Extended offline    Completed without error       00%     15148         -
#13  Extended offline    Completed without error       00%      5830         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
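Reports this long are easier to scan with a filter on the few attributes that usually matter for a suspect disk or cable. The sketch below assumes the ten-column attribute table shown above; `watch_attrs` is a made-up name, and the choice of IDs (5, 197, 198 and 199) is a common rule of thumb rather than anything specific to this thread.

```shell
#!/bin/sh
# Sketch only: watch_attrs is a hypothetical helper name.

# Read `smartctl -a` text on stdin and print the reallocated, pending,
# uncorrectable and CRC-error attributes whose RAW_VALUE is not zero.
# NF >= 10 skips other lines that happen to start with the same numbers.
watch_attrs() {
    awk 'NF >= 10 && $1 ~ /^(5|197|198|199)$/ && $10 != "0" {
             print $2, "raw=" $10
         }'
}
```

Piping either report above through `watch_attrs` should print nothing, since all four raw values are 0 in both, which fits a cabling or connection problem better than a failing disk.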

HAL9000 · Mar 5, 2013

Well, I went back to the SAS cable and got this warning in the 3ware RAID controller section of the status email. The message and timestamp are identical to when the one HDD started disconnecting and reconnecting. So far the pool is still healthy.

Checking status of 3ware RAID controllers:
Alarms (most recent first):
+++ /var/log/3ware_raid_alarms.today 2013-02-24 03:01:01.000000000 -0500
@@ -0,0 +1 @@
+

Any chance this is due to a setting in FreeNAS instead of the SAS cable? Or worse, a faulty LSI RAID controller?
Thanks in advance.

HAL9000 · Mar 5, 2013

Never mind. False alarm. I should have googled it first.
According to these posts, there is no actual alarm, as my log is empty as well.
http://forums.freenas.org/archive/index.php/t-8490.html
http://forums.freenas.org/archive/index.php/t-1083.html
