FailedOpenDevice SMART Error Troubleshooting

Status
Not open for further replies.

D G

Explorer
Joined
May 16, 2014
Messages
60
Hello,

I just got two emails about the same drive on my server, followed by a warning that my pool is DEGRADED. I am not at home currently, so I can't check much, but I am wondering what my troubleshooting procedure should be when I get home tonight. Here is what the emails said (they are identical and were sent in the same minute):

Code:
This message was generated by the smartd daemon running on:

    host name:  freenas
    DNS domain: local

The following warning/error was logged by the smartd daemon:

Device: /dev/ada0, unable to open device

Device info:
WDC WD30EFRX-68EUZN0, S/N:_____________, WWN:5-0014ee-003cc9e65, FW:80.00A80, 3.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.


(I removed the serial number)

The details of my build are in my sig, but I am running 9.2.1.7 and the only strange thing to happen lately is that on Saturday it randomly rebooted during the middle of the day.

I have not opened the case in quite a while, so I can't imagine that a cable spontaneously came loose...

I am running the schedule suggested by cyberjock on scrubs and SMART tests, and I haven't received any notifications of any bad behavior prior to this.
 
Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Looks like a disk dropped. Post the SMART results when you get home. Maybe the disk died. I'm on mobile, so I can't see your hardware.

Sent from my SGH-I257M using Tapatalk 2
 

D G

Explorer
Joined
May 16, 2014
Messages
60
Here are the parts from my sig:

Build: FreeNAS-9.2.1.7-RELEASE-x64
MB: Supermicro X10SEA-O LGA 1150 Motherboard
CPU: Intel Pentium G3420 3.2 GHz
RAM: 16 GB (2 x 8GB) Kingston ValueRAM 1333 MHz DDR3 PC3-10666 ECC
HDDs: 6 x 3TB WD Red drives (in RAIDZ2)
Boot Drive: SanDisk Cruzer Fit 16GB USB 2.0
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
When you get home, see if you can run a SMART test on that disk. If it's still down, I'd think that disk has randomly died. A reboot might bring it back up, but I'd be wary of using it. How long has it been running without issue?

Did you pretest the disks before putting them in the NAS? Please post the results of the SMART query and tests on that specific disk. With RAIDZ2 we can breathe easier, though.

 

D G

Explorer
Joined
May 16, 2014
Messages
60
The disk was new as of May when I built the server. I have already had one disk in the batch replaced (WD's RMA procedure was actually pretty painless, especially for their Red drives) because a SMART test found a bad sector a month or so later, but they've all been running smoothly since then.

I will definitely try to run a SMART test when I get home, as well as check the results of the latest run since they run regularly. If it's not connected, I can't imagine that I can though. Should I try a reboot, or is there any risk in that?
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
With RAIDZ2 there shouldn't be any risk in a reboot, as the pool is already in a degraded state. I'm curious whether the drive is still down or not. I think I had that pop up once before as well, but I was doing something in the web GUI which caused a hang. Hopefully it's just a one-time thing.

 

D G

Explorer
Joined
May 16, 2014
Messages
60
I won't be home to check on it for another 4 hours or so, but I will certainly report what I find once I'm able to log in.

So if it reboots and the drive is accessible, the pool is no longer degraded, and the SMART check returns a report of good health, what could have caused it to randomly become unavailable?
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
That's hard to say as you weren't doing anything on the system.. I believe I've seen it once but it was my own doing.. Hopefully just a random glitch.. I'd be watching that drive closely though..

 

D G

Explorer
Joined
May 16, 2014
Messages
60
Alright, I am home and here is what I'm getting:

Code:
[root@freenas] ~# zpool status
  pool: Prometheus
state: DEGRADED
status: One or more devices has been removed by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: scrub repaired 0 in 11h50m with 0 errors on Mon Sep  1 15:50:18 2014
config:

    NAME                                            STATE     READ WRITE CKSUM
    Prometheus                                      DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/1e034043-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        2439661819405573937                         REMOVED      0     0     0  was /dev/gptid/1e5432b0-dd70-11e3-9986-00259086bb62
        gptid/e3ea2fe7-f28f-11e3-9160-00259086c7b6  ONLINE       0     0     0
        gptid/1ef9fa71-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        gptid/1f4f07be-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        gptid/1fa0a862-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0

errors: No known data errors
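
To spot the missing member without reading the whole table, the config section of the output above can be filtered for anything that is not ONLINE. This is only a minimal Python sketch, assuming the text layout shown above; `non_online_devices` and `KNOWN_STATES` are hypothetical names for illustration, not part of ZFS or FreeNAS.

```python
# Hypothetical helper: list every row in the `zpool status` config table
# whose STATE column is something other than ONLINE.
KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}


def non_online_devices(status_text: str) -> list[tuple[str, str]]:
    """Return (name, state) pairs for config rows that are not ONLINE."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":        # header row starts the config table
            in_config = True
            continue
        if fields[0] == "errors:":     # the table ends before the errors line
            in_config = False
        if (in_config and len(fields) >= 2
                and fields[1] in KNOWN_STATES and fields[1] != "ONLINE"):
            found.append((fields[0], fields[1]))
    return found


sample = """\
    NAME                                            STATE     READ WRITE CKSUM
    Prometheus                                      DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/1e034043-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        2439661819405573937                         REMOVED      0     0     0  was /dev/gptid/1e5432b0-dd70-11e3-9986-00259086bb62
errors: No known data errors
"""

for name, state in non_online_devices(sample):
    print(name, state)
```

The pool and the raidz2 vdev show up as DEGRADED only because one member is REMOVED; the REMOVED row is the one to chase.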


Code:
[root@freenas] ~# smartctl -a /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/ada0: No such file or directory
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

[root@freenas] ~# 


So, I rebooted, and everything is reporting healthy:

Code:
[root@freenas] ~# zpool status
  pool: Prometheus
state: ONLINE
  scan: resilvered 102M in 0h0m with 0 errors on Mon Sep  8 19:12:45 2014
config:

    NAME                                            STATE     READ WRITE CKSUM
    Prometheus                                      ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/1e034043-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        gptid/1e5432b0-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        gptid/e3ea2fe7-f28f-11e3-9160-00259086c7b6  ONLINE       0     0     0
        gptid/1ef9fa71-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        gptid/1f4f07be-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0
        gptid/1fa0a862-dd70-11e3-9986-00259086bb62  ONLINE       0     0     0

errors: No known data errors
[root@freenas] ~#


Code:
[root@freenas] ~# smartctl -a /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    
LU WWN Device Id: 5 0014ee 003cc9e65
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep  8 19:19:45 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (37980) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 381) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       20
  3 Spin_Up_Time            0x0027   192   171   021    Pre-fail  Always       -       5375
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2688
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   119   111   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       71

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2680         -
# 2  Short offline       Completed without error       00%      2600         -
# 3  Short offline       Completed without error       00%      2362         -
# 4  Extended offline    Completed without error       00%      2274         -
# 5  Short offline       Completed without error       00%      2194         -
# 6  Short offline       Completed without error       00%      2026         -
# 7  Extended offline    Completed without error       00%      1938         -
# 8  Short offline       Completed without error       00%      1859         -
# 9  Short offline       Completed without error       00%      1619         -
#10  Extended offline    Completed without error       00%      1531         -
#11  Short offline       Completed without error       00%      1451         -
#12  Short offline       Completed without error       00%      1283         -
#13  Extended offline    Completed without error       00%      1196         -
#14  Short offline       Completed without error       00%      1115         -
#15  Short offline       Completed without error       00%       899         -
#16  Extended offline    Completed without error       00%       811         -
#17  Short offline       Completed without error       00%       731         -
#18  Extended offline    Completed without error       00%       536         -
#19  Short offline       Completed without error       00%       461         -
#20  Short offline       Completed without error       00%       222         -
#21  Extended offline    Completed without error       00%       135         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


The only difference I see between this drive and the other 5 in my array is that its Raw_Read_Error_Rate raw value is 20, while on all the others it is still 0. It says the threshold for this attribute is 51, so should I not be concerned?
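
For reference, smartctl compares the THRESH column against the normalized VALUE column (200 here), not against RAW_VALUE, so a raw count of 20 is not "20 out of 51". A minimal Python sketch of that comparison, assuming the attribute-table layout shown above; `SmartAttribute`, `parse_attribute`, and `is_failing` are hypothetical names, not part of smartmontools.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalized value reported by the drive
    worst: int    # lowest normalized value seen so far
    thresh: int   # failure threshold for the normalized value
    raw: str      # vendor-specific raw value, kept as text


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one row of the attribute table (column order assumed from the output above)."""
    fields = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=" ".join(fields[9:]),
    )


def is_failing(attr: SmartAttribute) -> bool:
    # An attribute is flagged when the normalized VALUE is at or below THRESH.
    return attr.value <= attr.thresh


row = "  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       20"
attr = parse_attribute(row)
print(attr.name, "value:", attr.value, "thresh:", attr.thresh,
      "raw:", attr.raw, "failing:", is_failing(attr))
```

A normalized value of 200 against a threshold of 51 is not flagged, but a raw count that is nonzero on only one of six drives is still worth tracking over time.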

I am running new SMART tests now.

Edit: Short test completed with no errors.
Long test is going, but will take about 6 hours to complete...
 
Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
That's odd. Were you in the middle of replacing a drive when you encountered that error? Seems like everything is OK for the most part. The raw read value shouldn't be a major concern, but I'd probably keep an eye on that drive, especially if it's the replacement. You could try running the WD tools from USB or something as a second check. If it passes, it's probably fine and won't be accepted for RMA unless it has other problems.

 
D G

Explorer
Joined
May 16, 2014
Messages
60
That's what is so weird, I was at work, so nothing was going on at all. I just randomly got the email alerts in the middle of the day.
I'm not familiar with the WD tools, but I'll look into that.
 

D G

Explorer
Joined
May 16, 2014
Messages
60
Well, the long test ended early, due to a SMART failure, so it looks like I'll be calling WD tomorrow to RMA a second drive. Maybe I got a bad batch. Here is the smartctl -a output for the drive (it's the same kind of error that warranted an RMA on a different drive back in June):

Code:
[root@freenas] ~# smartctl -a /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    
LU WWN Device Id: 5 0014ee 003cc9e65
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep  8 23:58:10 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 115)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline
data collection:         (37980) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 381) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       21
  3 Spin_Up_Time            0x0027   192   171   021    Pre-fail  Always       -       5375
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2693
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   120   111   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       45

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       30%      2692         3585463128
# 2  Short offline       Completed without error       00%      2689         -
# 3  Extended offline    Completed without error       00%      2680         -
# 4  Short offline       Completed without error       00%      2600         -
# 5  Short offline       Completed without error       00%      2362         -
# 6  Extended offline    Completed without error       00%      2274         -
# 7  Short offline       Completed without error       00%      2194         -
# 8  Short offline       Completed without error       00%      2026         -
# 9  Extended offline    Completed without error       00%      1938         -
#10  Short offline       Completed without error       00%      1859         -
#11  Short offline       Completed without error       00%      1619         -
#12  Extended offline    Completed without error       00%      1531         -
#13  Short offline       Completed without error       00%      1451         -
#14  Short offline       Completed without error       00%      1283         -
#15  Extended offline    Completed without error       00%      1196         -
#16  Short offline       Completed without error       00%      1115         -
#17  Short offline       Completed without error       00%       899         -
#18  Extended offline    Completed without error       00%       811         -
#19  Short offline       Completed without error       00%       731         -
#20  Extended offline    Completed without error       00%       536         -
#21  Short offline       Completed without error       00%       461         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
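
The failing entry can also be pulled out of the self-test log programmatically. A minimal Python sketch, assuming the log layout printed above; `ROW` and `failed_self_tests` are hypothetical names for illustration. The byte offset assumes the reported LBA counts 512-byte logical sectors, as the "Sector Sizes" line suggests.

```python
import re

# Matches rows such as:
# "# 1  Extended offline    Completed: read failure       30%      2692         3585463128"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_self_tests(log_text: str, sector_bytes: int = 512) -> list[dict]:
    """Return one dict per self-test row that did not complete without error."""
    failures = []
    for line in log_text.splitlines():
        m = ROW.match(line.strip())
        if not m or "without error" in m["status"]:
            continue
        # The LBA column is "-" when no failing sector was recorded.
        lba = int(m["lba"]) if m["lba"].isdigit() else None
        failures.append({
            "num": int(m["num"]),
            "test": m["desc"],
            "status": m["status"],
            "power_on_hours": int(m["hours"]),
            "lba": lba,
            "byte_offset": None if lba is None else lba * sector_bytes,
        })
    return failures


sample_log = """\
# 1  Extended offline    Completed: read failure       30%      2692         3585463128
# 2  Short offline       Completed without error       00%      2689         -
#10  Short offline       Completed without error       00%      1859         -
"""

for failure in failed_self_tests(sample_log):
    print(failure)
```

Only the first row is reported here, along with the LBA of the first unreadable sector, which is the detail an RMA request is likely to ask about.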
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I like your rotation of short and long tests. It almost sounds like you got it from me. ;)
 

D G

Explorer
Joined
May 16, 2014
Messages
60
Haha indeed Cyberjock. Although I did credit you in my original post!
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Yep, hence the raw read value being up there. Was that your replacement drive or one of the originals?

 

D G

Explorer
Joined
May 16, 2014
Messages
60
It is one of my originals.
 