FAILED SMART self-check, but healthy pool?

Status
Not open for further replies.

KoreanJesus

Dabbler
Joined
Jun 17, 2014
Messages
19
Hi all,

After about a year of operation I've started to get warnings for a disk in my pool, /dev/da1:

Code:
Dec 7 10:43:18 freenas smartd[3531]: Device: /dev/da1 [SAT], 14 Currently unreadable (pending) sectors
Dec 7 10:43:18 freenas smartd[3531]: Device: /dev/da1 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Dec 7 11:13:18 freenas smartd[3531]: Device: /dev/da1 [SAT], FAILED SMART self-check. BACK UP DATA NOW!


dmesg Output:
http://pastebin.com/xC863qt8


SMART results for the drive:

Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p25 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4-GP
Device Model:     WDC WD2002FYPS-02W3B0
Serial Number:    WD-WCAVY5380105
LU WWN Device Id: 5 0014ee 2afa2d392
Firmware Version: 04.01G01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Dec  7 12:26:51 2015 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  73) The previous self-test completed having
                                        a test element that failed and the test
                                        element that failed is not known.
Total time to complete Offline
data collection:                (42900) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 488) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   231   230   021    Pre-fail  Always       -       10408
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   122   122   140    Pre-fail  Always   FAILING_NOW 619
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10363
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   124   097   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       496
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       14
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       22

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed: unknown failure    90%     10354         -
# 2  Conveyance offline  Completed without error       00%     10194         -
# 3  Conveyance offline  Completed without error       00%     10030         -
# 4  Conveyance offline  Completed without error       00%      9865         -
# 5  Conveyance offline  Completed without error       00%      9697         -
# 6  Conveyance offline  Completed without error       00%      9531         -
# 7  Extended offline    Completed: read failure       80%      9496         893430786
# 8  Conveyance offline  Completed without error       00%      9366         -
# 9  Conveyance offline  Completed without error       00%      9198         -
#10  Conveyance offline  Completed without error       00%      9033         -
#11  Conveyance offline  Completed without error       00%      8866         -
#12  Conveyance offline  Completed without error       00%      8703         -
#13  Conveyance offline  Completed without error       00%      8536         -
#14  Conveyance offline  Completed without error       00%      8369         -
#15  Conveyance offline  Completed without error       00%      8202         -
#16  Conveyance offline  Completed without error       00%      8034         -
#17  Conveyance offline  Completed without error       00%      7867         -
#18  Conveyance offline  Completed without error       00%      7699         -
#19  Conveyance offline  Completed without error       00%      7532         -
#20  Conveyance offline  Completed without error       00%      7365         -
#21  Conveyance offline  Completed without error       00%      7197         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


My system specs are as follows:
FreeNAS Build FreeNAS-9.3-STABLE-201509022158 x64

CPU: Intel Xeon CPU E3-1220 v3
RAM: Crucial CT2KIT102472BD160B 16GB 2x8GB
MOBO: Supermicro MBD-X10SL7-F-O
HDD:
6x WD RE4-GP WD2002FYPS 2TB
4x WD RED  WDC WD20EFRX 2TB


I understand that the drive is failing, but what I don't get is why the pool is reporting as good even though a scrub was run.

Code:
  pool: all
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 3.39M in 16h41m with 0 errors on Sun Dec  6 16:41:40 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        all                                             ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/b0b73ff1-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b0f7f1ee-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b13bf6a4-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b18165bb-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b239ee6e-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b2bec352-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b32a9f00-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b38635c9-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b3e0eb7a-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0
            gptid/b43dad48-4720-11e4-90e3-001b217bf150  ONLINE       0     0     0

errors: No known data errors



Thanks for your time. If you need any more info, just let me know.
-KJ
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yes, the drive is failing. The SMART data shows that you don't have a sensible self-test schedule set up either--the last long self-test was 1000 hours ago, and even that failed. There's really no need for recurring conveyance tests. RMA the drive, if it's still in warranty, or just buy a replacement if it isn't, and replace the drive following the manual's instructions. Why isn't the pool showing an error? Perhaps there's no data stored on one of the failed blocks yet.
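
Before you pull anything, it's worth confirming which physical disk da1 actually is. Something like this from the FreeNAS shell should do it (da1 is just the device name from your output; double-check on your own system):

Code:
# print the serial number of the suspect disk so you can match it to the label on the drive
smartctl -i /dev/da1 | grep -i serial

# map gptid labels to device names, to confirm which pool member lives on da1
glabel status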
 

rsquared

Explorer
Joined
Nov 17, 2015
Messages
81
I was about to disagree with dan about the test schedule, but it looks like he just missed a zero... If you use the default schedule from FreeNAS, it will run a long test every 1008 hours, but yours hasn't run in nearly 10,000.

On another note, your drive temp is good currently, but the high temp is 55 C, which could have contributed to an early death (power on hours shows about 14 months). If you have other drives that hit high temps for any period of time, I'd start monitoring those closely, as they are more likely to fail early too.
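
If you want a quick way to spot-check temps across all your drives, something along these lines works from the shell (this assumes the attribute is reported as Temperature_Celsius with the raw value in the last column, as in your output, and that all your data disks show up as da0-da9):

Code:
# print the current raw temperature for each da device
for d in /dev/da?; do
  echo -n "$d: "
  smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10 " C"}'
done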

Sent from my Nexus 6 using Tapatalk
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
If you use the default schedule from FreeNAS, it will run a long test every 1008 hours, but yours hasn't run in nearly 10,000.
A long test was run (and failed) at 9496 hours; the drive has 10,363 hours of run time. To be precise, that's 867 hours ago. Prior to that, the test log goes back 2300 hours with no other long tests having been run. Is there now a default SMART test schedule with FreeNAS? That's news to me, if so (though it would be a good thing IMO). Every 1008 hours would be every 42 days, which corresponds to the default scrub schedule, but SMART tests and scrubs are completely different things.

It looks like there are conveyance tests running about every 165-170 hours, which would be about weekly. The numbers aren't perfectly uniform, which I'd expect them to be if they were on a schedule, but they're uniform enough that I doubt they're being done manually. I don't know of any reason that they'd be part of a default recurring testing schedule.
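
For reference, the intervals I'm reading those from are just the self-test history you posted; you can dump that table on its own with:

Code:
# self-test history only (the same table quoted above)
smartctl -l selftest /dev/da1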
 

KoreanJesus

Dabbler
Joined
Jun 17, 2014
Messages
19
I was about to disagree with dan about the test schedule, but it looks like he just missed a zero... If you use the default schedule from FreeNAS, it will run a long test every 1008 hours, but yours hasn't run in nearly 10,000.

On another note, your drive temp is good currently, but the high temp is 55 C, which could have contributed to an early death (power on hours shows about 14 months). If you have other drives that hit high temps for any period of time, I'd start monitoring those closely, as they are more likely to fail early too.

Sent from my Nexus 6 using Tapatalk
Thanks, yeah, looks like the long test schedule wasn't set up correctly. As for the temps, that high temp was from a good while ago; the general operating temp I'm running at is between 25 and 40 C, with an email alert triggering if it goes above 40.

Thanks for your input
 

rsquared

Explorer
Joined
Nov 17, 2015
Messages
81
Too early, not enough coffee... Yeah, I guess I was thinking of the scrub schedule.
 

KoreanJesus

Dabbler
Joined
Jun 17, 2014
Messages
19
A long test was run (and failed) at 9496 hours; the drive has 10,363 hours of run time. To be precise, that's 867 hours ago. Prior to that, the test log goes back 2300 hours with no other long tests having been run. Is there now a default SMART test schedule with FreeNAS? That's news to me, if so (though it would be a good thing IMO). Every 1008 hours would be every 42 days, which corresponds to the default scrub schedule, but SMART tests and scrubs are completely different things.

It looks like there are conveyance tests running about every 165-170 hours, which would be about weekly. The numbers aren't perfectly uniform, which I'd expect them to be if they were on a schedule, but they're uniform enough that I doubt they're being done manually. I don't know of any reason that they'd be part of a default recurring testing schedule.

Yeah, I'm running conveyance tests every week; not sure why the times are off, as it's fully automated.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Why are you running conveyance tests every week? The conveyance test is "Intended as a quick test to identify damage incurred during transporting of the device from the drive manufacturer to the computer manufacturer." Since your server is (hopefully) in a fixed location and not being transported from place to place frequently, this test isn't appropriate for your use case. Short tests should be run every few days (no less often than once a week), and long tests no less frequently than once a month. There's no reason to schedule recurring conveyance tests.
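
On FreeNAS you'd normally set this up in the GUI under Tasks -> S.M.A.R.T. Tests rather than editing config files, but for reference, a plain smartd schedule along those lines would look roughly like this (device name and times are just examples, and the intervals are adjustable):

Code:
# illustrative smartd.conf directive only -- FreeNAS generates this file for you
# short test every day at 02:00, long test on the 1st of each month at 03:00
/dev/da1 -a -s (S/../.././02|L/../01/./03)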
 

KoreanJesus

Dabbler
Joined
Jun 17, 2014
Messages
19
Why are you running conveyance tests every week? The conveyance test is "Intended as a quick test to identify damage incurred during transporting of the device from the drive manufacturer to the computer manufacturer." Since your server is (hopefully) in a fixed location and not being transported from place to place frequently, this test isn't appropriate for your use case. Short tests should be run every few days (no less often than once a week), and long tests no less frequently than once a month. There's no reason to schedule recurring conveyance tests.
Oh, thanks for the info. I've switched up the schedule and now have a short test every week and a long test once a month. Also, what test would be best to perform on the replacement drive, as I've just swapped out the bad one?

Thanks
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Yes, good point. I thought I'd mentioned that, but obviously not. A newly-installed disk is the primary purpose for the conveyance test.
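
So for the drive you just put in, a conveyance test right after installation followed by a long test is a sensible sequence. Something like this from the shell (daX is a placeholder for whatever device name the new disk came up as):

Code:
# quick check for damage incurred in transit
smartctl -t conveyance /dev/daX

# once that completes (a few minutes), kick off a full surface read
smartctl -t long /dev/daX

# review the outcome of both when they finish
smartctl -l selftest /dev/daX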
 