smartd reports offline uncorrectable sectors, but not showing up in smartctl

Status
Not open for further replies.

qwerion

Dabbler
Joined
Jan 30, 2014
Messages
19
Got a lovely email overnight corresponding to the following error in the messages log:
Code:
Nov 18 00:34:46 freenas smartd[2845]: Device: /dev/ada0, 8 Currently unreadable (pending) sectors
Nov 18 00:34:46 freenas smartd[2845]: Device: /dev/ada0, 8 Offline uncorrectable sectors
Nov 18 00:34:46 freenas smartd[2845]: Device: /dev/ada0, 8 Currently unreadable (pending) sectors
Nov 18 00:34:46 freenas smartd[2845]: Device: /dev/ada0, 8 Offline uncorrectable sectors

And a big red critical alert in the GUI.

But...zpool status has everything still online, and I'm not actually seeing this in smartctl?
Code:
$ sudo smartctl -a /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    S301BMGT
LU WWN Device Id: 5 000c50 0807c7b78
Firmware Version: CC54
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 18 11:29:36 2015 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 525) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       35086976
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       6804147
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1035
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       24
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   058   045    Old_age   Always       -       36 (Min/Max 19/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       63
194 Temperature_Celsius     0x0022   036   042   000    Old_age   Always       -       36 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1034h+49m+07.586s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1512806165
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       39894388448

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1035         -
# 2  Short offline       Completed without error       00%       761         -
# 3  Short offline       Completed without error       00%       688         -
# 4  Short offline       Completed without error       00%       520         -
# 5  Short offline       Completed without error       00%       360         -
# 6  Short offline       Completed without error       00%       135         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Also nothing on any of the other drives. So where did this come from? Anything I should do to check in further detail?

EDIT:
FreeNAS-9.3-STABLE-201511040813
Intel E3-1220 v3
Asrock E3C224D2I
16GB ECC
4x Seagate 4TB ST4000DM000 and 2x WD Green 4TB WD40EZRX in a RAIDZ2
 

qwerion

Dabbler
Joined
Jan 30, 2014
Messages
19
8 showed up after running an extended test. So I suppose I should prepare for failure.

I did not change the cable - solarisguy, can you expand upon how you think the cable can impact this?

EDIT: Inserted results after the extended test
Code:
$ sudo smartctl -a /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    S301BMGT
LU WWN Device Id: 5 000c50 0807c7b78
Firmware Version: CC54
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Nov 19 23:41:43 2015 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 119) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 525) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       232044416
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       7079879
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1071
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       24
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   058   045    Old_age   Always       -       37 (Min/Max 19/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       63
194 Temperature_Celsius     0x0022   037   042   000    Old_age   Always       -       37 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1071h+01m+17.245s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1856431941
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       69987166964

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%      1049         2609375600
# 2  Short offline       Completed without error       00%      1035         -
# 3  Short offline       Completed without error       00%       761         -
# 4  Short offline       Completed without error       00%       688         -
# 5  Short offline       Completed without error       00%       520         -
# 6  Short offline       Completed without error       00%       360         -
# 7  Short offline       Completed without error       00%       135         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I suppose I should prepare for failure.
Maybe. At least consider having a burned-in cold spare on hand.

Start with a scrub, then schedule regular extended self-tests, and keep an eye on it. Look at the smartctl output from time to time. A single-digit count of bad sectors isn't a big deal, especially if it's stable, but if the number starts creeping up, I'd replace the drive.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
If there is an offline uncorrectable sector error, and Offline_Uncorrectable count does not increase, then I would exchange the cable. Just get a new and good SATA cable.

Although some of my drives do not lock, I moved to cables that lock (at least they lock in the SATA ports).
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
If there is an offline uncorrectable sector error, and Offline_Uncorrectable count does not increase, then I would exchange the cable.
Can't agree with this--offline uncorrectable sectors simply have nothing to do with the cable, that is all internal to the drive. Sure, you want to make sure you have good cables, and locking cables are good, but I just don't see any way that they're relevant to this issue.
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
Can't agree with this--offline uncorrectable sectors simply have nothing to do with the cable, that is all internal to the drive. Sure, you want to make sure you have good cables, and locking cables are good, but I just don't see any way that they're relevant to this issue.
Hardware is not my forte, and I had always assumed that offline uncorrectable sectors that are internal to the drive increase Offline_Uncorrectable count.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
As I see it, the drive has recorded no actual failures (disregard ID 1; its raw value can be very high on certain drives). I do not know why FreeNAS would report failures for drive ada0 when there are none, unless they were fixed, which would mean your drive is marginal and failure is forthcoming.

Actions I'd take and recommend you do as well...
1) List your hardware and which version of FreeNAS you are running (including the date code). This is basic information we need, and not only to help you: if someone else has the same or a similar issue, we can determine whether the developers need to fix the code.
2) Run a SMART long test, then check the results and report them here. I'd actually run a long test every day for the next 5 days on all the drives. It's not abusive on the drive, but it may weed out a problem. If you are running a RAIDZ1 setup, back up any important data you have while it's available. (See, I have to guess your configuration; it drives me nuts at times, but I want to cover all the bases.)
3) Check all your drives for non-zero values in IDs 1, 5, 197, 198, 200 (multizone), as non-zero values (except in ID 1, depending on the drive) are not good.
4) Set up a nightly short test and a weekly long test for all drives. Remember, a SMART test failure indication gives you at most 24 hours' warning before a catastrophic failure occurs, and that's if you're lucky.
5) If none of this reveals a problem, I'd replace the SATA cable for the hell of it. Although I don't believe it would cause this type of failure, if you keep getting nightly emails and the SMART test results indicate nothing, the SATA cable is the next best thing to try.
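
The check in step 3 can be sketched in a few lines of Python. This is only an illustration, not a FreeNAS or smartmontools feature: the function name nonzero_watched and the WATCHED_IDS set are made up for this example, and it assumes the attribute table has the ten-column layout shown in the smartctl output earlier in the thread. ID 1 is deliberately left out because its raw value is expected to be large on some drives.

```python
# Hypothetical helper: scan smartctl attribute rows for non-zero raw values.
# Assumes the 10-column table layout seen in this thread's smartctl output.
WATCHED_IDS = {5, 197, 198, 200}  # ID 1 omitted: often large on healthy drives


def nonzero_watched(smartctl_output, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched IDs whose raw value is non-zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) != 0:
            flagged[fields[1]] = int(raw)
    return flagged


# Rows copied from the post-extended-test output in this thread.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(sample))
# {'Current_Pending_Sector': 8, 'Offline_Uncorrectable': 8}
```

To run it against a live drive, the text passed in would be whatever smartctl -a prints for that device; how you capture that output is up to you.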
 

qwerion

Dabbler
Joined
Jan 30, 2014
Messages
19
I updated first post with hardware/build info.

I did run an extended test though I didn't post the raw output (now edited into my second post @ #5). The 8 Offline_Uncorrectable errors did show up after the extended test completed. I'm not sure where smartd was getting its initial error message from when smartctl wasn't reporting anything until after an extended test - do they use different sources?

I'm on a Z2 and backed up elsewhere, so not terribly worried, though I definitely will keep an eye out on this drive.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I'd still recommend running a long test on that drive daily for 4 more days and then weekly. If the error count starts to creep up, replace the drive. If the drive is under warranty then I'd replace it. But the ball is in your court, as they say.

The difference between a short and a long test is that the long test reads all the media, which is why it takes 525 minutes to complete; the short test only checks a few things to ensure the drive is operational.
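
As a rough sanity check on those numbers, here is a small Python calculation using figures copied from the smartctl output earlier in the thread. The function names are made up for this sketch, and the second calculation assumes the extended test scans linearly from LBA 0 upward, which the thread does not confirm.

```python
# Figures copied from the smartctl output in this thread.
CAPACITY_BYTES = 4_000_787_030_016   # "User Capacity"
LOGICAL_SECTOR = 512                 # "Sector Sizes: 512 bytes logical"
EXTENDED_MINUTES = 525               # extended self-test recommended polling time
FIRST_ERROR_LBA = 2_609_375_600      # LBA_of_first_error in the self-test log


def average_read_rate_mb_s(capacity_bytes, minutes):
    """Average rate (decimal MB/s) needed to read the whole disk in `minutes`."""
    return capacity_bytes / (minutes * 60) / 1e6


def fraction_scanned(lba, capacity_bytes, sector_bytes):
    """Fraction of the LBA range below `lba` (assumes a linear scan from LBA 0)."""
    total_sectors = capacity_bytes // sector_bytes
    return lba / total_sectors


print(round(average_read_rate_mb_s(CAPACITY_BYTES, EXTENDED_MINUTES)))
# 127
print(round(fraction_scanned(FIRST_ERROR_LBA, CAPACITY_BYTES, LOGICAL_SECTOR), 3))
# 0.334
```

About 127 MB/s averaged over the whole surface is plausible for a full read of a 5900 rpm drive. Under the linear-scan assumption the first bad LBA sits roughly a third of the way in, which is in line with the "70%" remaining shown in the self-test log if that figure is reported in coarse steps.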

Let's talk warranty for a second... This is of course a new drive based on the run hours, and even new drives have errors periodically. The real fear is that those errors continue to increase. A count of 8 doesn't concern me; however, if the count continually increases over time, that would be an issue and I would RMA that thing. If the count goes up to, say, 100 fairly early in the life of the drive and never increases again, that is okay in my book and I'd keep the drive. You have a 2 year warranty on this drive. I have no idea of its age (I didn't look up the serial number; you can do that), but if by chance you only have 6 months or so left on it, I'd RMA it right now.
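
The "stable versus still climbing" rule of thumb can be written down as a small Python check. This is a sketch only: the function name still_growing and the window of three readings are arbitrary choices for illustration, and apart from the 0 and 8 seen in this thread the sample readings are invented.

```python
def still_growing(readings, recent=3):
    """True if the count rose anywhere across the last `recent` intervals.

    `readings` is a chronological list of Offline_Uncorrectable raw values,
    for example one per weekly long test.
    """
    tail = readings[-(recent + 1):]
    return any(later > earlier for earlier, later in zip(tail, tail[1:]))


# Hypothetical histories (only the 0 and 8 come from this thread).
stable_at_8 = [0, 8, 8, 8, 8]
plateau_at_100 = [0, 40, 100, 100, 100, 100]
creeping = [0, 8, 16, 40, 100]

print(still_growing(stable_at_8))      # False
print(still_growing(plateau_at_100))   # False
print(still_growing(creeping))         # True
```

A plateau, even at a higher number, comes back False, while a count that keeps rising comes back True, which is the case where an RMA is suggested above.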

There is a tool called badblocks that may identify more failures on that drive. If you choose to use it, make sure you understand what you are doing before running it. And again, run the SMART long test on that drive frequently. As I think I said above, I run mine weekly and short tests daily. So far I've been blessed with solid-running hard drives, and I'm well beyond my 3 year warranty period without a single issue.
 