Disk error false positives?

Status
Not open for further replies.

jshurak

Dabbler
Joined
Mar 11, 2014
Messages
10
Hi all, I'm currently receiving some error messages in the GUI for two disks:

CRITICAL: Device: /dev/da0 [SAT], 8 Currently unreadable (pending) sectors
CRITICAL: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
CRITICAL: Device: /dev/da0 [SAT], 64 Currently unreadable (pending) sectors
CRITICAL: Device: /dev/da0 [SAT], 64 Offline uncorrectable sectors
CRITICAL: Device: /dev/da1 [SAT], 8 Currently unreadable (pending) sectors
CRITICAL: Device: /dev/da1 [SAT], 8 Offline uncorrectable sectors

I had a similar issue with one disk a month or so ago; I believe it was da1. After long tests and a manual scrub came back clean, the error messages remained. I was able to clear the warnings by deleting the /tmp/.smartalert file and rebooting. However, the alerts have popped up again.

I'm currently running long smartctl tests on both disks. Nothing looks particularly out of order. I'm wondering if I'm misinterpreting the output or missing something here. Any advice would be greatly appreciated.

The pool appears healthy:
Code:
 ~# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h4m with 0 errors on Tue May 10 03:49:11 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/3a2db1b9-18df-11e5-bd82-ac220b50f777  ONLINE       0     0     0

errors: No known data errors

  pool: tank1
 state: ONLINE
  scan: scrub repaired 0 in 12h58m with 0 errors on Sun Jun  5 12:58:09 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank1                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/cad7d80d-1906-11e5-b8e2-ac220b50f777  ONLINE       0     0     0
            gptid/cb95f722-1906-11e5-b8e2-ac220b50f777  ONLINE       0     0     0
            gptid/cc44bf4f-1906-11e5-b8e2-ac220b50f777  ONLINE       0     0     0
            gptid/669fd75c-2c0b-11e5-9209-ac220b50f777  ONLINE       0     0     0
            gptid/cda92b39-1906-11e5-b8e2-ac220b50f777  ONLINE       0     0     0
            gptid/ce717a98-1906-11e5-b8e2-ac220b50f777  ONLINE       0     0     0


smartctl -a for /dev/da0
Code:
~# smartctl -a -q noserial /dev/da0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p23 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
Firmware Version: CC27
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 14 11:11:41 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 247) Self-test routine in progress...
                                        70% of test remaining.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 320) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       223476760
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       83
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       194540065
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19303
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       84
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   056   045    Old_age   Always       -       34 (Min/Max 28/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       83
193 Load_Cycle_Count        0x0032   065   065   000    Old_age   Always       -       70883
194 Temperature_Celsius     0x0022   034   044   000    Old_age   Always       -       34 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   099   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   099   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       18639h+34m+09.984s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       49934685866
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       176567283014

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 70%     19303         -
# 2  Short offline       Completed without error       00%     19301         -
# 3  Short offline       Completed without error       00%     19267         -
# 4  Short offline       Completed without error       00%     19266         -
# 5  Short offline       Completed without error       00%     19265         -
# 6  Short offline       Completed without error       00%     19264         -
# 7  Short offline       Completed without error       00%     19263         -
# 8  Short offline       Completed without error       00%     19262         -
# 9  Short offline       Completed without error       00%     19261         -
#10  Short offline       Completed without error       00%     19260         -
#11  Short offline       Completed without error       00%     19259         -
#12  Short offline       Completed without error       00%     19258         -
#13  Short offline       Completed without error       00%     19257         -
#14  Short offline       Completed without error       00%     19256         -
#15  Short offline       Completed without error       00%     19255         -
#16  Short offline       Completed without error       00%     19254         -
#17  Short offline       Completed without error       00%     19253         -
#18  Short offline       Completed without error       00%     19252         -
#19  Short offline       Completed without error       00%     19251         -
#20  Short offline       Completed without error       00%     19250         -
#21  Short offline       Completed without error       00%     19249         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


smartctl -a for /dev/da1
Code:
~# smartctl -a -q noserial /dev/da1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p23 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
Firmware Version: CC27
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 14 11:13:48 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 247) Self-test routine in progress...
                                        70% of test remaining.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 319) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       113504200
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       83
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       856
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       206286575
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19304
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       83
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   096   096   000    Old_age   Always       -       4
190 Airflow_Temperature_Cel 0x0022   059   054   045    Old_age   Always       -       41 (Min/Max 33/44)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       83
193 Load_Cycle_Count        0x0032   067   067   000    Old_age   Always       -       67916
194 Temperature_Celsius     0x0022   041   046   000    Old_age   Always       -       41 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       18647h+39m+03.104s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       49921365445
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       177148340102

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 18728 hours (780 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d4 00 81 4f c2 00 00  24d+15:09:18.776  SMART EXECUTE OFF-LINE IMMEDIATE
  61 00 18 ff ff ff 4f 00  24d+15:09:18.776  WRITE FPDMA QUEUED
  b0 d4 00 81 4f c2 00 00  24d+15:09:09.982  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 00 00  24d+15:09:09.889  SMART READ DATA
  ec 00 01 00 00 00 00 00  24d+15:09:09.886  IDENTIFY DEVICE

Error 1 occurred at disk power-on lifetime: 18728 hours (780 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d4 00 81 4f c2 00 00  24d+15:09:09.982  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 00 00  24d+15:09:09.889  SMART READ DATA
  ec 00 01 00 00 00 00 00  24d+15:09:09.886  IDENTIFY DEVICE
  ec 00 01 00 00 00 00 00  24d+15:09:09.886  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00  24d+15:09:09.885  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 70%     19304         -
# 2  Short offline       Completed without error       00%     19302         -
# 3  Short offline       Completed without error       00%     19268         -
# 4  Short offline       Completed without error       00%     19267         -
# 5  Short offline       Completed without error       00%     19266         -
# 6  Short offline       Completed without error       00%     19266         -
# 7  Short offline       Completed without error       00%     19264         -
# 8  Short offline       Completed without error       00%     19263         -
# 9  Short offline       Completed without error       00%     19262         -
#10  Short offline       Completed without error       00%     19261         -
#11  Short offline       Completed without error       00%     19260         -
#12  Short offline       Completed without error       00%     19259         -
#13  Short offline       Completed without error       00%     19258         -
#14  Short offline       Completed without error       00%     19257         -
#15  Short offline       Completed without error       00%     19256         -
#16  Short offline       Completed without error       00%     19255         -
#17  Short offline       Completed without error       00%     19254         -
#18  Short offline       Completed without error       00%     19253         -
#19  Short offline       Completed without error       00%     19252         -
#20  Short offline       Completed without error       00%     19251         -
#21  Short offline       Completed without error       00%     19250         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Well, da0 and da1 show 16 and 8, respectively, pending and offline uncorrectable sectors (attribute IDs 197 and 198). That's what the alert system is reporting. Both of these disks have bad blocks, though so far they're dealing with them. If the drives are still under warranty, I'd RMA them. If not, keep a close eye on those numbers. If they start to climb, the drive is failing and should be replaced quickly.
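For anyone who wants to track those numbers from the shell, here is a minimal sketch, assuming POSIX sh and awk. The function name smart_raw is made up for illustration; the sample rows are copied from the da1 output earlier in the thread. It simply pulls the raw values of attributes 5, 197 and 198 out of `smartctl -A` style text so they can be compared from one day to the next.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): print "ID NAME RAW" for the
# attributes worth watching. Reads `smartctl -A` style text on stdin.
# In the attribute table, field 10 is the start of RAW_VALUE.
smart_raw() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) { print $1, $2, $10 }'
}

# Real use would look like this (device name is only an example):
#   smartctl -A /dev/da1 | smart_raw
# Demo on rows copied from the da1 output in this thread:
smart_raw <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       856
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
EOF
```

Saving that output to a dated file and diffing it against the previous one is one simple way to notice a climb.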

Your SMART test schedules look a bit off, too--it looks like your drives are running short tests every hour. Daily is plenty, and even every few days is fine. A long test should also be run periodically--I can't tell if you're doing that, since the SMART self-test log only stores 21 results.
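The hourly schedule can be read off the LifeTime(hours) column of the self-test log: consecutive short tests are one power-on hour apart. A rough sketch of that check, assuming POSIX sh and awk (the function name selftest_gaps is hypothetical; the sample rows are entries 2 to 5 of da0's log above):

```shell
#!/bin/sh
# Hypothetical helper: print the gap, in power-on hours, between each
# self-test log entry and the next (older) one. Reads smartctl output on stdin.
selftest_gaps() {
    awk '/^# *[0-9]+ / {
        hours = $(NF - 1)            # LifeTime(hours) is the second-to-last field
        if (seen) print prev - hours # log is newest first, so prev >= hours
        prev = hours
        seen = 1
    }'
}

# Real use would look like this (device name is only an example):
#   smartctl -l selftest /dev/da0 | selftest_gaps
# Demo on rows copied from the da0 log in this thread:
selftest_gaps <<'EOF'
# 2  Short offline       Completed without error       00%     19301         -
# 3  Short offline       Completed without error       00%     19267         -
# 4  Short offline       Completed without error       00%     19266         -
# 5  Short offline       Completed without error       00%     19265         -
EOF
```

A long run of 1s in the output means a test is being started every hour.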
 

jshurak

Dabbler
Joined
Mar 11, 2014
Messages
10
Thanks. When does the WHEN_FAILED column become populated? I think that's what is confusing me. When the RAW_VALUE breaks 100? I didn't realize I had hourly SMART tests. They've been updated to run once every few days. New drives en route.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
An attribute is considered failed when the VALUE becomes less than or equal to the THRESHold. The lowest a VALUE can be, no matter how high its RAW is, is 1.

This means that attributes 197 and 198, which have a THRESH of 0, will NEVER be considered failed.
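To make the rule concrete, here is a small sketch that applies "VALUE less than or equal to THRESH" to attribute rows. It assumes POSIX sh and awk, and the function name attr_failing is invented for this example; smartctl does its own evaluation when it fills in WHEN_FAILED, so this only mirrors the rule as described above. The sample rows are from the da1 output.

```shell
#!/bin/sh
# Hypothetical helper: apply the "VALUE <= THRESH" rule described above to
# `smartctl -A` style rows on stdin. Field 4 is VALUE, field 6 is THRESH.
attr_failing() {
    awk 'NF >= 10 && $1 ~ /^[0-9]+$/ {
        value  = $4 + 0              # "099" becomes 99
        thresh = $6 + 0              # "000" becomes 0
        status = (value <= thresh) ? "FAILING" : "ok"
        print $1, $2, "VALUE=" value, "THRESH=" thresh, status
    }'
}

# Real use would look like this (device name is only an example):
#   smartctl -A /dev/da1 | attr_failing
# Demo on rows copied from the da1 output in this thread:
attr_failing <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       856
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
EOF
```

Both rows come out as "ok": 856 reallocated sectors only moved VALUE from 100 to 99, and attribute 197 cannot reach its threshold of 0 at all.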

If a drive gets bad enough that it actually trips one of the trippable attributes, it's usually SO bad that it's unusable.

Especially concerning is da1, which lists:

Code:
5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       856


I'd run a destructive badblocks test on each drive and, assuming it passes, another long SMART test. da1 has also had two hard errors passed to the OS. That's the drive replying to the OS saying "I know you asked for this sector, but I can't read it, sorry".

The other one without reallocated sectors, but WITH the offline uncorrectable and pending sectors, I'd still run badblocks on it to exercise the surface and tease out any spots that might be about to go bad. I find badblocks good for forcing a (questionable) drive into (worse) failure before RMA.

You've got a z2 pool, so you should be safe enough to do this one drive at a time. You do however have two drives that have potential problems. Whether they cause real problems is another story.
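As a rough illustration of that sequence, the sketch below only prints the commands for one disk rather than running them, since `badblocks -w` overwrites the entire device. It assumes the disk has already been taken out of the pool, that e2fsprogs badblocks is installed, and POSIX sh. The function name burnin_plan and the choice of `-b 4096` (picked here to match the 4096-byte physical sectors reported above) are my own assumptions, not something stated in this thread.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for one disk instead of executing them.
# WARNING: badblocks -w is DESTRUCTIVE and wipes everything on the device.
burnin_plan() {
    disk=$1    # bare device name, e.g. da1 (example only)
    # -w: write-mode (destructive) test, -s: show progress, -b: block size
    echo "badblocks -b 4096 -ws /dev/$disk"
    # Follow up with a long SMART self-test, then re-read the attributes.
    echo "smartctl -t long /dev/$disk"
    echo "smartctl -A /dev/$disk"
}

burnin_plan da1
```

Run it for one disk, review the printed commands, and only then execute them by hand, one drive at a time.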
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 856
I suggest you replace da1 ASAP. Triple digit reallocated sectors is outside my comfort zone.

Keep a close eye on da0. It's probably headed in the same direction as da1, because ST3000DM001 is a model notorious for early, sudden failure. If you have more of that model in your pool, now would be a good time to test your backups.
New drives en route.
:cool:
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Keep a close eye on da0. It's probably headed in the same direction as da1, because ST3000DM001 is a model notorious for early, sudden failure. If you have more of that model in your pool, now would be a good time to test your backups.

:cool:

I've been fine so far:

Code:
root@nas ~ # camcontrol devlist | grep ST3000
<ATA ST3000DM001-1ER1 CC25>  at scbus0 target 2 lun 0 (pass0,da0)
<ATA ST3000DM001-1CH1 CC49>  at scbus0 target 13 lun 0 (pass1,da1)
<ATA ST3000DM001-1CH1 CC27>  at scbus0 target 15 lun 0 (pass3,da3)
<ATA ST3000DM001-1CH1 CC47>  at scbus0 target 19 lun 0 (pass5,da5)
<ATA ST3000DM001-1ER1 CC25>  at scbus0 target 20 lun 0 (pass6,da6)
<ATA ST3000DM001-1ER1 CC46>  at scbus0 target 23 lun 0 (pass7,da7)
<ATA ST3000DM001-1CH1 CC27>  at scbus1 target 1 lun 0 (pass9,da9)
<ATA ST3000DM001-1ER1 CC25>  at scbus1 target 2 lun 0 (pass10,da10)
<ATA ST3000DM001-1CH1 CC26>  at scbus1 target 3 lun 0 (pass11,da11)
<ATA ST3000DM001-1CH1 CC29>  at scbus1 target 5 lun 0 (pass13,da13)
<ATA ST3000DM001-1CH1 CC27>  at scbus1 target 6 lun 0 (pass14,da14)
<ATA ST3000DM001-1ER1 CC25>  at scbus1 target 7 lun 0 (pass15,da15)
<ATA ST3000DM001-1ER1 CC43>  at scbus2 target 4 lun 0 (pass16,da16)
<ATA ST3000DM001-9YN1 CC4H>  at scbus2 target 5 lun 0 (pass17,da17)
<ATA ST3000DM001-1ER1 CC25>  at scbus2 target 8 lun 0 (pass18,da18)
<ATA ST3000DM001-1ER1 CC25>  at scbus2 target 13 lun 0 (pass19,da19)
<ATA ST3000DM001-1CH1 CC27>  at scbus2 target 14 lun 0 (pass20,da20)
<ATA ST3000DM001-1ER1 CC43>  at scbus2 target 17 lun 0 (pass22,da22)


This is 18 drives in 3 groups of 6 using z2. But I do agree, they seem to have higher failure rates than other models.
 