Hard drive failed S.M.A.R.T. test

Status
Not open for further replies.

Raul Prado

Cadet
Joined
Jul 24, 2015
Messages
4
Hello all!
This week I received the report email saying a drive failed a S.M.A.R.T. test.
I am wondering if there is a way to dig deeper and find out what's happening.

The volume status shows all the disks without any failure. The alert is green. I am ready to replace the disk, but I want to know what happened.

Thanks for your advice!

My system info:

2IssZDo.png


My Volume status:

X4KddVt.png


And the smartctl output:

Code:
[root@subternia] ~# smartctl -a /dev/da6

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD2001FYYG-01SL3
Revision:             VR07
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01d31efc
Serial number:        WMC160166386
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Oct  1 18:18:41 2015 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        69 C

Manufactured in week 46 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  159
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  5
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      15517       11       846     15528         14      81857.645           3
write:    113907      536       540    114443        536      22858.927           0

Non-medium error count:      382

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   11311                 - [-   -    -]
# 2  Background short  Completed                   -   11264                 - [-   -    -]
# 3  Background long   Failed in segment -->       6   11232         111755709 [0x3 0x11 0x0]
# 4  Background short  Completed                   -   11231                 - [-   -    -]
# 5  Background short  Completed                   -    9682                 - [-   -    -]
# 6  Background short  Completed                   -    9658                 - [-   -    -]
# 7  Background short  Completed                   -    9634                 - [-   -    -]
# 8  Background short  Completed                   -    9610                 - [-   -    -]
# 9  Background short  Completed                   -    9586                 - [-   -    -]
#10  Background short  Completed                   -    9562                 - [-   -    -]
#11  Background short  Completed                   -    9538                 - [-   -    -]
#12  Background short  Completed                   -    9514                 - [-   -    -]
#13  Background short  Completed                   -    9490                 - [-   -    -]
#14  Background short  Completed                   -    9466                 - [-   -    -]
#15  Background short  Completed                   -    9442                 - [-   -    -]
#16  Background short  Completed                   -    9418                 - [-   -    -]
#17  Background short  Completed                   -    9394                 - [-   -    -]
#18  Background short  Completed                   -    9370                 - [-   -    -]
#19  Background short  Completed                   -    9346                 - [-   -    -]
#20  Background short  Completed                   -    9322                 - [-   -    -]

Long (extended) Self Test duration: 15620 seconds [260.3 minutes]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It means that the drive developed a bad sector that was detected by a SMART long test, which typically tests the entire set of sectors on the disk. Short tests usually don't pick up single sector or short-run-of-sector failures. It is also telling you the address of the first failed LBA in case you want to go poking at it, which I kinda like to do.
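For anyone who does want to go poking at it, here is a minimal Python sketch that turns the failed self-test entry into a byte offset and a readable sense description. It assumes the 512-byte logical block size reported by smartctl above. The helper name and the two small lookup tables are illustrative and deliberately incomplete; they are not part of smartmontools.

```python
# Values copied from the failed "Background long" entry in the self-test log.
LBA_FIRST_ERR = 111755709
LOGICAL_BLOCK_SIZE = 512          # bytes, from "Logical block size" in the output
SENSE = (0x3, 0x11, 0x0)          # the [SK ASC ASQ] column

# Deliberately tiny lookup tables: only the codes seen in this thread.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "Unrecovered read error"}


def describe_failure(lba, block_size, sense):
    """Return the hex LBA, the byte offset, and a readable sense description."""
    sk, asc, ascq = sense
    return {
        "lba_hex": hex(lba),
        "byte_offset": lba * block_size,
        "sense_key": SENSE_KEYS.get(sk, "unknown"),
        "detail": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


if __name__ == "__main__":
    for key, value in describe_failure(LBA_FIRST_ERR, LOGICAL_BLOCK_SIZE, SENSE).items():
        print(f"{key}: {value}")
```

The byte offset works out to roughly 57.2 GB into the disk. Reading that single block, for instance with dd using bs=512 and skip=111755709, is one way to check whether the sector still fails.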

ZFS won't show it as an error unless it tries and fails to access the damaged portions of the disk. That's why we like to also have SMART tests running.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Manually run another long test. Also, what exactly did the email say? (Cut and paste it.) To be honest, I've never seen a SMART report like that; I'm not sure of anything other than that it reports 3 uncorrected read errors and a failed long test. What is the output of "smartctl -x /dev/da6"?

EDIT: You are likely looking at a drive replacement.
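To put the 3 uncorrected read errors in proportion, the sketch below parses the two rows of the Error counter log from the smartctl -a output and works out how much data was read per uncorrected error. The field names are my own labels for the seven columns, not smartctl identifiers, and the result is only a rough scale, not a verdict on the drive.

```python
# "read:" and "write:" rows copied from the Error counter log (labels removed).
ROWS = {
    "read": "15517 11 846 15528 14 81857.645 3",
    "write": "113907 536 540 114443 536 22858.927 0",
}


def parse_row(row):
    """Split one counter row into named fields (names are illustrative)."""
    f = row.split()
    return {
        "ecc_fast": int(f[0]),
        "ecc_delayed": int(f[1]),
        "rereads_rewrites": int(f[2]),
        "total_corrected": int(f[3]),
        "algorithm_invocations": int(f[4]),
        "gigabytes_processed": float(f[5]),   # units of 10^9 bytes
        "total_uncorrected": int(f[6]),
    }


def terabytes_per_uncorrected(counters):
    """Data processed per uncorrected error, in 10^12 bytes; None if no errors."""
    if counters["total_uncorrected"] == 0:
        return None
    return counters["gigabytes_processed"] / 1000 / counters["total_uncorrected"]


if __name__ == "__main__":
    for name, row in ROWS.items():
        counters = parse_row(row)
        print(name, counters["total_uncorrected"], terabytes_per_uncorrected(counters))
```

In this particular output the "total errors corrected" column equals the sum of the fast and delayed ECC columns for both rows, which is a handy sanity check that the columns were split correctly.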
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
joeschmuck said:
Manually run another long test. Also, what exactly did the email say? (Cut and paste it.) To be honest, I've never seen a SMART report like that; I'm not sure of anything other than that it reports 3 uncorrected read errors and a failed long test. What is the output of "smartctl -x /dev/da6"?

SAS drives are a bit different...
 

Raul Prado

Cadet
Joined
Jul 24, 2015
Messages
4
jgreco said:
It means that the drive developed a bad sector that was detected by a SMART long test, which typically tests the entire set of sectors on the disk. Short tests usually don't pick up single sector or short-run-of-sector failures. It is also telling you the address of the first failed LBA in case you want to go poking at it, which I kinda like to do.

ZFS won't show it as an error unless it tries and fails to access the damaged portions of the disk. That's why we like to also have SMART tests running.

Thank you for the info!
Is there any way to find out how the drive developed a bad sector? Dust? Temperature? Or is it a common problem in drives?
 

Raul Prado

Cadet
Joined
Jul 24, 2015
Messages
4
joeschmuck said:
Manually run another long test. Also, what exactly did the email say? (Cut and paste it.) To be honest, I've never seen a SMART report like that; I'm not sure of anything other than that it reports 3 uncorrected read errors and a failed long test. What is the output of "smartctl -x /dev/da6"?

EDIT: You are likely looking at a drive replacement.

Thank you for your response!
Yes, I am ready to replace the bad drive. I have 4 spares for this kind of issue.

The email:
subternia.local kernel log messages:
(da6:mps0:0:7:0): READ(10). CDB: 28 00 06 a9 4b b0 00 00 38 00
(da6:mps0:0:7:0): CAM status: SCSI Status Error
(da6:mps0:0:7:0): SCSI status: Check Condition
(da6:mps0:0:7:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:7:0): Info: 0x6a94bb1
(da6:mps0:0:7:0): Actual Retry Count: 277
(da6:mps0:0:7:0): Error 5, Unretryable error
(da6:mps0:0:7:0): READ(10). CDB: 28 00 06 a9 55 90 00 00 20 00
(da6:mps0:0:7:0): CAM status: SCSI Status Error
(da6:mps0:0:7:0): SCSI status: Check Condition
(da6:mps0:0:7:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:7:0): Info: 0x6a955a6
(da6:mps0:0:7:0): Actual Retry Count: 277
(da6:mps0:0:7:0): Error 5, Unretryable error
(da6:mps0:0:7:0): READ(10). CDB: 28 00 06 a9 5f 80 00 00 30 00
(da6:mps0:0:7:0): CAM status: SCSI Status Error
(da6:mps0:0:7:0): SCSI status: Check Condition
(da6:mps0:0:7:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:7:0): Info: 0x6a95f9a
(da6:mps0:0:7:0): Actual Retry Count: 277
(da6:mps0:0:7:0): Error 5, Unretryable error

-- End of security output --




And the output:

Code:
[root@subternia] ~# smartctl -x /dev/da6
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD2001FYYG-01SL3
Revision:             VR07
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01d31efc
Serial number:        WMC160166386
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Oct  1 18:57:11 2015 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        69 C

Manufactured in week 46 of year 2013
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  159
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  5
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      15517       11       846     15528         14      81858.991           3
write:    113907      536       540    114443        536      22859.129           0

Non-medium error count:      382

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   11311                 - [-   -    -]
# 2  Background short  Completed                   -   11264                 - [-   -    -]
# 3  Background long   Failed in segment -->       6   11232         111755709 [0x3 0x11 0x0]
# 4  Background short  Completed                   -   11231                 - [-   -    -]
# 5  Background short  Completed                   -    9682                 - [-   -    -]
# 6  Background short  Completed                   -    9658                 - [-   -    -]
# 7  Background short  Completed                   -    9634                 - [-   -    -]
# 8  Background short  Completed                   -    9610                 - [-   -    -]
# 9  Background short  Completed                   -    9586                 - [-   -    -]
#10  Background short  Completed                   -    9562                 - [-   -    -]
#11  Background short  Completed                   -    9538                 - [-   -    -]
#12  Background short  Completed                   -    9514                 - [-   -    -]
#13  Background short  Completed                   -    9490                 - [-   -    -]
#14  Background short  Completed                   -    9466                 - [-   -    -]
#15  Background short  Completed                   -    9442                 - [-   -    -]
#16  Background short  Completed                   -    9418                 - [-   -    -]
#17  Background short  Completed                   -    9394                 - [-   -    -]
#18  Background short  Completed                   -    9370                 - [-   -    -]
#19  Background short  Completed                   -    9346                 - [-   -    -]
#20  Background short  Completed                   -    9322                 - [-   -    -]

Long (extended) Self Test duration: 15620 seconds [260.3 minutes]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 11324:27 [679467 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x50000c0f01d31efe
    attached SAS address = 0x500605b0064aa0c5
    attached phy identifier = 7
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 1
    Phy reset problem = 0
    Phy event descriptors:
     Transmitted SSP frame error count: 0
     Received SSP frame error count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x50000c0f01d31eff
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Transmitted SSP frame error count: 0
     Received SSP frame error count: 0



I will run a new long test tonight and get back to you with the new output.
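The kernel log messages in the email above can be decoded by hand. As a rough sketch, the Python below unpacks each READ(10) CDB (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and checks that the LBA in the "Info:" line falls inside the range that was being read. It assumes the Info value is the LBA of the block that failed, which is the usual meaning for a medium error. Only the CDB and Info lines are copied from the log, and the function names are hypothetical.

```python
import re

# CDB and Info lines copied from the kernel log in the email.
KERNEL_LOG = """\
(da6:mps0:0:7:0): READ(10). CDB: 28 00 06 a9 4b b0 00 00 38 00
(da6:mps0:0:7:0): Info: 0x6a94bb1
(da6:mps0:0:7:0): READ(10). CDB: 28 00 06 a9 55 90 00 00 20 00
(da6:mps0:0:7:0): Info: 0x6a955a6
(da6:mps0:0:7:0): READ(10). CDB: 28 00 06 a9 5f 80 00 00 30 00
(da6:mps0:0:7:0): Info: 0x6a95f9a
"""


def decode_read10(cdb_hex):
    """Return (starting LBA, block count) for a READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    length = int.from_bytes(cdb[7:9], "big")
    return lba, length


def failing_reads(log):
    """Pair each CDB with the Info LBA that follows it in the log."""
    cdbs = re.findall(r"CDB: ([0-9a-f ]+)", log)
    infos = [int(x, 16) for x in re.findall(r"Info: (0x[0-9a-f]+)", log)]
    rows = []
    for cdb_hex, info_lba in zip(cdbs, infos):
        start, length = decode_read10(cdb_hex.strip())
        rows.append((start, length, info_lba))
    return rows


if __name__ == "__main__":
    for start, length, bad in failing_reads(KERNEL_LOG):
        inside = start <= bad < start + length
        print(f"read {length} blocks at LBA {start}: failed at {bad} (in range: {inside})")
```

All three failing LBAs land inside their respective read ranges, and each is a few thousand sectors past the LBA_first_err value of 111755709 from the self-test log, so the kernel and the long test appear to be complaining about the same small region of the disk.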
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Single bad sectors are rarely an immediate problem, but they do tend to be harbingers for bigger issues (like greater numbers of bad sectors).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sorta, yes, but on the other hand, it's astonishing how reliable hard drives are. Four or five double-sided platters holding gigabits per square centimeter, along with all those heads... You've got billions of sectors of data out there and they're all perfectly readable.
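To put a number on the sector count for the drive in this thread, here is a small worked calculation from the smartctl figures posted earlier (user capacity, logical block size, and the long self-test duration). The scan rate assumes the long test reads the whole surface in the advertised time, which is an approximation rather than something the drive reports.

```python
# Figures copied from the smartctl output posted earlier in the thread.
USER_CAPACITY_BYTES = 2_000_398_934_016
LOGICAL_BLOCK_SIZE = 512
LONG_TEST_SECONDS = 15620


def sector_count(capacity_bytes, block_size):
    """Number of logical blocks on the drive."""
    return capacity_bytes // block_size


def average_scan_rate_mb_s(capacity_bytes, seconds):
    """Average read rate in 10^6 bytes/s if the whole drive is scanned in `seconds`."""
    return capacity_bytes / seconds / 1e6


if __name__ == "__main__":
    sectors = sector_count(USER_CAPACITY_BYTES, LOGICAL_BLOCK_SIZE)
    rate = average_scan_rate_mb_s(USER_CAPACITY_BYTES, LONG_TEST_SECONDS)
    print(f"{sectors:,} sectors, about {rate:.0f} MB/s average during a long test")
```

That comes to about 3.9 billion 512-byte sectors, of which the kernel log and the self-test have flagged only a handful.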
 