Hard drive failing or backplane issue?

Eria211 · Jun 9, 2020

I have just had a very strange error: a disk randomly disconnected from the pool. I assumed it had failed and resigned myself to getting it RMA'd and replaced.

I offlined the disk in the GUI and then ran a query on the SMART status of the disk that had dropped out:

Code:

# smartctl -a /dev/da29
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST10000NM0096
Revision:             E005
Compliance:           SPC-4
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500ae10409b
Serial number:        <serial number>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Jun  9 12:38:44 2020 BST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 24 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  140
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  306
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 50826152
  Blocks received from initiator = 1326407200
  Blocks read from cache and sent to initiator = 2455860
  Number of read and write commands whose size <= segment size = 15160418
  Number of read and write commands whose size > segment size = 147550

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 2592.82
  number of minutes until next internal SMART test = 38

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    6382600        0         0   6382600          0         26.023           0
write:         0        0         0         0          0       2881.274           0

Non-medium error count:        6


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    2576                 - [-   -    -]
# 2  Background short  Completed                   -    2552                 - [-   -    -]
# 3  Background short  Completed                   -    2528                 - [-   -    -]
# 4  Background short  Completed                   -    2504                 - [-   -    -]
# 5  Background short  Completed                   -    2480                 - [-   -    -]
# 6  Background short  Completed                   -    2456                 - [-   -    -]
# 7  Background short  Completed                   -    2432                 - [-   -    -]
# 8  Background long   Completed                   -    2424                 - [-   -    -]
# 9  Background short  Completed                   -    2408                 - [-   -    -]
#10  Background short  Completed                   -    2384                 - [-   -    -]
#11  Background short  Completed                   -    2360                 - [-   -    -]
#12  Background short  Completed                   -    2336                 - [-   -    -]
#13  Background short  Completed                   -    2312                 - [-   -    -]
#14  Background short  Completed                   -    2288                 - [-   -    -]
#15  Background short  Completed                   -    2264                 - [-   -    -]
#16  Background short  Completed                   -    2240                 - [-   -    -]
#17  Background short  Completed                   -    2216                 - [-   -    -]
#18  Background short  Completed                   -    2192                 - [-   -    -]
#19  Background short  Completed                   -    2168                 - [-   -    -]
#20  Background short  Completed                   -    2144                 - [-   -    -]

Long (extended) Self-test duration: 55333 seconds [922.2 minutes]

This seems to indicate that there's nothing wrong with my disk.
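
For anyone who wants to make the same check repeatable, here is a minimal sketch that pulls the fields I was looking at out of SAS-format `smartctl -a` output. The function name `summarize_sas_smart` and the `SMART_SAMPLE` excerpt are made up for illustration, and the regexes assume the exact layout shown above, so treat it as a starting point rather than a robust parser.

```python
import re

# Excerpt of the `smartctl -a` output posted above (SAS/SCSI layout).
SMART_SAMPLE = """\
SMART Health Status: OK
Elements in grown defect list: 0
read:    6382600        0         0   6382600          0         26.023           0
write:         0        0         0         0          0       2881.274           0
Non-medium error count:        6
"""


def summarize_sas_smart(text):
    """Collect a few health indicators from SAS-format smartctl output."""
    summary = {}
    patterns = {
        "health": r"SMART Health Status:\s*(\S+)",
        "grown_defects": r"Elements in grown defect list:\s*(\d+)",
        "non_medium_errors": r"Non-medium error count:\s*(\d+)",
    }
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            value = match.group(1)
            summary[key] = int(value) if value.isdigit() else value

    # The last column of the read/write rows is "Total uncorrected errors".
    for op in ("read", "write"):
        match = re.search(rf"^{op}:\s+(.+)$", text, re.MULTILINE)
        if match:
            summary[f"{op}_uncorrected"] = int(match.group(1).split()[-1])
    return summary


smart_summary = summarize_sas_smart(SMART_SAMPLE)
print(smart_summary)
```

On this disk the health status is OK, the grown defect list is empty, and both uncorrected-error columns are zero, which is what I based the statement above on.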

After offlining the disk I was able to wipe it through the GUI and then replace the disk with itself. The array is currently resilvering onto the same disk that dropped out!

It is currently at 57% without any issues.

This is the output from the console:

Code:

Jun  9 10:58:24 backup ZFS: vdev state changed, pool_guid=3015116645274777860 vdev_guid=1888851438361380937
Jun  9 10:58:24 backup ZFS: vdev is removed, pool_guid=3015116645274777860 vdev_guid=1888851438361380937
Jun  9 10:58:24 backup mpr0: mprsas_prepare_remove: Sending reset for target ID 19
Jun  9 10:58:24 backup da11 at mpr0 bus 0 scbus12 target 19 lun 0
Jun  9 10:58:24 backup da11: <SEAGATE ST10000NM0096 E005> s/n <serial number> detached
Jun  9 10:58:24 backup GEOM_MULTIPATH: da11 in disk12 was disconnected
Jun  9 10:58:24 backup GEOM_MULTIPATH: da11 removed from disk12
Jun  9 10:58:24 backup (da11:mpr0:0:19:0): Periph destroyed
Jun  9 10:58:24 backup mpr0: clearing target 19 handle 0x0016
Jun  9 10:58:24 backup mpr0: At enclosure level 0, slot 11, connector name (    )
Jun  9 10:58:24 backup mpr0: Unfreezing devq for target ID 19
Jun  9 10:58:24 backup mpr0: mprsas_prepare_remove: Sending reset for target ID 124
Jun  9 10:58:24 backup da29 at mpr0 bus 0 scbus12 target 124 lun 0
Jun  9 10:58:24 backup da29: <SEAGATE ST10000NM0096 E005> s/n <serial number> detached
Jun  9 10:58:24 backup mpr0: clearing target 124 handle 0x002c
Jun  9 10:58:24 backup mpr0: At enclosure level 1, slot 11, connector name (    )
Jun  9 10:58:24 backup mpr0: Unfreezing devq for target ID 124
Jun  9 10:58:24 backup GEOM_MULTIPATH: da29 in disk12 was disconnected
Jun  9 10:58:24 backup GEOM_MULTIPATH: out of providers for disk12
Jun  9 10:58:24 backup GEOM_MULTIPATH: da29 removed from disk12
Jun  9 10:58:24 backup GEOM_MULTIPATH: destroying disk12
Jun  9 10:58:24 backup (da29:mpr0:0:124:0): Periph destroyed
Jun  9 10:58:24 backup GEOM_ELI: Device gptid/8695ac43-9388-11ea-b61d-ac1f6bbc06e4.eli destroyed.
Jun  9 10:58:24 backup GEOM_MULTIPATH: disk12 destroyed
Jun  9 10:58:24 backup GEOM_ELI: Detached gptid/8695ac43-9388-11ea-b61d-ac1f6bbc06e4.eli on last close.
Jun  9 11:14:41 backup mpr0: SAS Address from SAS device page0 = 5000c500ae104099
Jun  9 11:14:41 backup mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0016> enclosureHandle<0x0002> slot 11
Jun  9 11:14:41 backup mpr0: At enclosure level 0 and connector name (    )
Jun  9 11:14:41 backup mpr0: SAS Address from SAS device page0 = 5000c500ae104099
Jun  9 11:14:41 backup mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002c> enclosureHandle<0x0004> slot 11
Jun  9 11:14:41 backup mpr0: At enclosure level 1 and connector name (    )
Jun  9 11:14:42 backup     (probe0:mpr0:0:19:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 741 terminated ioc 804b loginfo 31110e05 scsi 0 state c xfer 0
Jun  9 11:14:42 backup (probe0:mpr0:0:19:0): INQUIRY. CDB: 12 00 00 00 24 00
Jun  9 11:14:42 backup (probe0:mpr0:0:19:0): CAM status: CCB request completed with an error
Jun  9 11:14:42 backup (probe0:mpr0:0:19:0): Retrying command
Jun  9 11:14:42 backup     (probe1:mpr0:0:124:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 584 terminated ioc 804b loginfo 31110e05 scsi 0 state c xfer 0
Jun  9 11:14:42 backup (probe1:mpr0:0:124:0): INQUIRY. CDB: 12 00 00 00 24 00
Jun  9 11:14:42 backup (probe1:mpr0:0:124:0): CAM status: CCB request completed with an error
Jun  9 11:14:42 backup (probe1:mpr0:0:124:0): Retrying command
Jun  9 11:15:02 backup ses5: da11,pass15,da29,pass36 in 'Slot11', SAS Slot: 1 phys at slot 11
Jun  9 11:15:02 backup ses5:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Jun  9 11:15:02 backup ses5:  phy 0: parent 5003048020986aff addr 5000c500ae104099
Jun  9 11:15:02 backup da11 at mpr0 bus 0 scbus12 target 19 lun 0
Jun  9 11:15:02 backup da11: <SEAGATE ST10000NM0096 E005> Fixed Direct Access SPC-4 SCSI device
Jun  9 11:15:02 backup da11: Serial Number <serial number>
Jun  9 11:15:02 backup da11: 1200.000MB/s transfers
Jun  9 11:15:02 backup da11: Command Queueing enabled
Jun  9 11:15:02 backup da11: 9537536MB (19532873728 512 byte sectors)
Jun  9 11:15:02 backup da29 at mpr0 bus 0 scbus12 target 124 lun 0
Jun  9 11:15:02 backup da29: <SEAGATE ST10000NM0096 E005> Fixed Direct Access SPC-4 SCSI device
Jun  9 11:15:02 backup da29: Serial Number <serial number>
Jun  9 11:15:02 backup da29: 1200.000MB/s transfers
Jun  9 11:15:02 backup da29: Command Queueing enabled
Jun  9 11:15:02 backup da29: 9537536MB (19532873728 512 byte sectors)
Jun  9 11:15:02 backup GEOM_MULTIPATH: disk12 created
Jun  9 11:15:02 backup GEOM_MULTIPATH: da11 added to disk12
Jun  9 11:15:02 backup GEOM_MULTIPATH: da11 is now active path in disk12
Jun  9 11:15:03 backup GEOM_MULTIPATH: da29 added to disk12

What has gone wrong here? So far nothing seems to be wrong and the unit is operating normally.

Eria211 · Jun 9, 2020

My disks are all on multipaths, though I currently use only one SAS cable rather than the two available to me on the backplane.

Eria211 · Jun 9, 2020

My multipaths:

Code:

Name     Status     LUN ID     
 multipath/disk12    OPTIMAL   
 da29    PASSIVE    5000c500ae10409b
 da11    ACTIVE    5000c500ae10409b
 multipath/disk1    OPTIMAL   
 da0    PASSIVE    5000c500ae0fedcf
 da18    ACTIVE    5000c500ae0fedcf
 multipath/disk2    OPTIMAL   
 da1    PASSIVE    5000c500ae0f5447
 da19    ACTIVE    5000c500ae0f5447
 multipath/disk3    OPTIMAL   
 da2    PASSIVE    5000c500ae102547
 da20    ACTIVE    5000c500ae102547
 multipath/disk4    OPTIMAL   
 da3    PASSIVE    5000c500ae102a83
 da21    ACTIVE    5000c500ae102a83
 multipath/disk5    OPTIMAL   
 da4    PASSIVE    5000c500ae0fcc6b
 da22    ACTIVE    5000c500ae0fcc6b
 multipath/disk6    OPTIMAL   
 da5    PASSIVE    5000c500ae0fb68f
 da23    ACTIVE    5000c500ae0fb68f
 multipath/disk7    OPTIMAL   
 da6    PASSIVE    5000c500ae1029f3
 da24    ACTIVE    5000c500ae1029f3
 multipath/disk8    OPTIMAL   
 da7    PASSIVE    5000c500ae0f52bb
 da25    ACTIVE    5000c500ae0f52bb
 multipath/disk9    OPTIMAL   
 da8    PASSIVE    5000c500ae105ebf
 da26    ACTIVE    5000c500ae105ebf
 multipath/disk10    OPTIMAL   
 da9    PASSIVE    5000c500ae0fdd83
 da27    ACTIVE    5000c500ae0fdd83
 multipath/disk11    OPTIMAL   
 da10    PASSIVE    5000c500ae0fd78f
 da28    ACTIVE    5000c500ae0fd78f
 multipath/disk13    OPTIMAL   
 da12    PASSIVE    5000c500ae0fee33
 da30    ACTIVE    5000c500ae0fee33
 multipath/disk14    OPTIMAL   
 da13    PASSIVE    5000c500ae0fc4cb
 da31    ACTIVE    5000c500ae0fc4cb
 multipath/disk15    OPTIMAL   
 da14    PASSIVE    5000c500ae0fe07b
 da32    ACTIVE    5000c500ae0fe07b
 multipath/disk16    OPTIMAL   
 da15    PASSIVE    5000c500ae0fdda3
 da33    ACTIVE    5000c500ae0fdda3
 multipath/disk17    OPTIMAL   
 da16    PASSIVE    5000c500ae0f536b
 da34    ACTIVE    5000c500ae0f536b
 multipath/disk18    OPTIMAL   
 da17    PASSIVE    5000c500ae0fda37
 da35    ACTIVE    5000c500ae0fda37

Eria211 · Jun 9, 2020

The resilvering finished and the disk is operating as normal so far. Very odd. Is it likely that this was a one-off, or is it indicative of hard drive / backplane issues?

Heracles · Jun 9, 2020

Eria211 said:
The resilvering finished and the disk is operating as normal so far. Very odd. Is it likely that this was a one-off, or is it indicative of hard drive / backplane issues?

Hi,

From what you described, a bad cable, a poor connection, or a backplane issue all look more likely than a hard drive failure.

sretalla · Jun 9, 2020

As far as I can see, this implicates the backplane, so it's worth keeping a close eye on it.
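
If it happens again, one thing worth checking in the console log is whether both paths to the drive drop in the same second, as they did here for da11 and da29 at 10:58:24. Below is a small sketch that scans for that pattern. The function name `simultaneous_path_drops` and the `CONSOLE_SAMPLE` excerpt are only for illustration, and the regex assumes the GEOM_MULTIPATH message format from the log posted earlier.

```python
import re
from collections import defaultdict

# GEOM_MULTIPATH disconnect lines copied from the console log in this thread.
CONSOLE_SAMPLE = """\
Jun  9 10:58:24 backup GEOM_MULTIPATH: da11 in disk12 was disconnected
Jun  9 10:58:24 backup GEOM_MULTIPATH: da11 removed from disk12
Jun  9 10:58:24 backup GEOM_MULTIPATH: da29 in disk12 was disconnected
Jun  9 10:58:24 backup GEOM_MULTIPATH: out of providers for disk12
"""

DISCONNECT = re.compile(
    r"^(\w+\s+\d+\s+[\d:]+)\s+\S+\s+GEOM_MULTIPATH:\s+"
    r"(\w+) in (\w+) was disconnected"
)


def simultaneous_path_drops(log_text):
    """Map (timestamp, multipath disk) to its paths when several dropped at once."""
    drops = defaultdict(list)
    for line in log_text.splitlines():
        match = DISCONNECT.match(line)
        if match:
            timestamp, path, disk = match.groups()
            drops[(timestamp, disk)].append(path)
    # Keep only the cases where more than one path dropped in the same second.
    return {key: paths for key, paths in drops.items() if len(paths) > 1}


path_drops = simultaneous_path_drops(CONSOLE_SAMPLE)
for (timestamp, disk), paths in path_drops.items():
    print(f"{timestamp}: {disk} lost paths {', '.join(paths)} together")
```

A clean SMART report combined with every path to one slot vanishing at once is the sort of pattern that points at the slot, cable, or backplane rather than the media, though the log alone cannot prove which.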

Eria211 · Jun 9, 2020

Thanks for your replies

Would running two cables to my HBA instead of one be likely to help?

Samuel Tai · Jun 9, 2020

Eria211 said:
Thanks for your replies

Would running two cables to my HBA instead of one be likely to help?

Multipath is its own black art.
