Spare resilvering, then out again - but all "online"

JaimieV · Mar 27, 2019

What happened here? Transient issue with one disk, that self-healed after the spare had been switched in?

I got an alert email:

Code:

New alerts:
* The volume DataPool state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

zpool status tells me:

Code:

  pool: DataPool
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 27 12:39:52 2019
    16.3T scanned at 992M/s, 16.0T issued at 973M/s, 16.3T total
    1.49T resilvered, 98.04% done, 0 days 00:05:44 to go
config:
    NAME                                              STATE     READ WRITE CKSUM
    DataPool                                          ONLINE       0     0     0
      raidz1-0                                        ONLINE       0     0     0
        gptid/95bc5b38-45d4-11e9-b26b-a0369f4e18bc    ONLINE       0     0     0
        gptid/0f0e1314-3784-11e9-a44b-1866da830726    ONLINE       0     0     0
        gptid/18382224-3784-11e9-a44b-1866da830726    ONLINE       0     0     0
      raidz1-1                                        ONLINE       0     0     0
        gptid/243629bc-3784-11e9-a44b-1866da830726    ONLINE       0     0     0
        gptid/bb2a79b2-43f7-11e9-a821-a0369f4e18bc    ONLINE       0     0     0
        gptid/3a1b891f-3784-11e9-a44b-1866da830726    ONLINE       0     0     0
      raidz1-2                                        ONLINE       0     0     0
        gptid/6d56f648-44b4-11e9-a821-a0369f4e18bc    ONLINE       0     0     0
        spare-1                                       ONLINE       0     0     0
          gptid/89c921a9-45a7-11e9-83e8-a0369f4e18bc  ONLINE       0     0     0
          gptid/d5888e7b-45d4-11e9-b26b-a0369f4e18bc  ONLINE       0     0     0
        gptid/8365c061-44b4-11e9-a821-a0369f4e18bc    ONLINE       0     0     0
    spares
      3049612206338135796                             INUSE     was /dev/gptid/d5888e7b-45d4-11e9-b26b-a0369f4e18bc
      gptid/d6aa21af-45d4-11e9-b26b-a0369f4e18bc      AVAIL 
errors: No known data errors

And ten minutes later when it completed, all was back to normal:

Code:

  pool: DataPool
state: ONLINE
  scan: resilvered 1.59T in 0 days 05:18:42 with 0 errors on Wed Mar 27 17:58:34 2019
config:
    NAME                                            STATE     READ WRITE CKSUM
    DataPool                                        ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/95bc5b38-45d4-11e9-b26b-a0369f4e18bc  ONLINE       0     0     0
        gptid/0f0e1314-3784-11e9-a44b-1866da830726  ONLINE       0     0     0
        gptid/18382224-3784-11e9-a44b-1866da830726  ONLINE       0     0     0
      raidz1-1                                      ONLINE       0     0     0
        gptid/243629bc-3784-11e9-a44b-1866da830726  ONLINE       0     0     0
        gptid/bb2a79b2-43f7-11e9-a821-a0369f4e18bc  ONLINE       0     0     0
        gptid/3a1b891f-3784-11e9-a44b-1866da830726  ONLINE       0     0     0
      raidz1-2                                      ONLINE       0     0     0
        gptid/6d56f648-44b4-11e9-a821-a0369f4e18bc  ONLINE       0     0     0
        gptid/89c921a9-45a7-11e9-83e8-a0369f4e18bc  ONLINE       0     0     0
        gptid/8365c061-44b4-11e9-a821-a0369f4e18bc  ONLINE       0     0     0
    spares
      gptid/d5888e7b-45d4-11e9-b26b-a0369f4e18bc    AVAIL 
      gptid/d6aa21af-45d4-11e9-b26b-a0369f4e18bc    AVAIL 
errors: No known data errors

Er. What? Do I need to worry about the da7 aka gptid/89c... disk? Twiddle any timeouts? Five hours of resilvering after what looks like a seven second outage seems a bit strong.

/var/log/messages has some info. I can see the obvious dropout and retry but I don't know any deeper interpretation:

Code:

Mar 27 12:22:31 Sisyphus kernel: ix1: link state changed to UP
Mar 27 12:38:25 Sisyphus     (da7:mps0:0:27:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 602 Aborting command 0xfffffe00015f2620
Mar 27 12:38:25 Sisyphus mps0: Sending reset from mpssas_send_abort for target ID 27
Mar 27 12:38:27 Sisyphus mps0: mpssas_prepare_remove: Sending reset for target ID 27
Mar 27 12:38:27 Sisyphus da7 at mps0 bus 0 scbus0 target 27 lun 0
Mar 27 12:38:27 Sisyphus da7: <ATA ST4000NM0033-9ZM GA6C> s/n Z1Z90VPC detached
Mar 27 12:38:28 Sisyphus     (da7:mps0:0:27:0): READ(16). CDB: 88 00 00 00 00 01 17 83 18 c8 00 00 00 80 00 00 length 65536 SMID 623 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Mar 27 12:38:28 Sisyphus     (da7:mps0:0:27:0): READ(16). CDB: 88 00 00 00 00 01 17 83 73 48 00 00 00 80 00 00 length 65536 SMID 171 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): READ(16). CDB: 88 00 00 00 00 01 17 83 18 c8 00 00 00 80 00 00
Mar 27 12:38:28 Sisyphus     (da7:mps0:0:27:0): READ(16). CDB: 88 00 00 00 00 01 85 f8 40 80 00 00 00 80 00 00 length 65536 SMID 895 terminated ioc 804b l(da7:mps0:0:27:0): CAM status: CCB request completed with an error
Mar 27 12:38:28 Sisyphus oginfo 31130000 scsi 0 state c xfer 0
Mar 27 12:38:28 Sisyphus mps0: (da7:mps0:0:27:0): Error 5, Periph was invalidated
Mar 27 12:38:28 Sisyphus Unfreezing devq for target ID 27
Mar 27 12:38:28 Sisyphus mps0: Unfreezing devq for target ID 27
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): READ(16). CDB: 88 00 00 00 00 01 17 83 73 48 00 00 00 80 00 00
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): CAM status: CCB request completed with an error
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): Error 5, Periph was invalidated
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): READ(16). CDB: 88 00 00 00 00 01 85 f8 40 80 00 00 00 80 00 00
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): CAM status: CCB request completed with an error
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): Error 5, Periph was invalidated
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): CAM status: Command timeout
Mar 27 12:38:28 Sisyphus (da7:mps0:0:27:0): Error 5, Periph was invalidated
Mar 27 12:38:28 Sisyphus GEOM_MIRROR: Device swap0: provider da7p1 disconnected.
Mar 27 12:38:28 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=15465247203013771127
Mar 27 12:38:29 Sisyphus (da7:mps0:0:27:0): Periph destroyed
Mar 27 12:38:29 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=15465247203013771127
Mar 27 12:38:29 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=6396531611511845079
Mar 27 12:38:33 Sisyphus mps0: SAS Address for SATA device = 371b835464926847
Mar 27 12:38:33 Sisyphus mps0: SAS Address from SATA device = 371b835464926847
Mar 27 12:38:33 Sisyphus     (probe0:mps0:0:27:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 398 terminated ioc 804b loginfo 31170000 scsi 0 state c xfer 0
Mar 27 12:38:33 Sisyphus (probe0:mps0:0:27:0): INQUIRY. CDB: 12 00 00 00 24 00
Mar 27 12:38:33 Sisyphus (probe0:mps0:0:27:0): CAM status: CCB request completed with an error
Mar 27 12:38:33 Sisyphus (probe0:mps0:0:27:0): Retrying command
Mar 27 12:38:35 Sisyphus ses0: da7,pass8: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
Mar 27 12:38:35 Sisyphus ses0:  phy 0: SATA device
Mar 27 12:38:35 Sisyphus ses0:  phy 0: parent 50050cc10b3710bf addr 50050cc10b371099
Mar 27 12:38:35 Sisyphus da7 at mps0 bus 0 scbus0 target 27 lun 0
Mar 27 12:38:35 Sisyphus da7: <ATA ST4000NM0033-9ZM GA6C> Fixed Direct Access SPC-4 SCSI device
Mar 27 12:38:35 Sisyphus da7: Serial Number Z1Z90VPC
Mar 27 12:38:35 Sisyphus da7: 300.000MB/s transfers
Mar 27 12:38:35 Sisyphus da7: Command Queueing enabled
Mar 27 12:38:35 Sisyphus da7: 3815447MB (7814037168 512 byte sectors)
Mar 27 12:38:45 Sisyphus GEOM_ELI: Device mirror/swap0.eli destroyed.
Mar 27 12:38:45 Sisyphus GEOM_MIRROR: Device swap0: provider destroyed.
Mar 27 12:38:45 Sisyphus GEOM_MIRROR: Device swap0 destroyed.
Mar 27 12:38:47 Sisyphus GEOM_MIRROR: Device mirror/swap0 launched (2/2).
Mar 27 12:38:48 Sisyphus GEOM_ELI: Device mirror/swap0.eli created.
Mar 27 12:38:48 Sisyphus GEOM_ELI: Encryption: AES-XTS 128
Mar 27 12:38:48 Sisyphus GEOM_ELI:     Crypto: hardware
Mar 27 12:38:49 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=8411539843296667809
Mar 27 12:38:49 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=15465247203013771127
Mar 27 12:38:49 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=3049612206338135796
Mar 27 12:38:49 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=8887614878914963373
...
Mar 27 17:58:35 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=8411539843296667809
Mar 27 17:58:35 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=15465247203013771127
Mar 27 17:58:35 Sisyphus ZFS: vdev state changed, pool_guid=14803431099891158073 vdev_guid=8887614878914963373
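
As a sanity check on the "short outage, long resilver" impression, here is a small sketch that subtracts timestamps copied from the output above: the `da7 ... detached` and `da7 ... Fixed Direct Access` log lines, plus the resilver start and finish times from zpool status. The year is an assumption (syslog lines omit it), and the `parse` helper is made up for illustration.

```python
from datetime import datetime

YEAR = 2019  # assumption: syslog omits the year; taken from the zpool output


def parse(stamp):
    """Parse a syslog-style 'Mon DD HH:MM:SS' stamp (hypothetical helper)."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")


detached = parse("Mar 27 12:38:27")        # da7 ... detached
reattached = parse("Mar 27 12:38:35")      # da7 ... Fixed Direct Access
resilver_start = parse("Mar 27 12:39:52")  # "resilver in progress since"
resilver_end = parse("Mar 27 17:58:34")    # "resilvered ... on"

outage = reattached - detached
resilver = resilver_end - resilver_start

print(f"outage:   {outage.total_seconds():.0f} s")
print(f"resilver: {resilver}")
```

Measured from detach to re-attach the gap is 8 seconds, and the resilver interval works out to 5:18:42, which agrees with the duration zpool status reports.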

JaimieV · Mar 27, 2019

Actually, looking at smartctl:

Code:

root@Sisyphus:~ # smartctl -a /dev/da7
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST4000NM0033-9ZM170
Serial Number:    Z1Z90VPC
LU WWN Device Id: 5 000c50 07bb0870b
Add. Product Id:  DELL(tm)
Firmware Version: GA6C
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Mar 27 20:30:55 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   90) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 482) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   078   063   ---    Pre-fail  Always       -       81907675
  3 Spin_Up_Time            0x0103   092   092   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   ---    Old_age   Always       -       5520
  5 Reallocated_Sector_Ct   0x0133   099   099   ---    Pre-fail  Always       -       193
  7 Seek_Error_Rate         0x000f   089   060   ---    Pre-fail  Always       -       834572790
  9 Power_On_Hours          0x0032   071   071   ---    Old_age   Always       -       25846
 10 Spin_Retry_Count        0x0013   100   100   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   095   095   ---    Old_age   Always       -       5518
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   ---    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   095   095   ---    Old_age   Always       -       5
190 Airflow_Temperature_Cel 0x0022   074   047   ---    Old_age   Always       -       26 (Min/Max 15/27)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   098   098   ---    Old_age   Always       -       5515
193 Load_Cycle_Count        0x0032   097   097   ---    Old_age   Always       -       6590
194 Temperature_Celsius     0x0022   026   053   ---    Old_age   Always       -       26 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   054   013   ---    Old_age   Always       -       81907675
196 Reallocated_Event_Count 0x0032   000   000   ---    Old_age   Always       -       22038
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       26228 (13 22 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       12129067059
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       1187095200116

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25754         -
# 2  Short offline       Completed without error       00%     25586         -
# 3  Short offline       Completed without error       00%         9         -
# 4  Vendor (0xdf)       Completed without error       00%         9         -
# 5  Short offline       Completed without error       00%         4         -
# 6  Vendor (0xdf)       Completed without error       00%         3         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Looks a bit ropy! Or do Seagates have hard to interpret ass-backwards raw numbers? My other ones have similarly bad looking reallocated_sectors, seek_error_rate and hardware_ecc_recovered, while the Toshibas don't.

Chris Moore · Mar 27, 2019

JaimieV said:
Actually, looking at smartctl:

No, this is a bad drive. Needs to be replaced. Looking specifically at these values:

Code:

root@Sisyphus:~ # smartctl -a /dev/da7

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

See where it says PASSED? Always ignore that. If the drive comes ready at all, it passes, so you can have thousands of bad or reallocated sectors and the drive will still say PASSED. It is the most meaningless thing ever. I would go so far as to say it is a lie.

Code:

root@Sisyphus:~ # smartctl -a /dev/da7

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  5 Reallocated_Sector_Ct   0x0133   099   099   ---    Pre-fail  Always       -       193

188 Command_Timeout         0x0032   100   099   ---    Old_age   Always       -       4295032833

196 Reallocated_Event_Count 0x0032   000   000   ---    Old_age   Always       -       22038

Then, under the section called "Vendor Specific SMART Attributes with Thresholds:", you have some "Command_Timeout" problems, which I would say are unusual, and you have a non-zero "Reallocated_Event_Count". The "Reallocated_Event_Count" is telling you that you have had that many bad sectors that needed to be remapped. This is usually an indicator that the drive is about to die.

EDIT: I totally forgot to mention the "Reallocated_Sector_Ct". Once you have a non-zero value here, the drive should be replaced, as it is on its last legs. There is reserved space in the drive that it normally uses, and it only starts reporting a number here once it uses up all the reserved space. See the Reallocated_Event_Count. That is how many bad sectors there are; it just doesn't say it that way. Marketing.
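
On the Command_Timeout raw value specifically: on Seagate drives this figure is often read as three packed 16-bit counters rather than one large number. That reading is an unofficial convention, not something confirmed in this thread, and the helper name below is made up for illustration. A minimal sketch of the split, using the raw value quoted above:

```python
from typing import Tuple


def split_raw16(raw: int) -> Tuple[int, int, int]:
    """Split a 48-bit SMART raw value into three 16-bit words, high to low.

    Assumption: the drive packs three separate counters into the raw field.
    """
    return ((raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF)


command_timeout_raw = 4295032833  # da7, attribute 188

print(hex(command_timeout_raw))
print(split_raw16(command_timeout_raw))
```

Under that assumption the value is 0x100010001, i.e. the words (1, 1, 1): a small non-zero timeout count rather than billions of events.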

Code:

root@Sisyphus:~ # smartctl -a /dev/da7

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25754         -
# 2  Short offline       Completed without error       00%     25586         -
# 3  Short offline       Completed without error       00%         9         -
# 4  Vendor (0xdf)       Completed without error       00%         9         -
# 5  Short offline       Completed without error       00%         4         -
# 6  Vendor (0xdf)       Completed without error       00%         3         -

Looking at the section titled, "SMART Self-test log structure", you have only got a couple tests in there and they are short tests. I would suggest configuring your system to run a short test daily and a long test once a week. This will help you detect problems with the mechanical disks before they become data errors in your pool.
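
To see how thin the test history is, here is a rough sketch that parses the self-test log rows quoted above and reports whether any long test appears and how many power-on hours have passed since the last test. The log text and the Power_On_Hours figure (25846) are copied from the da7 output; the function name and the split on runs of two or more spaces are assumptions that fit this particular paste, and smartctl is assumed to label long tests "Extended offline".

```python
import re

# Rows copied from the da7 self-test log in this thread.
SELF_TEST_LOG = """\
# 1  Short offline       Completed without error       00%     25754         -
# 2  Short offline       Completed without error       00%     25586         -
# 3  Short offline       Completed without error       00%         9         -
# 4  Vendor (0xdf)       Completed without error       00%         9         -
# 5  Short offline       Completed without error       00%         4         -
# 6  Vendor (0xdf)       Completed without error       00%         3         -
"""
POWER_ON_HOURS = 25846  # attribute 9 raw value for da7


def parse_self_tests(log_text):
    """Return one dict per '# N' row (hypothetical helper)."""
    tests = []
    for line in log_text.splitlines():
        if not line.startswith("#"):
            continue
        # Columns are separated by runs of two or more spaces in this paste:
        # ["# 1", description, status, remaining, lifetime_hours, first_error]
        fields = re.split(r"\s{2,}", line.strip())
        tests.append({
            "description": fields[1],
            "status": fields[2],
            "hours": int(fields[4]),
        })
    return tests


tests = parse_self_tests(SELF_TEST_LOG)
has_long_test = any(t["description"].startswith("Extended") for t in tests)
hours_since_last = POWER_ON_HOURS - max(t["hours"] for t in tests)

print(f"tests logged: {len(tests)}")
print(f"any long test: {has_long_test}")
print(f"hours since last test: {hours_since_last}")
```

For this log it finds six entries, no long test, and 92 power-on hours since the most recent short test.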

Chris Moore · Mar 27, 2019

JaimieV said:
My other ones have similarly bad looking reallocated_sectors

Bad sectors? Will you show us some SMART data so we can figure it out?

JaimieV · Mar 27, 2019

Sure - here's da1, 5 and 6 (the others are Tosh, and have zeroes for the things that would be bad).
Could be special firmware or something, this was a box of ex-datacentre drives from a refurb supplier. Cheap, and I binned a couple already during burnin.

/Edit: Gah! I misread the lines, these don't have reallocated_sector_ct > 0 at all. Replace time it is.
Various other error rates are still daft looking though... Raw_Read_Error_Rate, Seek_Error_Rate, Hardware_ECC_Recovered. Thoughts on that?
I thought I'd set up all the SMART scans, but that may have been on the burnin instance and I forgot on the final deploy. It's only domestic and is all backed up to my old microserver so it's no big deal. I'll set them up while it's resilvering onto a new drive. /endEdit
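
On the daft-looking rates: a widely repeated but unofficial reading of Seagate's Raw_Read_Error_Rate and Seek_Error_Rate is that the 48-bit raw value packs an error count in the upper 16 bits and an operation count in the lower 32 bits. Treat that layout as an assumption, not something confirmed for this Dell firmware. The function name is made up for illustration. A quick sketch applying it to the raw values from da7, da1 and da5:

```python
def split_seagate_rate(raw):
    """Split a raw value into (errors, operations).

    Assumption: upper 16 bits are errors, lower 32 bits are operations.
    """
    errors = raw >> 32
    operations = raw & 0xFFFFFFFF
    return errors, operations


# Raw values copied from the smartctl output in this thread.
samples = {
    "da7 Raw_Read_Error_Rate": 81907675,
    "da7 Seek_Error_Rate": 834572790,
    "da1 Raw_Read_Error_Rate": 67490793,
    "da1 Seek_Error_Rate": 785537141,
    "da5 Raw_Read_Error_Rate": 159385379,
    "da5 Seek_Error_Rate": 837351606,
}

results = {name: split_seagate_rate(raw) for name, raw in samples.items()}
for name, (errors, operations) in results.items():
    print(f"{name}: errors={errors} operations={operations}")
```

Every one of these values is below 2^32, so under this reading the error part is zero in each case and the big numbers are just operation counters.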

Code:

root@Sisyphus:~ # smartctl -a /dev/da1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST4000NM0033-9ZM170
Serial Number:    Z1Z9E2F0
LU WWN Device Id: 5 000c50 086ec021b
Add. Product Id:  DELL(tm)
Firmware Version: GA6E
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Mar 27 21:34:40 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   90) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 484) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   078   063   ---    Pre-fail  Always       -       67490793
  3 Spin_Up_Time            0x0103   093   092   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0133   100   100   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   064   ---    Pre-fail  Always       -       785537141
  9 Power_On_Hours          0x0032   071   071   ---    Old_age   Always       -       25913
 10 Spin_Retry_Count        0x0013   100   100   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       31
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   097   097   ---    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   075   055   ---    Old_age   Always       -       25 (Min/Max 24/27)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       30
193 Load_Cycle_Count        0x0032   100   100   ---    Old_age   Always       -       764
194 Temperature_Celsius     0x0022   025   045   ---    Old_age   Always       -       25 (0 11 0 0 0)
195 Hardware_ECC_Recovered  0x001a   059   006   ---    Old_age   Always       -       67490793
196 Reallocated_Event_Count 0x0032   000   000   ---    Old_age   Always       -       4128
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       9650 (53 244 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       9465156529
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       936470662367

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25820         -
# 2  Short offline       Completed without error       00%     25652         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Sisyphus:~ # smartctl -a /dev/da5
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST4000NM0033-9ZM170
Serial Number:    Z1Z90W5E
LU WWN Device Id: 5 000c50 07bb07a96
Add. Product Id:  DELL(tm)
Firmware Version: GA6C
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Mar 27 21:34:43 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   90) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 483) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   081   063   ---    Pre-fail  Always       -       159385379
  3 Spin_Up_Time            0x0103   092   092   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   ---    Old_age   Always       -       5526
  5 Reallocated_Sector_Ct   0x0133   100   100   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   089   060   ---    Pre-fail  Always       -       837351606
  9 Power_On_Hours          0x0032   071   071   ---    Old_age   Always       -       25926
 10 Spin_Retry_Count        0x0013   100   100   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   095   095   ---    Old_age   Always       -       5524
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   076   058   ---    Old_age   Always       -       24 (Min/Max 22/26)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   098   098   ---    Old_age   Always       -       5516
193 Load_Cycle_Count        0x0032   097   097   ---    Old_age   Always       -       6618
194 Temperature_Celsius     0x0022   024   042   ---    Old_age   Always       -       24 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   060   013   ---    Old_age   Always       -       159385379
196 Reallocated_Event_Count 0x0032   000   000   ---    Old_age   Always       -       21845
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       26307 (41 56 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       16207925530
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       1187869195807

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25833         -
# 2  Short offline       Completed without error       00%     25665         -
# 3  Short offline       Completed without error       00%         9         -
# 4  Vendor (0xdf)       Completed without error       00%         9         -
# 5  Short offline       Completed without error       00%         3         -
# 6  Vendor (0xdf)       Completed without error       00%         3         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Sisyphus:~ # smartctl -a /dev/da6
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST4000NM0033-9ZM170
Serial Number:    Z1Z9DDMV
LU WWN Device Id: 5 000c50 086ec3e9d
Add. Product Id:  DELL(tm)
Firmware Version: GA6E
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Mar 27 21:34:45 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   90) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 480) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   079   063   ---    Pre-fail  Always       -       92896990
  3 Spin_Up_Time            0x0103   093   093   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0133   100   100   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   090   064   ---    Pre-fail  Always       -       1295898482
  9 Power_On_Hours          0x0032   072   072   ---    Old_age   Always       -       25367
 10 Spin_Retry_Count        0x0013   100   100   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       20
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   054   ---    Old_age   Always       -       26 (Min/Max 20/28)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   096   096   ---    Old_age   Always       -       8689
194 Temperature_Celsius     0x0022   026   046   ---    Old_age   Always       -       26 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   062   007   ---    Old_age   Always       -       92896990
196 Reallocated_Event_Count 0x0032   000   000   ---    Old_age   Always       -       65535
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       19627 (252 16 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       5507610726
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       925195274171

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25274         -
# 2  Short offline       Completed without error       00%     25106         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

JaimieV · Mar 27, 2019

The Tosh drives look like this, which is far more the sort of thing I expect:

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       11702
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       39
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   030   030   000    Old_age   Always       -       28059
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       20 (Min/Max 11/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       12730660600
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       7894455104

Chris Moore · Mar 27, 2019

JaimieV said:
Various other error rates are daft looking though... Raw_Read_Error_Rate, Seek_Error_Rate, Hardware_ECC_Recovered. Thoughts?

Those values usually look kind of crazy because of the way Seagate stores the number: it's actually two numbers run together. Somewhere on the forum there's a document that explains how to read them, but they aren't very indicative of a drive failure. The thing to look at is Reallocated_Event_Count: it's nonzero on all of those drives, so even though they don't show any sectors marked as reallocated, they have reallocated sectors.
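
To illustrate the "two numbers run together" point, here is a minimal sketch. It assumes the split most often described in community write-ups for Seagate's Raw_Read_Error_Rate and Seek_Error_Rate: a 48-bit raw value whose upper 16 bits count errors and whose lower 32 bits count operations. That layout is an assumption, not something confirmed in this thread or by Seagate, and `split_seagate_raw` is a hypothetical helper name. The sample values are the da6 readings posted above.

```python
from typing import Tuple


def split_seagate_raw(raw: int) -> Tuple[int, int]:
    """Split a raw SMART value into (errors, operations).

    ASSUMPTION: upper 16 bits = error count, lower 32 bits = operation
    count. This is the commonly cited layout, not an official one.
    """
    errors = (raw >> 32) & 0xFFFF      # high 16 bits of the 48-bit value
    operations = raw & 0xFFFFFFFF      # low 32 bits
    return errors, operations


# Raw values copied from the da6 smartctl output in this thread.
samples = {
    "Raw_Read_Error_Rate": 92896990,
    "Seek_Error_Rate": 1295898482,
}

for name, raw in samples.items():
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors} operations={operations}")
```

Under that assumption both da6 values fit entirely in the lower 32 bits, so the error half is zero and the large number is just an operation counter.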

JaimieV · Mar 27, 2019

Thanks Chris. Bless you Seagate, making life difficult!
Looks like it could be a pretty iffy box of disks, though may just be age. I'll run burnin on the rest and eliminate the dropouts.

Chris Moore · Mar 27, 2019

JaimieV said:
Thanks Chris. Bless you Seagate, making life difficult!
Looks like it could be a pretty iffy box of disks, though may just be age. I'll run burnin on the rest and eliminate the dropouts.

I have a server at work that had 16 of that type of drive, and a little past the four-year mark I had three of them fail in fairly quick succession. They are still under five years old and the other 13 are still going fine. I have another server still running, for now, that had 12 of the 2TB version of those drives, and I had two of them fail around the five-year mark. The other ten are still going strong at around 6.2 years of power-on hours with restarts under 60. You can never tell for sure, but the newer model drives appear to be a bit more reliable.
Generally, drives tend to fail more frequently between four and five years of use. There is a graph:

Failure rate in hard drives starts ticking up in year four, usually after the warranty is done. They try to build them well enough that they survive the warranty period.
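
To place the drives in this thread against that four-to-five-year window, the Power_On_Hours raw value can be converted to years. This is only a rough sketch: it assumes the raw value is a plain hour count and uses 365.25 days per year, and `power_on_years` is a hypothetical helper name. The hour figures are the ones posted above for da6 and for the Toshiba drive.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours, averaging in leap years


def power_on_years(hours: int) -> float:
    """Convert a Power_On_Hours raw value to years of powered-on time."""
    return hours / HOURS_PER_YEAR


# Power_On_Hours raw values copied from the smartctl output in this thread.
for label, hours in [("da6 (Seagate)", 25367), ("Toshiba", 28059)]:
    print(f"{label}: {hours} h = {power_on_years(hours):.1f} years")
```

On those assumptions da6 comes out at about 2.9 years and the Toshiba at about 3.2 years of powered-on time, which is not the same thing as calendar age.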
