CAM reporting SCSI Status errors - new disk dead?

Dkzero (Cadet, joined Sep 3, 2020, messages: 3)
Hi,

long-time FreeNAS user, first-time hard drive issue.
Basically, I've got 6 Seagate Exos X16 drives (ST16000NM002G) in a RAIDZ2 configuration inside a Supermicro 2U machine with a BPN-SAS3-826EL1 backplane; this setup has been working just fine for 6 months.

I received an email alert containing:
Device: /dev/da2, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH.
so I logged into the machine to check; the syslog is full of CAM error messages originating from one specific drive (/dev/da2):

Code:
Sep  2 00:00:00 nas newsyslog[49055]: logfile turned over due to size>200K
Sep  2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload request received, reloading configuration;
Sep  2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload finished;
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
...
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Info: 0x38595b9a7
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 6
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Actual Retry Count: 255
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep  3 11:16:10 nas.redacted.com collectd[1560]: Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
    temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
    temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
    raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout


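For reference, the asc/ascq pairs in those sense lines come from the T10 SPC additional-sense-code assignments. A minimal lookup covering just the three codes that appear in this log (a sketch, not the full table):

```shell
# Decode the ASC/ASCQ pairs seen in the CAM messages above.
# Only the three codes from this log are mapped; the full T10 table
# is much larger.
decode_sense() {
    case "$1,$2" in
        42,0)  echo "Power-on or self-test failure" ;;
        c,2)   echo "Write error - auto reallocation failed" ;;
        5d,32) echo "Data channel impending failure, data error rate too high" ;;
        *)     echo "Unknown (see the T10 ASC/ASCQ assignments)" ;;
    esac
}

decode_sense 42 0   # the code repeated throughout Sep 2
decode_sense c 2    # the deferred medium error on Sep 3
```

The shift from a self-test failure code to a failed auto-reallocation is the notable part: the drive is no longer able to remap bad sectors on write.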
Checking the specific disk reveals the following:

Code:
# smartctl -a /dev/da2
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST16000NM002G
Revision:             E003
Compliance:           SPC-5
User Capacity:        16,000,900,661,248 bytes [16.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500af1fa897
Serial number:        REDACTED
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Sep  3 11:32:01 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 295
Power on minutes since format <not available>
Current Drive Temperature:     37 C
Drive Trip Temperature:        60 C

Manufactured in week 04 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  109
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  284
Elements in grown defect list: 309

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2066229080
  Blocks received from initiator = 1150438328
  Blocks read from cache and sent to initiator = 1046461852
  Number of read and write commands whose size <= segment size = 57891915
  Number of read and write commands whose size > segment size = 1865688

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 3678.77
  number of minutes until next internal SMART test = 50

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      36208.974           0
write:         0        0         0         0          0       5017.512           3
verify:        0       30         0        30        197         33.307           0

Non-medium error count:        8


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    3677                 - [-   -    -]
# 2  Background short  Completed                   -    3667                 - [-   -    -]
# 3  Background short  Completed                   -    3595                 - [-   -    -]
# 4  Background short  Completed                   -    3547                 - [-   -    -]
# 5  Background short  Completed                   -    3499                 - [-   -    -]
# 6  Background long   Completed                   -    3477                 - [-   -    -]
# 7  Background short  Completed                   -    3451                 - [-   -    -]
# 8  Background short  Completed                   -    3403                 - [-   -    -]
# 9  Background short  Completed                   -    3355                 - [-   -    -]
#10  Background short  Completed                   -    3307                 - [-   -    -]
#11  Background short  Completed                   -    3259                 - [-   -    -]
#12  Background short  Completed                   -    3211                 - [-   -    -]
#13  Background short  Completed                   -    3163                 - [-   -    -]
#14  Background long   Completed                   -    3141                 - [-   -    -]
#15  Background short  Completed                   -    3115                 - [-   -    -]
#16  Background short  Completed                   -    3067                 - [-   -    -]
#17  Background short  Completed                   -    3019                 - [-   -    -]
#18  Background short  Completed                   -    2971                 - [-   -    -]
#19  Background short  Completed                   -    2923                 - [-   -    -]
#20  Background short  Completed                   -    2851                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]


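The numbers worth tracking in that output are "Elements in grown defect list" (309), "Total new blocks reassigned" (295), and the 3 total uncorrected write errors. A small sketch of how the grown-defect count could be pulled out of smartctl output for monitoring (the piped sample line here is taken from the output above; on a live system you would pipe `smartctl -a /dev/da2` instead):

```shell
# Sketch: extract the grown-defect count from smartctl SCSI output.
# Any nonzero count on a ~5-month-old drive is a red flag, and a count
# that grows between checks means the drive is actively failing.
grown_defects() {
    awk -F': *' '/Elements in grown defect list/ { print $2 }'
}

# Live use would be:  smartctl -a /dev/da2 | grown_defects
# Here we feed the line from the report above:
COUNT=$(echo "Elements in grown defect list: 309" | grown_defects)
echo "grown defects: $COUNT"
```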
The ZFS pool seems healthy and reports ONLINE status.
Is the disk broken and should I request a replacement, or is there anything I should try first?
 

Dkzero
Forgot to mention, the Supermicro backplane is connected to an HBA card (AOC-S3008L-L8e) in IT mode, running the latest firmware. In case it matters in any way...
 

sretalla (Powered by Neutrality, Moderator, joined Jan 1, 2016, messages: 9,703)
From what the SMART data is showing, you shouldn't continue to use that disk. Since it has about 5 months of run time on it, it should be able to be RMA'd.

Interesting that, with a manufacture date in week 4 of 2020, the drive barely had time to come off the factory line and get shipped to you before it started spinning without a break until now.
 

Dkzero
sretalla said:
From what the SMART data is showing, you shouldn't continue to use that disk. Since it has about 5 months of run time on it, it should be able to be RMA'd.

Interesting that, with a manufacture date in week 4 of 2020, the drive barely had time to come off the factory line and get shipped to you before it started spinning without a break until now.

Agreed, this drive shouldn't have died this soon, but it did. These drives carry a 5-year warranty, so I've sent a request to Seagate as well as to the place of purchase; no response yet. Regardless, should I take the disk out of (detach it from) the RAID now?
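For reference, the usual ZFS flow in this situation is offline-then-replace rather than detach; with RAIDZ2 the pool keeps one level of redundancy even with one disk offline. A sketch of the commands, built as strings so they can be reviewed before running; the pool name `tank` and the replacement device `da6` are placeholders, substitute your own:

```shell
# Sketch of the usual ZFS single-disk replacement flow.
# POOL and NEW_DISK are assumptions - use your actual pool name and
# the device node of the replacement drive.
POOL=tank
FAILED=da2
NEW_DISK=da6

OFFLINE_CMD="zpool offline $POOL $FAILED"
REPLACE_CMD="zpool replace $POOL $FAILED $NEW_DISK"

echo "$OFFLINE_CMD"   # take the failing disk out of service first
echo "$REPLACE_CMD"   # then resilver onto the new disk
```

After the `zpool replace`, watch `zpool status` until the resilver completes before physically pulling the old drive (unless the chassis is hot-swap and you need the bay).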
 