CAM reporting SCSI Status errors - new disk dead?

Dkzero (Cadet, joined Sep 3, 2020, messages: 3)
Hi,

long-time FreeNAS user, first-time hard drive issue.
Basically, I've got 6 Seagate Exos X16 drives (ST16000NM002G) in a RAIDZ2 configuration inside a Supermicro 2U machine with a BPN-SAS3-826EL1 backplane; this setup has been working just fine for 6 months.

I received an email alert containing:
Device: /dev/da2, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH.
so I logged into the machine to check; the syslog is full of CAM error messages originating from one specific drive (/dev/da2):

Code:
Sep  2 00:00:00 nas newsyslog[49055]: logfile turned over due to size>200K
Sep  2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload request received, reloading configuration;
Sep  2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload finished;
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep  2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep  2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep  2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
...
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Info: 0x38595b9a7
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 6
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Actual Retry Count: 255
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep  3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep  3 11:16:10 nas.redacted.com collectd[1560]: Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
    temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
    temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
    raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout


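For reference, the asc/ascq pairs in those sense lines come from the T10 SPC additional-sense-code assignments. A minimal lookup covering just the three codes that appear in this log (a sketch, not the full table):

```shell
# Decode the ASC/ASCQ pairs seen in the CAM messages above.
# Only the three codes from this log are mapped; the full T10 table
# is much larger.
decode_sense() {
    case "$1,$2" in
        42,0)  echo "Power-on or self-test failure" ;;
        c,2)   echo "Write error - auto reallocation failed" ;;
        5d,32) echo "Data channel impending failure, data error rate too high" ;;
        *)     echo "Unknown (see the T10 ASC/ASCQ assignments)" ;;
    esac
}

decode_sense 42 0   # the code repeated throughout Sep 2
decode_sense c 2    # the deferred medium error on Sep 3
```

The shift from a self-test failure code to a failed auto-reallocation is the notable part: the drive is no longer able to remap bad sectors on write.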
Checking the specific disk reveals the following:

Code:
# smartctl -a /dev/da2
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST16000NM002G
Revision:             E003
Compliance:           SPC-5
User Capacity:        16,000,900,661,248 bytes [16.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500af1fa897
Serial number:        REDACTED
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Sep  3 11:32:01 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 295
Power on minutes since format <not available>
Current Drive Temperature:     37 C
Drive Trip Temperature:        60 C

Manufactured in week 04 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  109
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  284
Elements in grown defect list: 309

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2066229080
  Blocks received from initiator = 1150438328
  Blocks read from cache and sent to initiator = 1046461852
  Number of read and write commands whose size <= segment size = 57891915
  Number of read and write commands whose size > segment size = 1865688

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 3678.77
  number of minutes until next internal SMART test = 50

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      36208.974           0
write:         0        0         0         0          0       5017.512           3
verify:        0       30         0        30        197         33.307           0

Non-medium error count:        8


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    3677                 - [-   -    -]
# 2  Background short  Completed                   -    3667                 - [-   -    -]
# 3  Background short  Completed                   -    3595                 - [-   -    -]
# 4  Background short  Completed                   -    3547                 - [-   -    -]
# 5  Background short  Completed                   -    3499                 - [-   -    -]
# 6  Background long   Completed                   -    3477                 - [-   -    -]
# 7  Background short  Completed                   -    3451                 - [-   -    -]
# 8  Background short  Completed                   -    3403                 - [-   -    -]
# 9  Background short  Completed                   -    3355                 - [-   -    -]
#10  Background short  Completed                   -    3307                 - [-   -    -]
#11  Background short  Completed                   -    3259                 - [-   -    -]
#12  Background short  Completed                   -    3211                 - [-   -    -]
#13  Background short  Completed                   -    3163                 - [-   -    -]
#14  Background long   Completed                   -    3141                 - [-   -    -]
#15  Background short  Completed                   -    3115                 - [-   -    -]
#16  Background short  Completed                   -    3067                 - [-   -    -]
#17  Background short  Completed                   -    3019                 - [-   -    -]
#18  Background short  Completed                   -    2971                 - [-   -    -]
#19  Background short  Completed                   -    2923                 - [-   -    -]
#20  Background short  Completed                   -    2851                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]


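The numbers worth tracking in that output are "Elements in grown defect list" (309), "Total new blocks reassigned" (295), and the 3 total uncorrected write errors. A small sketch of how the grown-defect count could be pulled out of smartctl output for monitoring (the piped sample line here is taken from the output above; on a live system you would pipe `smartctl -a /dev/da2` instead):

```shell
# Sketch: extract the grown-defect count from smartctl SCSI output.
# Any nonzero count on a ~5-month-old drive is a red flag, and a count
# that grows between checks means the drive is actively failing.
grown_defects() {
    awk -F': *' '/Elements in grown defect list/ { print $2 }'
}

# Live use would be:  smartctl -a /dev/da2 | grown_defects
# Here we feed the line from the report above:
COUNT=$(echo "Elements in grown defect list: 309" | grown_defects)
echo "grown defects: $COUNT"
```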
The ZFS pool seems healthy and reports ONLINE status.
Is the disk broken and should I request a replacement, or is there anything I should try first?
 

Dkzero
Forgot to mention, the Supermicro backplane is connected to an HBA card (AOC-S3008L-L8e) in IT mode, running the latest firmware. In case it matters in any way...
 

sretalla (Powered by Neutrality, Moderator, joined Jan 1, 2016, messages: 9,703)
From what the SMART data is showing, you shouldn't continue to use that disk. Since it has about 5 months of run time on it, it should be able to be RMA'd.

Interesting that, with a manufacture date in week 4 of 2020, the drive barely had time to come off the factory line and get shipped to you before it started spinning without a break until now.
 

Dkzero
sretalla said:
From what the SMART data is showing, you shouldn't continue to use that disk. Since it has about 5 months of run time on it, it should be able to be RMA'd.

Interesting that, with a manufacture date in week 4 of 2020, the drive barely had time to come off the factory line and get shipped to you before it started spinning without a break until now.

Agreed, this drive shouldn't have died this soon, but it did. These drives carry a 5-year warranty, so I've sent a request to Seagate as well as to the place of purchase; no response yet. Regardless, should I take the disk out of (detach it from) the RAID now?
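For reference, the usual ZFS flow in this situation is offline-then-replace rather than detach; with RAIDZ2 the pool keeps one level of redundancy even with one disk offline. A sketch of the commands, built as strings so they can be reviewed before running; the pool name `tank` and the replacement device `da6` are placeholders, substitute your own:

```shell
# Sketch of the usual ZFS single-disk replacement flow.
# POOL and NEW_DISK are assumptions - use your actual pool name and
# the device node of the replacement drive.
POOL=tank
FAILED=da2
NEW_DISK=da6

OFFLINE_CMD="zpool offline $POOL $FAILED"
REPLACE_CMD="zpool replace $POOL $FAILED $NEW_DISK"

echo "$OFFLINE_CMD"   # take the failing disk out of service first
echo "$REPLACE_CMD"   # then resilver onto the new disk
```

After the `zpool replace`, watch `zpool status` until the resilver completes before physically pulling the old drive (unless the chassis is hot-swap and you need the bay).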
 