Hi,
Long-time FreeNAS user, first-time hard drive issue.
Basically, I've got six Seagate Exos X16 drives (ST16000NM002G) in a RAIDZ2 configuration inside a Supermicro 2U chassis with a BPN-SAS3-826EL1 backplane; this setup has been working just fine for six months.
I received an email alert containing:

Device: /dev/da2, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH

so I logged into the machine to check. The syslog is full of CAM error messages originating from one specific drive (/dev/da2):
Code:
Sep 2 00:00:00 nas newsyslog[49055]: logfile turned over due to size>200K
Sep 2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload request received, reloading configuration;
Sep 2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload finished;
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
...
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Info: 0x38595b9a7
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 6
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Actual Retry Count: 255
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep 3 11:16:10 nas.redacted.com collectd[1560]: Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
    temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
    temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
    raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout
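To confirm the errors were confined to one drive, I tallied the CAM sense messages per device. A sketch of that check, run here against a saved excerpt of the log so it stands on its own (on the box itself you would grep /var/log/messages instead):

```shell
# Count SCSI sense errors per daN device in the log.
# Saved excerpt used here; live, something like:
#   grep 'SCSI sense' /var/log/messages | grep -oE '\(da[0-9]+' | sort | uniq -c
log='Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2'

# Pull the daN device name out of each CAM message and tally occurrences.
printf '%s\n' "$log" | grep -oE '\(da[0-9]+' | tr -d '(' | sort | uniq -c
```

In my case every single hit was da2; the other five drives are silent.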
Checking on the specific disk reveals the following:
Code:
# smartctl -a /dev/da2
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST16000NM002G
Revision:             E003
Compliance:           SPC-5
User Capacity:        16,000,900,661,248 bytes [16.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500af1fa897
Serial number:        REDACTED
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Sep 3 11:32:01 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 295
Power on minutes since format <not available>
Current Drive Temperature:     37 C
Drive Trip Temperature:        60 C

Manufactured in week 04 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  109
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  284
Elements in grown defect list: 309

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2066229080
  Blocks received from initiator = 1150438328
  Blocks read from cache and sent to initiator = 1046461852
  Number of read and write commands whose size <= segment size = 57891915
  Number of read and write commands whose size > segment size = 1865688

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 3678.77
  number of minutes until next internal SMART test = 50

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      36208.974           0
write:         0        0         0         0          0       5017.512           3
verify:        0       30         0        30        197         33.307           0

Non-medium error count:        8

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    3677                 - [-   -    -]
# 2  Background short  Completed                   -    3667                 - [-   -    -]
# 3  Background short  Completed                   -    3595                 - [-   -    -]
# 4  Background short  Completed                   -    3547                 - [-   -    -]
# 5  Background short  Completed                   -    3499                 - [-   -    -]
# 6  Background long   Completed                   -    3477                 - [-   -    -]
# 7  Background short  Completed                   -    3451                 - [-   -    -]
# 8  Background short  Completed                   -    3403                 - [-   -    -]
# 9  Background short  Completed                   -    3355                 - [-   -    -]
#10  Background short  Completed                   -    3307                 - [-   -    -]
#11  Background short  Completed                   -    3259                 - [-   -    -]
#12  Background short  Completed                   -    3211                 - [-   -    -]
#13  Background short  Completed                   -    3163                 - [-   -    -]
#14  Background long   Completed                   -    3141                 - [-   -    -]
#15  Background short  Completed                   -    3115                 - [-   -    -]
#16  Background short  Completed                   -    3067                 - [-   -    -]
#17  Background short  Completed                   -    3019                 - [-   -    -]
#18  Background short  Completed                   -    2971                 - [-   -    -]
#19  Background short  Completed                   -    2923                 - [-   -    -]
#20  Background short  Completed                   -    2851                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]
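For anyone comparing against their own drives, these are the fields I keyed on: the health status line, the reassigned block count, and the grown defect list. A sketch of pulling them out, using a saved snippet of the output above so it runs anywhere (live, you would pipe `smartctl -a /dev/da2` in instead):

```shell
# Extract the headline health fields from SAS-style smartctl output.
# Saved snippet from the dump above; live: smart_out=$(smartctl -a /dev/da2)
smart_out='SMART Health Status: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]
Total new blocks reassigned = 295
Elements in grown defect list: 309'

health=$(printf '%s\n' "$smart_out" | sed -n 's/^SMART Health Status: //p')
reassigned=$(printf '%s\n' "$smart_out" | awk -F'= ' '/blocks reassigned/ {print $2}')
defects=$(printf '%s\n' "$smart_out" | awk -F': ' '/grown defect list/ {print $2}')

echo "health:     $health"
echo "reassigned: $reassigned"
echo "defects:    $defects"
```

295 reassigned blocks and 309 grown defects at 3678 power-on hours is what worries me, more than any single log line.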
The ZFS pool seems healthy and reports ONLINE status.
Is the disk broken and should I request a replacement, or is there anything I should try first?