Warranty Claim for potentially failing Drive

novacrasher · Sep 8, 2022

Within the last few months I upgraded my NAS from a two-drive mirror (set up in March 2015) to a RAIDZ1 pool by adding a third HDD. All three drives are 3 TB WD Reds; the newest one is a WD Red Plus. On 30 August I received an email stating:

Pool nasdrive state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

I immediately ordered a new HDD in case one of my older drives was failing. When I had a chance to look into it further, it turned out that the serial number belonged to my newest HDD, which is still under warranty.

I turned off the NAS, checked all cable connections, rebooted and on 4 September received the below email message:

New alert:
* Pool nasdrive state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

The following alert has been cleared:
* Pool nasdrive state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

The same HDD serial number was being referenced.

Later on 4 September I received another email alert:

New alert:
* Pool nasdrive state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

The following alert has been cleared:
* Pool nasdrive state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

Followed by another email also on 4 September:

The following alert has been cleared:
* Pool nasdrive state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

I tried looking at the SMART data from the drive and unfortunately SMART was turned off for that device (it was enabled for my original 2 devices but not the newest one; rookie move). I enabled SMART testing for this device and also ran some manual short/long tests.

I received many more emails. Below is data from multiple emails:

Device: /dev/ada0, not capable of SMART self-check.

New alerts:
* Device: /dev/ada0, Read SMART Self-Test Log Failed.

New alerts:
* Device: /dev/ada0, Read SMART Error Log Failed.

New alerts:
* Device: /dev/ada0, failed to read SMART Attribute Data.

New alerts:
* Device: /dev/ada0, ATA error count increased from 0 to 1.

* Pool nasdrive state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk 14854379811529545774 is UNAVAIL

I read through some forum posts that suggested swapping cables, so I connected the HDD in question to a new power cable and also swapped SATA cables/positions with another device (the HDD is now on a different SATA port on the mobo).

More emails on 6 and 8 September (note the change in /dev name since the HDD is now plugged into a different SATA port on the mobo).

New alerts:
* Device: /dev/ada3, not capable of SMART self-check.

New alerts:
* Device: /dev/ada3, failed to read SMART Attribute Data.

New alerts:
* Device: /dev/ada3, 1 Currently unreadable (pending) sectors.
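The pending-sector alert corresponds to SMART attribute 197 (Current_Pending_Sector). As a rough sketch, the attribute rows that smartctl prints could be checked like this. The sample rows are taken from the smartctl output in this post; the set of watched IDs and the function name are my own choices for illustration, not an official list or tool.

```python
import re

# Rows copied from the smartctl attribute table in this post.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
"""

# Attribute IDs often watched for media trouble (an assumption, not an official list).
WATCHED_IDS = {5, 197, 198}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(1)) in WATCHED_IDS and int(m.group(3)) > 0:
            hits[m.group(2)] = int(m.group(3))
    return hits

print(nonzero_watched(SAMPLE))  # {'Current_Pending_Sector': 1}
```

A single pending sector is not proof of failure on its own, but together with the other alerts it is worth tracking whether the count grows.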

Here is the output of smartctl:

Welcome to FreeNAS

Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

root@MJNAS:~ # smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: WDC WD30EFZX-68AWUN0
Serial Number: WD-WX32D2122FAD
LU WWN Device Id: 5 0014ee 2beb38a11
Firmware Version: 81.00B81
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Sep 8 12:55:28 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 33) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: (32820) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 349) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   188   188   051    Pre-fail  Always       -       517
  3 Spin_Up_Time            0x0027   202   199   021    Pre-fail  Always       -       2883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2510
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   113   097   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 2433 hours (101 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 08 98 01 40 40 Error: IDNF at LBA = 0x00400198 = 4194712

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 08 98 01 40 40 08 21:59:29.815 WRITE DMA
e5 00 00 00 00 00 40 08 21:59:29.815 CHECK POWER MODE
ea 00 00 00 00 00 40 08 21:59:21.828 FLUSH CACHE EXT

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Interrupted (host reset) 10% 2501 -
# 2 Short offline Interrupted (host reset)
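
For anyone wondering how smartctl gets "LBA = 0x00400198 = 4194712" from the register dump in the error log: WRITE DMA is a 28-bit command, so the address is spread across the SN, CL, CH registers and the low four bits of DH. A small worked sketch of that arithmetic, plus the hours-to-days conversion from the "Error 1" line (the function name is hypothetical):

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23,
    and the low nibble of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register values from the error log row: SN=98 CL=01 CH=40 DH=40 (hex).
lba = lba28_from_registers(0x98, 0x01, 0x40, 0x40)
print(hex(lba), lba)  # 0x400198 4194712

# Power-on lifetime at the time of the error, converted to (days, hours).
print(divmod(2433, 24))  # (101, 9)
```

Both results match what smartctl printed, so the log itself is internally consistent.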

Here is partial output from dmesg (there are a lot of these ATA status errors):

(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 60 43 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 38 44 2e 40 bb 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 38 44 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 28 45 2e 40 bb 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 28 45 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 88 fb d7 40 b9 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 88 fb d7 00 b9 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 c0 fc d7 40 b9 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 c0 fc d7 00 b9 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 20 fe d7 40 b9 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 20 fe d7 00 b9 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 a8 57 2e 40 bb 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 a8 57 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
ahcich3: Timeout on slot 28 port 0
ahcich3: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd d0 serr 00000000 cmd 0000dc17
(ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada3:ahcich3:0:0:0): CAM status: Command timeout
(ada3:ahcich3:0:0:0): Retrying command, 0 more tries remain
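
The status and error bytes in these dmesg lines are bit fields. Below is a minimal sketch of decoding them, assuming the legacy ATA bit names (newer spec revisions and FreeBSD's own printout rename a few bits, for example 0x10 in the status byte). The tables and function names are mine for illustration, not taken from any FreeBSD tool.

```python
import re

# Bit names follow the legacy ATA register layout (an assumption; some
# bits carry different names in later spec revisions).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF"}

def decode(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in names.items() if value & bit]

def decode_dmesg_line(line):
    """Pull the status and error bytes out of an 'ATA status:' dmesg line."""
    m = re.search(r"ATA status: ([0-9a-f]{2}) \(.*?\), error: ([0-9a-f]{2})", line)
    if m is None:
        return None
    status, error = int(m.group(1), 16), int(m.group(2), 16)
    return decode(status, STATUS_BITS), decode(error, ERROR_BITS)

line = "(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )"
print(decode_dmesg_line(line))  # (['DRDY', 'ERR'], ['IDNF'])
```

IDNF means the drive reported that it could not find the requested sector address, and here it is being returned for write commands.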

So do I actually have a failing device? What data do I need to provide to WD for a warranty claim? I bought the drive through Amazon on 9 Sept 2021 but just got around to installing it a few months ago.

NAS Specs below:
TrueNAS-12.0-U8.1
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz
2x Crucial Ballistix Sport 8GB 240-Pin DDR3 SDRAM DDR
ASRock H97M-ITX/ac LGA 1150 Intel H97 HDMI SATA 6G

Thank you for the help and sorry for the long post!

NugentS · Sep 8, 2022

You need to take the drive out (I am a little concerned about your "upgrade mirror to Z1")*, put it into a Windows machine, and run the WD utilities against it. This will provide the evidence you need for WD to RMA the drive.

*I assume you meant trashed the mirror pool and then rebuilt it as Z1

Can you post the output of zpool status in code blocks, please, before you remove the drive?

novacrasher · Sep 8, 2022

Upgrade is poor wording. You are correct, I backed up my data, trashed the mirror pool and then created a new Z1 pool with the three drives.

Here is the output of the zpool status.

Code:


  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:01:26 with 0 errors on Thu Sep  8 03:46:26 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors

  pool: nasdrive
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 48.5M in 00:00:05 with 0 errors on Mon Sep  5 11:25:31 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        nasdrive                                        ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/7c576001-cccc-11ec-8b0c-d05099630559  ONLINE      40     0     0
            gptid/7c83ece1-cccc-11ec-8b0c-d05099630559  ONLINE       0     0     0
            gptid/7c8f4a4c-cccc-11ec-8b0c-d05099630559  ONLINE       0     0     0

errors: No known data errors
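
The one thing that stands out in that output is the READ count of 40 on the first gptid. As a rough sketch, the config rows of zpool status could be scanned for nonzero counters like this. The function name is hypothetical, and the parsing assumes plain integer counters (zpool may abbreviate large counts, which this sketch simply skips).

```python
# Config rows from the nasdrive pool shown in this post.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        nasdrive                                        ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/7c576001-cccc-11ec-8b0c-d05099630559  ONLINE      40     0     0
            gptid/7c83ece1-cccc-11ec-8b0c-d05099630559  ONLINE       0     0     0
            gptid/7c8f4a4c-cccc-11ec-8b0c-d05099630559  ONLINE       0     0     0
"""

def devices_with_errors(config_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in config_text.splitlines():
        fields = line.split()
        # Expect exactly: NAME STATE READ WRITE CKSUM; skip the header row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _state, *counts = fields
        # Assumption: counters are plain integers; abbreviated values are skipped.
        if not all(c.isdigit() for c in counts):
            continue
        read, write, cksum = (int(c) for c in counts)
        if read or write or cksum:
            flagged[name] = (read, write, cksum)
    return flagged

print(devices_with_errors(SAMPLE))
```

For the output above this flags only gptid/7c576001-cccc-11ec-8b0c-d05099630559 with 40 read errors.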

A couple of days ago I ran 'zpool clear', but more errors occurred, resulting in the pool's current state.

I will remove the drive and run those WD tools. Great suggestion.

NugentS · Sep 9, 2022

According to that you have a single drive in your Z1 Pool. Hopefully you have missed some stuff out.......
Aha - more stuff has appeared this time.
OK - 3 disks (only one was appearing last time I logged in)

novacrasher · Sep 9, 2022

*zpool status edited/updated

I also received a few more email alerts last night:

Device /dev/gptid/7c576001-cccc-11ec-8b0c-d05099630559 is causing slow I/O on pool nasdrive.

and

Device: /dev/ada3, ATA error count increased from 1 to 8.

NugentS · Sep 9, 2022

Have you:
1. Taken the disk out and tested it
2. Posted your hardware setup properly?

novacrasher · Sep 11, 2022

I was unable to find the proper tools through WD to test the drive. I called customer service and explained the SMART errors I was receiving and asked what further information I needed to provide. They issued me an RMA number on the spot and I am packaging my drive up for return to WD.

A couple of frustrating things to note. I had to pay for shipping back to WD. It is a discounted UPS rate, but still kind of annoying. The shipping instructions were also pretty specific and required 2 inches of bubble wrap and an ESD bag. Fortunately I had just purchased another HDD, since I thought one of my older drives had died, so I was able to use that ESD bag.

Thanks again for all the help and prompt replies!

ChrisRJ · Sep 12, 2022

That RMA procedure is comparable to what I experienced with Seagate for their Exos drives here in Germany.
