novacrasher
Dabbler
- Joined: Nov 25, 2018
- Messages: 11
Within the last few months I upgraded my NAS from a 2-drive mirror (set up in March 2015) to a RAIDZ1 setup by adding a third HDD (all three drives are 3TB WD Red drives; the newest one is a WD Red Plus). On 30 August I received an email stating:
Pool nasdrive state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
I immediately ordered a new HDD in case one of my older drives was failing. When I had a chance to look into it further, it turned out that the SN was for my newest HDD, which is still covered under warranty.
I turned off the NAS, checked all cable connections, rebooted and on 4 September received the below email message:
New alert:
* Pool nasdrive state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
The following alert has been cleared:
* Pool nasdrive state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
The same HDD SN was being referenced.
Later on 4 September I received another email alert:
New alert:
* Pool nasdrive state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
The following alert has been cleared:
* Pool nasdrive state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
Followed by another email also on 4 September:
The following alert has been cleared:
* Pool nasdrive state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
I tried looking at the SMART data from the drive and unfortunately SMART was turned off for that device (it was enabled for my original 2 devices but not the newest one; rookie move). I enabled SMART testing for this device and also ran some manual short/long tests.
I received many more emails. Below is data from multiple emails:
Device: /dev/ada0, not capable of SMART self-check.
New alerts:
* Device: /dev/ada0, Read SMART Self-Test Log Failed.
New alerts:
* Device: /dev/ada0, Read SMART Error Log Failed.
New alerts:
* Device: /dev/ada0, failed to read SMART Attribute Data.
New alerts:
* Device: /dev/ada0, ATA error count increased from 0 to 1.
* Pool nasdrive state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
- Disk 14854379811529545774 is UNAVAIL
I read some forum posts that suggested swapping cables, so I connected the HDD in question to a new power cable and also swapped SATA cables/positions with another device (the HDD is now on a different SATA port on the mobo).
More emails on 6 and 8 September (note the change in /dev name since the HDD is now plugged into a different SATA port on the mobo).
New alerts:
* Device: /dev/ada3, not capable of SMART self-check.
New alerts:
* Device: /dev/ada3, failed to read SMART Attribute Data.
New alerts:
* Device: /dev/ada3, 1 Currently unreadable (pending) sectors.
Here is the output of smartctl:
Welcome to FreeNAS
Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.
root@MJNAS:~ # smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD30EFZX-68AWUN0
Serial Number: WD-WX32D2122FAD
LU WWN Device Id: 5 0014ee 2beb38a11
Firmware Version: 81.00B81
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Sep 8 12:55:28 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 33) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: (32820) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 349) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 188 188 051 Pre-fail Always - 517
3 Spin_Up_Time 0x0027 202 199 021 Pre-fail Always - 2883
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2510
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 4
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 113 097 000 Old_age Always - 34
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 2433 hours (101 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 08 98 01 40 40 Error: IDNF at LBA = 0x00400198 = 4194712
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 08 98 01 40 40 08 21:59:29.815 WRITE DMA
e5 00 00 00 00 00 40 08 21:59:29.815 CHECK POWER MODE
ea 00 00 00 00 00 40 08 21:59:21.828 FLUSH CACHE EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Interrupted (host reset) 10% 2501 -
# 2 Short offline Interrupted (host reset)
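In case it helps anyone reading the attribute table above, here is a minimal Python sketch that pulls out the rows and lists which of a few counters have a non-zero raw value. The sample rows are copied from my output; the regex, the function names, and the choice of attribute IDs in WATCH are my own assumptions for illustration, not anything smartctl or TrueNAS provides.

```python
import re

# A few rows copied from the smartctl -a output above.
SAMPLE = """\
  1 Raw_Read_Error_Rate 0x002f 188 188 051 Pre-fail Always - 517
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

# ASSUMPTION: counters I would expect to stay at zero on a healthy drive.
WATCH = {5, 196, 197, 198, 199}


def parse_attributes(text):
    """Return {attribute_id: {"name", "value", "thresh", "raw"}}."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m["id"])] = {
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            }
    return rows


def nonzero_watched(rows):
    """Return {id: name} for watched attributes with a non-zero raw value."""
    return {i: r["name"] for i, r in rows.items() if i in WATCH and r["raw"] > 0}


if __name__ == "__main__":
    print(nonzero_watched(parse_attributes(SAMPLE)))
```

On my data this only picks up the pending sector (ID 197). UDMA_CRC_Error_Count is 0, which as far as I understand is the counter that would usually climb with a bad SATA cable, but I may be wrong about that.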
Here is partial output of dmesg (there are a lot of these ATA Status Errors):
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 60 43 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 38 44 2e 40 bb 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 38 44 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 28 45 2e 40 bb 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 28 45 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 88 fb d7 40 b9 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 88 fb d7 00 b9 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 c0 fc d7 40 b9 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 c0 fc d7 00 b9 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 20 fe d7 40 b9 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 20 fe d7 00 b9 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 a8 57 2e 40 bb 00 00 00 0000
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 a8 57 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
ahcich3: Timeout on slot 28 port 0
ahcich3: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd d0 serr 00000000 cmd 0000dc17
(ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada3:ahcich3:0:0:0): CAM status: Command timeout
(ada3:ahcich3:0:0:0): Retrying command, 0 more tries remain
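To see where on the disk these writes are failing, I put together a rough Python sketch that turns the register bytes into LBAs. The LBA28 helper can be checked against smartctl's own decoding in the error log (0x00400198 = 4194712). The byte order I use for the dmesg RES: lines (status, error, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, count, count_exp) is my assumption about how FreeBSD prints them, so treat those numbers with caution. All function names are made up for this example.

```python
import re

# Three lines copied from the dmesg output above.
SAMPLE = """\
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada3:ahcich3:0:0:0): RES: 41 10 60 43 2e 00 bb 00 00 00 00
(ada3:ahcich3:0:0:0): RES: 41 10 88 fb d7 00 b9 00 00 00 00
"""

# Eleven two-digit hex bytes following "RES:".
RES = re.compile(r"RES:\s+((?:[0-9a-f]{2}\s*){11})")


def lba28_from_registers(sn, cl, ch, dh):
    """Combine SN/CL/CH and the low nibble of DH into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba48_from_res(line):
    """Return the 48-bit LBA from a 'RES:' line, or None if there is none.

    ASSUMED byte order: status, error, lba_low, lba_mid, lba_high, device,
    lba_low_exp, lba_mid_exp, lba_high_exp, count, count_exp.
    """
    m = RES.search(line)
    if not m:
        return None
    b = [int(x, 16) for x in m.group(1).split()]
    low, mid, high = b[2], b[3], b[4]
    low_exp, mid_exp, high_exp = b[6], b[7], b[8]
    return (
        (high_exp << 40) | (mid_exp << 32) | (low_exp << 24)
        | (high << 16) | (mid << 8) | low
    )


def failing_lbas(text):
    """Collect the LBA from every RES: line in a dmesg excerpt."""
    lbas = (lba48_from_res(line) for line in text.splitlines())
    return [lba for lba in lbas if lba is not None]


if __name__ == "__main__":
    # Registers from SMART error 1: SN=98 CL=01 CH=40 DH=40
    print(hex(lba28_from_registers(0x98, 0x01, 0x40, 0x40)))
    print([hex(lba) for lba in failing_lbas(SAMPLE)])
```

If my assumed byte order is right, the failing writes are spread over a couple of different regions of the disk rather than sitting at one spot.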
So do I actually have a failing device? What data do I need to provide to WD for a warranty claim? I bought the drive through Amazon on 9 Sept 2021 but just got around to installing it a few months ago.
NAS Specs below:
TrueNAS-12.0-U8.1
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz
2x Crucial Ballistix Sport 8GB 240-Pin DDR3 SDRAM DDR
ASRock H97M-ITX/ac LGA 1150 Intel H97 HDMI SATA 6G
Thank you for the help and sorry for the long post!