disk got FAULTED after REPLACE during resilvering

b7842 · Jun 2, 2017

Story: I got this message yesterday.
Device: /dev/ada6, Self-Test Log error count increased from 0 to 1
So I ran smartctl -t long /dev/ada6 twice,
and got the following result:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 164
3 Spin_Up_Time 0x0027 178 177 021 Pre-fail Always - 8066
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 87
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3621
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 87
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 33
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 99
194 Temperature_Celsius 0x0022 106 103 000 Old_age Always - 46
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 7
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 36

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 3618 8713320
# 2 Extended offline Completed: read failure 60% 3616 2901536408
# 3 Short offline Completed without error 00% 3595 -
# 4 Short offline Completed without error 00% 3565 -
# 5 Short offline Completed without error 00% 3541 -
# 6 Short offline Completed without error 00% 3517 -
# 7 Short offline Completed without error 00% 3493 -
# 8 Short offline Completed without error 00% 3469 -
# 9 Extended offline Completed without error 00% 3455 -
#10 Short offline Completed without error 00% 3422 -
#11 Short offline Completed without error 00% 3398 -
#12 Short offline Completed without error 00% 3374 -
#13 Short offline Completed without error 00% 3350 -
#14 Short offline Completed without error 00% 3326 -
#15 Short offline Completed without error 00% 3302 -
#16 Extended offline Completed without error 00% 3288 -
#17 Short offline Completed without error 00% 3259 -
#18 Short offline Completed without error 00% 3235 -
#19 Short offline Completed without error 00% 3211 -
#20 Short offline Completed without error 00% 3194 -
#21 Short offline Completed without error 00% 3162 -
-------------------------------------------------------------------------------------------------
After that, I replaced ada6 with a spare WD Red 4 TB (not brand new).
However, the spare went FAULTED after the REPLACE, while resilvering.
[root@freenas] ~# zpool status -v
pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 26 03:45:40 2017
config:

NAME STATE READ WRITE CKSUM
freenas-boot ONLINE 0 0 0
da0p2 ONLINE 0 0 0

errors: No known data errors

pool: zfs_v9_10_2
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Jun 3 10:09:16 2017
2.02T scanned out of 5.90T at 159M/s, 7h6m to go
60.8G resilvered, 34.34% done
config:

NAME STATE READ WRITE CKSUM
zfs_v9_10_2 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
gptid/e2e50d98-d09f-11e6-8afa-d05099c192fa ONLINE 0 0 0
gptid/e3a11c0a-d09f-11e6-8afa-d05099c192fa ONLINE 0 0 0
gptid/e452f8df-d09f-11e6-8afa-d05099c192fa ONLINE 0 0 0
gptid/e5031b67-d09f-11e6-8afa-d05099c192fa ONLINE 0 0 0
gptid/e5ba8685-d09f-11e6-8afa-d05099c192fa ONLINE 0 0 0
gptid/4a88bff8-4789-11e7-91bc-a0369fa1458c FAULTED 0 73 0 too many errors (resilvering)
gptid/e73bb249-d09f-11e6-8afa-d05099c192fa ONLINE 0 0 0
gptid/e7f77bad-d09f-11e6-8afa-d05099c192fa ONLINE 0 0 0

errors: No known data errors
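
As a sanity check on the resilver figures in the status output above, here is a minimal sketch of the arithmetic. It assumes that zpool's "T" and "M" are binary units (TiB and MiB/s) and that the reported rate stays constant; the variable names are only illustrative.

```python
# Figures copied from the `zpool status` scan lines in this post.
scanned_tib = 2.02   # "2.02T scanned"
total_tib = 5.90     # "out of 5.90T"
rate_mib_s = 159     # "at 159M/s"

# Assumption: T = TiB and M = MiB (powers of 1024).
MIB_PER_TIB = 1024 * 1024

remaining_s = (total_tib - scanned_tib) * MIB_PER_TIB / rate_mib_s
hours, rem = divmod(int(remaining_s), 3600)
minutes = rem // 60
percent_done = 100 * scanned_tib / total_tib

print(f"{hours}h{minutes}m to go, {percent_done:.2f}% scanned")
```

Under those assumptions the estimate comes out at 7h6m, matching the output. The percentage lands near 34.2% rather than the reported 34.34%, presumably because the displayed sizes are rounded.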
-------------------------------------------------------------------------------------------------
Here is the SMART output of the spare 4 TB Red:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 35
3 Spin_Up_Time 0x0027 203 178 021 Pre-fail Always - 6808
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 171
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 198 197 000 Old_age Always - 96
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5084
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 149
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 114
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 5634
194 Temperature_Celsius 0x0022 107 106 000 Old_age Always - 45
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
ATA Error Count: 38 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 38 occurred at disk power-on lifetime: 5082 hours (211 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 80 00 40 40 Error: UNC 1 sectors at LBA = 0x00400080 = 4194432

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 80 00 40 40 08 11:47:50.787 READ DMA
c8 00 01 80 00 40 40 08 11:47:43.647 READ DMA
c8 00 01 80 00 40 40 08 11:47:36.507 READ DMA
c8 00 01 80 00 40 40 08 11:47:29.367 READ DMA
c8 00 01 80 00 40 40 08 11:47:22.226 READ DMA

Error 37 occurred at disk power-on lifetime: 5082 hours (211 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 80 00 40 40 Error: UNC 1 sectors at LBA = 0x00400080 = 4194432

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 80 00 40 40 08 11:47:43.647 READ DMA
c8 00 01 80 00 40 40 08 11:47:36.507 READ DMA
c8 00 01 80 00 40 40 08 11:47:29.367 READ DMA
c8 00 01 80 00 40 40 08 11:47:22.226 READ DMA
c8 00 01 81 00 40 40 08 11:47:15.085 READ DMA

Error 36 occurred at disk power-on lifetime: 5082 hours (211 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 80 00 40 40 Error: UNC 1 sectors at LBA = 0x00400080 = 4194432

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 80 00 40 40 08 11:47:36.507 READ DMA
c8 00 01 80 00 40 40 08 11:47:29.367 READ DMA
c8 00 01 80 00 40 40 08 11:47:22.226 READ DMA
c8 00 01 81 00 40 40 08 11:47:15.085 READ DMA
c8 00 01 81 00 40 40 08 11:47:07.945 READ DMA

Error 35 occurred at disk power-on lifetime: 5082 hours (211 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 80 00 40 40 Error: UNC 1 sectors at LBA = 0x00400080 = 4194432

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 80 00 40 40 08 11:47:29.367 READ DMA
c8 00 01 80 00 40 40 08 11:47:22.226 READ DMA
c8 00 01 81 00 40 40 08 11:47:15.085 READ DMA
c8 00 01 81 00 40 40 08 11:47:07.945 READ DMA
c8 00 01 81 00 40 40 08 11:47:00.805 READ DMA

Error 34 occurred at disk power-on lifetime: 5082 hours (211 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 80 00 40 40 Error: UNC 1 sectors at LBA = 0x00400080 = 4194432

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 80 00 40 40 08 11:47:22.226 READ DMA
c8 00 01 81 00 40 40 08 11:47:15.085 READ DMA
c8 00 01 81 00 40 40 08 11:47:07.945 READ DMA
c8 00 01 81 00 40 40 08 11:47:00.805 READ DMA
c8 00 01 81 00 40 40 08 11:46:53.666 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4806 -
# 2 Extended offline Aborted by host 80% 4797 -
# 3 Short offline Completed without error 00% 4753 -
# 4 Short offline Completed without error 00% 4751 -
# 5 Short offline Completed without error 00% 4746 -
# 6 Short offline Completed without error 00% 4705 -
# 7 Short offline Completed without error 00% 4699 -
# 8 Short offline Completed without error 00% 4223 -
# 9 Short offline Completed without error 00% 4219 -
#10 Short offline Completed without error 00% 4217 -
#11 Short offline Completed without error 00% 2790 -
#12 Short offline Completed without error 00% 1791 -
#13 Short offline Completed without error 00% 211 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
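
The error log entries above report the failing sector both as registers and as an LBA. A small sketch of how the two relate, assuming 28-bit LBA addressing (READ DMA, command c8, is a 28-bit command); the function name is made up for illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    sn, cl, ch hold LBA bits 0-7, 8-15 and 16-23;
    the low nibble of dh holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the error entries: SN=80 CL=00 CH=40 DH=40
lba = lba28_from_registers(0x80, 0x00, 0x40, 0x40)
print(hex(lba), lba)  # 0x400080 4194432
```

All five logged errors point at the same sector, LBA 4194432, which the drive kept retrying roughly every seven seconds.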
------------------------------------------------------------------------------------------------------------------------------
Questions:
1) I guess the old disk at ada6 may have a lot of pending sectors, and the spare 4 TB Red is not a healthy disk either. Am I correct?
2) Should I shut down the machine and remove the spare 4 TB Red? (Resilvering is still running.)

SweetAndLow · Jun 3, 2017

Yes, you need to get another disk. Both look to be dying.

Offline the faulted disk. Shut down the system. Get a new disk. Remove the old disk. Install the new disk. Start the system. Replace the offline disk with the new disk in the GUI.

b7842 · Jun 3, 2017

I have bad luck. After zero-filling the spare, 10 reallocated sectors were found.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 49
3 Spin_Up_Time 0x0027 204 178 021 Pre-fail Always - 6766
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 173
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 10
7 Seek_Error_Rate 0x002e 199 197 000 Old_age Always - 102
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5101
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 151
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 115
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 5635
194 Temperature_Celsius 0x0022 119 106 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 190 190 000 Old_age Always - 10
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1

rs225 · Jun 3, 2017

10 reallocated sectors is not bad luck. It means 10 sectors were successfully reallocated. The drive may be suitable for use as long as you keep watching it. If those numbers keep growing, or pending sectors start growing, it would be best to replace it and stop using it. It would be better to use the drive with 10 reallocated sectors than the drive that is FAULTED with 73 write errors.
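
One way to "keep watching it" is to compare the raw values of a few attributes between two smartctl runs. The following is only a sketch: the choice of watched attributes and the helper names are assumptions, and the sample rows are the spare drive's figures posted earlier in this thread (before and after the zero fill).

```python
# Attributes whose raw value should stay flat on a healthy drive
# (an illustrative selection, not an official list).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_raw_values(smart_text):
    """Return {attribute_name: raw_value} from smartctl attribute rows."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip rows whose raw value is not a plain integer
    return values


def growth(before, after):
    """Return the watched attributes whose raw value increased, with the delta."""
    return {
        name: after.get(name, 0) - before.get(name, 0)
        for name in WATCHED
        if after.get(name, 0) > before.get(name, 0)
    }


BEFORE = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

AFTER = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 10
196 Reallocated_Event_Count 0x0032 190 190 000 Old_age Always - 10
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

changes = growth(parse_raw_values(BEFORE), parse_raw_values(AFTER))
print(changes)  # {'Reallocated_Sector_Ct': 10, 'Reallocated_Event_Count': 10}
```

An empty result between two runs means none of the watched counters moved. Any non-empty result is the "numbers keep growing" case described above.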

DrKK · Jun 3, 2017

2x4TB WD Red at Newegg this weekend, I believe it's $259 for 2.

Stux · Jun 3, 2017

Reallocated sectors on an under-warranty drive are cause for an RMA.

Did you burn in the drives?

b7842 · Jun 4, 2017

For the first one, pending sectors had already been found, so there was no need to zero-fill it to uncover hidden bad sectors.
For the spare, I zero-filled it and it reallocated the hidden bad sectors.
Also, I called the local warranty service to collect both 4 TB Red drives next week.

joeschmuck · Jun 4, 2017

If your data is not backed up, you should at least back up the important stuff. You could be at risk of data loss even before you get a replacement drive.
