Drive faulted due to errors, next steps?

Status
Not open for further replies.

jnyl42

Dabbler
Joined
Dec 1, 2014
Messages
16
My (fairly new) 10-disk RAIDZ2 just disabled a drive due to persistent errors (6 read errors, 69 write errors according to volume status). This is the first time I've had this issue since putting the system to use. What are my next steps?

- Can I run some diagnostics to make sure it's the disk going bad, and not just a cable or something? It's a relatively new build, ~4 months old.

- Should I go ahead and order/buy a replacement disk? Will I have compatibility issues if I replace the WD Red with an HGST Deskstar NAS?

Any other considerations?

Thanks for the help. The user manual only says how to replace the disk.
 

enemy85

Guru
Joined
Jun 10, 2011
Messages
757
First see what a smartctl -a /dev/ada××× shows...
 

jnyl42

Dabbler
Joined
Dec 1, 2014
Messages
16
enemy85 said:
First see what a smartctl -a /dev/ada××× shows...

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-[snip]
LU WWN Device Id: 5 0014ee 209b9d7f4
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Apr 3 02:31:13 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (41640) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 418) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   170   170   021    Pre-fail  Always       -       6466
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       989
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7638
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       203
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       97
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2497
194 Temperature_Celsius     0x0022   128   107   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   195   000    Old_age   Always       -       1145
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 7637 -
# 2 Short offline Completed without error 00% 7589 -
# 3 Short offline Completed without error 00% 7565 -
# 4 Short offline Completed without error 00% 7517 -
# 5 Short offline Completed without error 00% 7469 -
# 6 Short offline Completed without error 00% 7421 -
# 7 Short offline Completed without error 00% 7373 -
# 8 Extended offline Completed without error 00% 7358 -
# 9 Short offline Completed without error 00% 7325 -
#10 Short offline Completed without error 00% 7277 -
#11 Short offline Completed without error 00% 7230 -
#12 Short offline Completed without error 00% 7181 -
#13 Short offline Completed without error 00% 7133 -
#14 Short offline Completed without error 00% 7085 -
#15 Short offline Completed without error 00% 7037 -
#16 Short offline Completed without error 00% 6990 -
#17 Short offline Completed without error 00% 6942 -
#18 Short offline Completed without error 00% 6894 -
#19 Short offline Completed without error 00% 6847 -
#20 Short offline Completed without error 00% 6799 -
#21 Short offline Completed without error 00% 6755 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

enemy85

Guru
Joined
Jun 10, 2011
Messages
757
I would first check the SATA and power cables. You don't have any pending or offline-uncorrectable sectors either, so if it were me, I would keep an eye on that disk but wait before replacing it.
In any case, wait for someone else's suggestion before deciding.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I agree ;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
CRC errors are characteristic of cable issues. Make sure all SATA connections are less than 1m in length.
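
For anyone who wants to script this check, here is a minimal Python sketch that pulls the raw values out of pasted `smartctl -a` attribute rows and separates the link-side counter (ID 199) from the media-side counters (IDs 5, 197, 198). The grouping into "link" and "media" sets, the regex, and the helper names are assumptions made for illustration, not anything smartctl defines, and a nonzero counter is only a hint about where to look first, not a diagnosis.

```python
import re

# Grouping of attribute IDs is an assumption of this sketch,
# not an official smartctl classification.
LINK_ATTRS = {199}            # UDMA_CRC_Error_Count
MEDIA_ATTRS = {5, 197, 198}   # reallocated, pending, offline-uncorrectable

# Matches rows like: "199 UDMA_CRC_Error_Count 0x0032 200 195 000 Old_age Always - 1145"
# Only rows whose raw value is a plain integer at the end of the line are matched.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+.*\s(\d+)\s*$")


def parse_attributes(text):
    """Return {id: (name, raw_value)} for each attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return attrs


def nonzero(attrs, ids):
    """Return {name: raw_value} for the given IDs whose raw value is above zero."""
    return {attrs[i][0]: attrs[i][1] for i in ids if i in attrs and attrs[i][1] > 0}


# Rows copied from the smartctl output posted earlier in this thread.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 195 000 Old_age Always - 1145
"""

attrs = parse_attributes(SAMPLE)
link_hits = nonzero(attrs, LINK_ATTRS)
media_hits = nonzero(attrs, MEDIA_ATTRS)
print("link-side counters:", link_hits)
print("media-side counters:", media_hits)
```

With the rows from this thread, only the CRC counter comes back nonzero, which is why the cabling is the first thing to check here.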
 

jnyl42

Dabbler
Joined
Dec 1, 2014
Messages
16
Damn. I rebooted to put the drive back online, had some checksum errors but seemed to be ok... but it just faulted and degraded the volume again.

Ericloewe said:
CRC errors are characteristic of cable issues. Make sure all SATA connections are less than 1m in length.

It's a 2ft SAS to SATA breakout cable... but it's some no-name brand. I could try replacing it but I have no idea where to find a good one.

Is it possible to switch the drive to a different SATA port?
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
jnyl42 said:
Is it possible to switch the drive to a different SATA port?

Sure. FreeNAS uses gptids, so it doesn't matter which physical port you plug the drive into.
 

jnyl42

Dabbler
Joined
Dec 1, 2014
Messages
16
Thanks for all the help. I tried 3 different SATA cables on different ports, but it never made it more than 30 minutes before the disk would go offline again. I went ahead and replaced the disk. WDs have a nice warranty so I'll get it serviced and keep it as a spare.
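
For anyone repeating the cable-swap test: the raw value of attribute 199 is normally a running total that does not reset, so what matters after a swap is whether it keeps climbing. A rough Python sketch of that comparison follows. The function name is made up for illustration, the baseline is the value from the output posted above, and the "later" reading is a hypothetical placeholder, not a measurement from this system.

```python
def crc_errors_since(baseline_raw, current_raw):
    """Return how many new CRC errors were logged since the baseline reading."""
    if current_raw < baseline_raw:
        # A cumulative counter should not go down on the same drive.
        raise ValueError("counter went down; readings may be from different drives")
    return current_raw - baseline_raw


baseline = 1145  # raw value of attribute 199 from the output posted in this thread
later = 1145     # HYPOTHETICAL reading taken some time after the cable swap

new_errors = crc_errors_since(baseline, later)
print("new CRC errors since swap:", new_errors)
```

If the difference stays at zero under load, the new cable and port are probably fine. If it keeps growing on every cable and port, the problem is more likely on the drive or controller side.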
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
@jnyl42

Your power cycle count is 203 while your start/stop count is 989. So either your disk is having to spin down and up because of errors or you are putting your disk to sleep.

*IF* you are putting your disk to sleep, it is possible that the disk isn't waking up in time before ZFS times out and fails the drive. This is one of many reasons why I don't recommend sleeping disks.

Just something to think about...
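
The comparison is just a subtraction of two raw SMART values, sketched below in Python. The function name is made up for illustration, and the numbers are the raw values of attributes 4 and 12 from the smartctl output posted earlier in this thread.

```python
def extra_spin_cycles(start_stop_count, power_cycle_count):
    """Return start/stop events that are not explained by a full power cycle."""
    return start_stop_count - power_cycle_count


# Raw values of Start_Stop_Count (4) and Power_Cycle_Count (12) from the thread.
extra = extra_spin_cycles(start_stop_count=989, power_cycle_count=203)
print(extra)  # 786
```

On a disk that never spins down while powered, the two counters would be expected to track each other closely, so a gap this large is worth explaining.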
 

jnyl42

Dabbler
Joined
Dec 1, 2014
Messages
16
By putting the disk to sleep, do you mean the "HDD Standby" setting? All my disks have that set to "Always On".

I just checked a good disk, and its power cycle and start/stop counts match perfectly, so I don't think the disks are sleeping.

Appreciate your input though as always CJ :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That is what I mean. You can also change the APM value as some drives will spin-down based on that number.

Yeah, just replace the disk. It's clearly not doing well.
 