Is my drive going bad?

Status
Not open for further replies.

Vinh

Dabbler
Joined
Apr 5, 2016
Messages
12
Hi, I'm fairly new to FreeNAS. It had been running fine until recently. I have been getting a bunch of these errors:
CAM status: Uncorrectable parity/CRC error
(ada2:ahcich7:0:0:0): Retrying command.

I have tried swapping cables and SATA ports. The errors seem to go away for a while but then come back. Today I got a critical email with:
The volume Data (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

It looks like ada2 is the culprit again. I am running 3 WD Red 3TB drives in a ZFS RAID. Any help would be great. Thanks.

Here is the smartctl -a output for ada2:

Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N5LLP6NE
LU WWN Device Id: 5 0014ee 2620c9d8b
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Apr 13 18:56:17 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 398) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   179   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       28
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2468
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       28
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       36
194 Temperature_Celsius     0x0022   121   120   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       14
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 2467 hours (102 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 80 70 94 a6 49  Error: IDNF at LBA = 0x09a69470 = 161911920

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 80 70 94 a6 49 00   7d+21:10:54.726  WRITE DMA
  ef 02 00 00 00 00 40 00   7d+21:10:54.726  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00   7d+21:10:54.726  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00   7d+21:10:54.726  SET MULTIPLE MODE
  ef 10 02 00 00 00 40 00   7d+21:10:54.726  SET FEATURES [Enable SATA feature]

Error 1 occurred at disk power-on lifetime: 2466 hours (102 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 80 f0 84 a6 49  Error: IDNF at LBA = 0x09a684f0 = 161907952

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 80 f0 84 a6 49 00   7d+20:46:10.527  WRITE DMA
  ef 02 00 00 00 00 40 00   7d+20:46:10.527  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00   7d+20:46:10.527  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00   7d+20:46:10.527  SET MULTIPLE MODE
  ef 10 02 00 00 00 40 00   7d+20:46:10.527  SET FEATURES [Enable SATA feature]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
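
As a cross-check on the error log above, the LBA that smartctl prints can be rebuilt from the register dump. This is a minimal Python sketch, assuming 28-bit LBA addressing (which matches the WRITE DMA command shown in the log); the function name is made up for illustration and is not part of smartmontools.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from ATA task-file registers.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble
    of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the two error entries in the log above.
error2 = lba28_from_registers(sn=0x70, cl=0x94, ch=0xA6, dh=0x49)
error1 = lba28_from_registers(sn=0xF0, cl=0x84, ch=0xA6, dh=0x49)
print(hex(error2), error2)  # 0x9a69470 161911920
print(hex(error1), error1)  # 0x9a684f0 161907952
```

Both results match the "Error: IDNF at LBA" lines, and the two failing writes land only 3968 sectors apart.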
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Your drive is probably fine. It's suffering from interface issues, which are almost always bad SATA cables.

Replace the SATA cables (keep them as short as is viable and always under 1m) and we'll work from that point.
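
One way to judge whether a cable swap helped is to compare the raw value of attribute 199 before and after. A small Python sketch, assuming the raw value is a lifetime counter that only increases (typical for this attribute, but not guaranteed for every drive); the function name and the later readings are hypothetical.

```python
def crc_errors_since(baseline: int, current: int) -> int:
    """New CRC errors since the baseline reading.

    Assumes the raw value of attribute 199 is a lifetime counter, so
    any growth after the swap means fresh link errors.
    """
    if current < baseline:
        raise ValueError("counter went down; readings may be from different drives")
    return current - baseline


# 14 is the raw value from the smartctl output above; the later
# readings are made-up numbers for illustration.
baseline = 14
print(crc_errors_since(baseline, 14))  # 0: no new errors since the swap
print(crc_errors_since(baseline, 19))  # 5: errors are still accumulating
```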
 

russnas

Contributor
Joined
May 31, 2013
Messages
113
What hardware are you running FreeNAS on? The drive has only been powered on for about 100 days, so it should not be failing yet.

199 UDMA_CRC_Error_Count: data sent to or received from the drive failed its CRC check in transit (the raw value is 14 here).
For now, just keep an eye on the drive.
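
To make keeping an eye on the drive a bit more concrete, here is a rough Python sketch that pulls raw values out of the attribute table printed by smartctl and flags the pattern seen in this thread: CRC errors on the link while the media counters are clean. The function names and the rule of thumb are illustrative assumptions, not part of smartmontools.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def raw_values(smart_text: str) -> dict[int, int]:
    """Map attribute ID -> raw value for each attribute row found."""
    values = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if match:
            values[int(match.group(1))] = int(match.group(3))
    return values


def looks_like_link_problem(values: dict[int, int]) -> bool:
    """Heuristic (an assumption): CRC errors (199) with no reallocated (5),
    pending (197), or offline-uncorrectable (198) sectors."""
    media_clean = all(values.get(attr, 0) == 0 for attr in (5, 197, 198))
    return media_clean and values.get(199, 0) > 0


# Rows copied from the smartctl output earlier in the thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       14
"""

values = raw_values(SAMPLE)
print(values[199])                      # 14
print(looks_like_link_problem(values))  # True
```

A result of True only points at the cable, port, or controller as the more likely suspect; it does not prove the drive itself is healthy.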
 

Vinh

Dabbler
Joined
Apr 5, 2016
Messages
12
Thanks for the input, Ericloewe. I have already tried multiple cables.

russnas, I am running 8 GB of Corsair DDR3 with an Intel i5 CPU on an Intel motherboard. I have 3 WD Red 3TB drives. Last night the drive was removed from the RAID again. I think the SATA controller on the HDD may be going bad, because the drive doesn't always show up in my BIOS but the other 2 drives do.
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Note: i5 systems are not recommended, since they don't support ECC RAM.

Do you have any spare SATA ports on your motherboard? If so, move the SATA cable to one of the unused ones.

What motherboard (make/model) do you have?
 

Vinh

Dabbler
Joined
Apr 5, 2016
Messages
12
I know ECC RAM is recommended, but I am not running ECC RAM. I have an Intel DP67BG board. I have tried moving the drive to all of my other empty SATA ports (3 of them) and the problem persists.
 

russnas

Contributor
Joined
May 31, 2013
Messages
113
Did you find the problem? I think you're right about the SATA controller; the drive is spitting out errors.

Did you check all the blocks on the drive beforehand in Windows? I would RMA it and have them check it.
 

Vinh

Dabbler
Joined
Apr 5, 2016
Messages
12
I think I found the problem: I think the chipset controlling the SATA ports on my mobo went bad. It has 2 SATA ports that transfer at 6 Gb/s and 4 at 3 Gb/s. The 2 at 6 Gb/s work fine but the other 4 don't. I have moved to a different mobo now, so we will see what happens.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are we talking early Sandy Bridge stuff?

Those PCHs had a serious bug in their SATA 3Gb/s ports and had to be replaced by a new stepping.
 

Vinh

Dabbler
Joined
Apr 5, 2016
Messages
12
I think it's pre-Sandy Bridge. I think I have an Ivy Bridge processor. So is it an issue with the CPU? Can I swap that out, or is it the board?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ivy Bridge is Sandy Bridge's die shrink.

It's quite possible you ended up with one of the B2 stepping PCHs from the initial Sandy Bridge rollout.

In any case, you'd have to replace the motherboard or get a PCI-e HBA.
 